KR102559658B1

KR102559658B1 - Scheduling method and apparatus thereof

Info

Publication number: KR102559658B1
Application number: KR1020210071066A
Authority: KR
Inventors: 윤찬현; 김우중
Original assignee: 한국과학기술원; 국방과학연구소
Priority date: 2020-12-16
Filing date: 2021-06-01
Publication date: 2023-07-26
Also published as: KR20220086449A

Abstract

A scheduling method and apparatus are provided. The scheduling method according to the present disclosure may include: determining, for each accelerator node, a minimum processing time required to process a plurality of components included in a deep learning model; determining, based on the minimum processing time, the energy consumed by each accelerator node in processing the plurality of components; determining, based on the minimum processing time and a preset processing time limit, the maximum allocation data that each accelerator node can process; determining the energy cost efficiency of each accelerator node based on the consumed energy and the maximum allocation data; and selecting, by comparing the energy cost efficiencies of the accelerator nodes with one another, at least one accelerator node to which input data is to be allocated.

Description

Scheduling method and apparatus {SCHEDULING METHOD AND APPARATUS THEREOF}

The present disclosure relates to a scheduling method and apparatus.

A high-performance computing environment is a computing environment that enables high-complexity processing of large-scale data. Recently, clusters equipped with accelerators such as GPUs and FPGAs have been used for large-scale deep learning workloads. Clusters in a homogeneous computing environment were mainly used in the past, but the use of clusters in a heterogeneous computing environment, which combine heterogeneous accelerators such as GPUs and FPGAs, is increasing.

A high-performance computing environment supporting heterogeneous accelerators refers to a cluster environment composed not only of CPUs but also of accelerators suited to deep learning model processing, such as GPUs and FPGAs. Deep learning models achieve breakthrough accuracy in data analysis but are highly complex, so general-purpose CPU-level computing performance processes them slowly. Heterogeneous accelerators such as FPGAs and GPUs are therefore commonly used for fast processing.

Scheduling technology plans in advance, and then carries out, how a given task is to be processed in a given computing environment according to a specific goal. A fixed schedule is sometimes used in a fixed environment, but because the system environment to which scheduling is applied usually changes dynamically, it is effective to apply adaptive scheduling that follows those changes. Such adaptive scheduling aims to minimize processing cost while satisfying a given processing time limit.

An object of some embodiments of the present disclosure is to provide a scheduling method and apparatus that, in consideration of a high-performance computing environment supporting heterogeneous accelerators, accelerate the processing of multiple requests to a deep learning model by sequentially distributing the processing task of each internal component to the computing resources of an accelerator node.

As a technical means for achieving the above technical object, a first aspect of the present disclosure may provide a scheduling method including: determining, for each accelerator node, a minimum processing time required to process a plurality of components included in a deep learning model; determining, based on the minimum processing time, the energy consumed by each accelerator node in processing the plurality of components; determining, based on the minimum processing time and a preset processing time limit, the maximum allocation data that each accelerator node can process; determining the energy cost efficiency of each accelerator node based on the consumed energy and the maximum allocation data; and selecting, by comparing the energy cost efficiencies of the accelerator nodes with one another, at least one accelerator node to which input data is to be allocated.

Determining, for each accelerator node, the minimum processing time required to process the plurality of components included in the deep learning model may include: determining a per-accelerator processing time for each component such that the time the plurality of accelerators included in the accelerator node take to process that component is the same for every accelerator; determining that per-accelerator processing time as the minimum processing time for the component; and summing the minimum processing times of all components to obtain the minimum processing time required to process the plurality of components.

The per-accelerator processing time for each component includes a data processing time and may further include at least one of an input data transfer time and an output data transfer time.

Determining, based on the minimum processing time, the energy consumed by each accelerator node in processing the plurality of components may include: determining the energy consumed by the CPU included in each accelerator node based on the active power and idle power of that CPU, the pre-processing time of the CPU, and the minimum processing time required to process the plurality of components; and determining the energy consumed by the plurality of accelerators included in each accelerator node based on the active power and idle power of each accelerator and the per-accelerator processing time for each component.

Selecting, by comparing the energy cost efficiencies of the accelerator nodes with one another, at least one accelerator node to which input data is to be allocated may consist of selecting accelerator nodes sequentially, starting from the accelerator node with the lowest energy cost efficiency, so that all of the input data can be processed within the preset processing time limit.

In addition, a second aspect of the present disclosure may provide a scheduling apparatus including a memory and at least one processor, wherein the at least one processor: determines, for each accelerator node, a minimum processing time required to process a plurality of components included in a deep learning model; determines, based on the minimum processing time, the energy consumed by each accelerator node in processing the plurality of components; determines, based on the minimum processing time and a preset processing time limit, the maximum allocation data that each accelerator node can process; determines the energy cost efficiency of each accelerator node based on the consumed energy and the maximum allocation data; and selects, by comparing the energy cost efficiencies of the accelerator nodes with one another, at least one accelerator node to which input data is to be allocated.


In addition, a third aspect of the present disclosure may provide a computer-readable recording medium containing instructions that cause a processor to perform a scheduling method including: determining, for each accelerator node, a minimum processing time required to process a plurality of components included in a deep learning model; determining, based on the minimum processing time, the energy consumed by each accelerator node in processing the plurality of components; determining, based on the minimum processing time and a preset processing time limit, the maximum allocation data that each accelerator node can process; determining the energy cost efficiency of each accelerator node based on the consumed energy and the maximum allocation data; and selecting, by comparing the energy cost efficiencies of the accelerator nodes with one another, at least one accelerator node to which input data is to be allocated.

In addition, a fourth aspect of the present disclosure may provide a computer program containing instructions that cause a processor to perform a scheduling method including: determining, for each accelerator node, a minimum processing time required to process a plurality of components included in a deep learning model; determining, based on the minimum processing time, the energy consumed by each accelerator node in processing the plurality of components; determining, based on the minimum processing time and a preset processing time limit, the maximum allocation data that each accelerator node can process; determining the energy cost efficiency of each accelerator node based on the consumed energy and the maximum allocation data; and selecting, by comparing the energy cost efficiencies of the accelerator nodes with one another, at least one accelerator node to which input data is to be allocated.

According to the present disclosure, it is possible to provide a scheduling method and apparatus that, in consideration of a high-performance computing environment supporting heterogeneous accelerators, accelerate the processing of multiple requests to a deep learning model by sequentially distributing the processing task of each internal component to the computing resources of an accelerator node.

FIG. 1 is a conceptual diagram illustrating how a scheduling apparatus according to some embodiments allocates input data to each accelerator node.
FIG. 2 is a flowchart illustrating a scheduling method according to some embodiments.
FIG. 3 is a diagram showing the speed and time at which an accelerator node according to some embodiments processes the components included in a deep learning model.
FIG. 4 is a diagram showing the energy consumed when an accelerator node according to some embodiments processes the components included in a deep learning model.
FIG. 5 is a block diagram illustrating the configuration of a scheduling apparatus according to some embodiments.

Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can readily practice them. The present disclosure may, however, be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed between them. In addition, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

Hereinafter, the present disclosure is described in detail with reference to the accompanying drawings.

FIG. 1 is a conceptual diagram illustrating how a scheduling apparatus according to some embodiments allocates input data to each accelerator node.

Referring to FIG. 1, each request to a deep learning model according to some embodiments is given as input data, and the scheduling device 10 may use a heterogeneous accelerator cluster 12 consisting of $M$ accelerator nodes $[S_1, \ldots, S_i, \ldots, S_M]$ to process the total input data $D$ of a plurality of requests with the deep learning model.

Each accelerator node included in the heterogeneous accelerator cluster 12 may include at least one accelerator. The scheduling device 10 according to some embodiments may use an accelerator node allocation strategy $X = [x_1, \ldots, x_i, \ldots, x_M]$ to allocate the total input data $D$ to the accelerator nodes. Here, $x_i \in \{0, 1\}$ indicates whether data is allocated to accelerator node $S_i$. According to the allocation strategy $X$, allocation data $[D_1, \ldots, D_i, \ldots, D_M]$ may be assigned to the respective accelerator nodes.

A deep learning model according to some embodiments may consist of a plurality of components $[DL_1, \ldots, DL_k, \ldots, DL_K]$. The 0th component $DL_0$, a pre-processing task performed before the deep learning model is executed, may be processed by the CPU of each accelerator node. The plurality of components constituting the deep learning model may then be processed sequentially at each accelerator node.

The allocation data $D_i$ given to accelerator node $S_i$ is input to component $DL_0$. In the present disclosure, the input data of component $DL_k$ is defined as $D_k^i$ and its output data as $\hat{D}_k^i$, and the $D_i$ input to component $DL_0$ may be written as $D_0^i$. Once the allocation data $D_i$ and the structure of the deep learning model are determined, $D_k^i$ and $\hat{D}_k^i$ take fixed values.

In the present disclosure, the time $pt^i$ that accelerator node $S_i$ takes to process its allocation data $D_i$ may be defined by the following equation.

[Equation 1]

$$pt^i = pt_0^i + \sum_{k=1}^{K} pt_k^i$$

In Equation 1, $pt_0^i$ is the time the CPU of accelerator node $S_i$, the i-th accelerator node, takes to pre-process $D_0^i$ for the deep learning model, and $pt_k^i$ is the time required to process component $DL_k$ of the deep learning model for $D_k^i$ at the k-th accelerator of accelerator node $S_i$.

FIG. 2 is a flowchart illustrating a scheduling method according to some embodiments.

Referring to FIG. 2, in step S201 the scheduling device 10 may determine, for each accelerator node, the minimum processing time required to process the plurality of components included in the deep learning model.

According to some embodiments, the scheduling device 10 may determine the per-accelerator processing time for a component such that the time the plurality of accelerators included in each accelerator node take to process the component is the same for every accelerator. The scheduling device 10 may then take this per-accelerator processing time as the minimum processing time for the component.

Once the minimum processing time for each component is determined, the scheduling device 10 may sum the minimum processing times of all components to obtain the minimum processing time required to process the plurality of components.

The specific method by which the scheduling device 10 according to some embodiments determines, for each accelerator node, the minimum processing time required to process the plurality of components included in the deep learning model is described later with reference to FIG. 3.

Once the minimum processing time required to process the plurality of components is determined, in step S202 the scheduling device 10 may determine, based on the minimum processing time, the energy consumed by each accelerator node in processing the plurality of components.

According to some embodiments, the scheduling device 10 may determine the energy consumed by the CPU of each accelerator node based on the active power and idle power of that CPU, the pre-processing time of the CPU, and the minimum processing time required to process the plurality of components.

Similarly, the scheduling device 10 may determine the energy consumed by the plurality of accelerators included in each accelerator node based on the active power and idle power of each accelerator and the per-accelerator processing time for each component.

Once the energy consumed by the CPU and the energy consumed by the plurality of accelerators are determined, the scheduling device 10 may sum them to determine the energy consumed by each accelerator node in processing the plurality of components.

The method by which the scheduling device 10 according to some embodiments determines, based on the minimum processing time, the energy consumed by each accelerator node in processing the plurality of components is described later with reference to FIG. 4.

Once the energy consumed by each accelerator node in processing the plurality of components is determined, in step S203 the scheduling device 10 may determine, based on the minimum processing time and the preset processing time limit, the maximum allocation data that each accelerator node can process.

Once the maximum allocation data that each accelerator node can process is determined, in step S204 the scheduling device 10 may determine the energy cost efficiency of each accelerator node based on the consumed energy and the maximum allocation data.

Once the energy cost efficiency of each accelerator node is determined, in step S205 the scheduling device 10 may compare the energy cost efficiencies of the accelerator nodes with one another and select at least one accelerator node to which input data is to be allocated.

According to some embodiments, the scheduling device 10 may select accelerator nodes sequentially, starting from the accelerator node with the lowest energy cost efficiency, so that all of the input data can be processed within the preset processing time limit.
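The sequential selection in step S205 can be sketched as a greedy loop. This is a minimal illustration, not the patented implementation: it assumes that energy cost efficiency is energy per unit of data (so a lower value is better), and the names `Node`, `energy_cost_efficiency`, and `select_nodes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    energy: float    # E_i: energy consumed to process max_data (assumed known)
    max_data: float  # D_i: maximum data processable within the time limit


def energy_cost_efficiency(node: Node) -> float:
    # Assumption: efficiency is energy per unit of data, lower is better.
    return node.energy / node.max_data


def select_nodes(nodes: list[Node], total_data: float) -> list[str]:
    """Pick nodes in ascending energy-per-data order until their
    combined capacity covers total_data."""
    selected = []
    remaining = total_data
    for node in sorted(nodes, key=energy_cost_efficiency):
        if remaining <= 0:
            break
        selected.append(node.name)
        remaining -= node.max_data
    if remaining > 0:
        raise ValueError("cluster cannot process all input data within the time limit")
    return selected
```

With three hypothetical nodes whose energy per unit of data is 2.0, 1.5, and 3.0, the loop picks the 1.5 node first, then the 2.0 node, and stops as soon as the selected capacity covers the total input data.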

FIG. 3 is a diagram showing the speed and time at which an accelerator node according to some embodiments processes the components included in a deep learning model.

Accelerator node $S_i$, having received allocation data $D_i$, sequentially processes the plurality of components included in the deep learning model. The scheduling device 10 distributes $D_k^i$, the data given for processing component $DL_k$, across $N_i$ accelerators. In the present disclosure, the data distributed to the $N_i$ accelerators of accelerator node $S_i$ is denoted $[D_k^{i,1}, \ldots, D_k^{i,j}, \ldots, D_k^{i,N_i}]$, which satisfies $\sum_{j=1}^{N_i} D_k^{i,j} = D_k^i$.

In the present disclosure, the processing speed (throughput) of the $N_i$ accelerators for the plurality of components ($DL_1$ to $DL_K$) is denoted $[TH_k^{i,1}, \ldots, TH_k^{i,j}, \ldots, TH_k^{i,N_i}]$. The times $[cl_k^{i,1}, \ldots, cl_k^{i,j}, \ldots, cl_k^{i,N_i}]$ that the $N_i$ accelerators take to process their respective shares $[D_k^{i,1}, \ldots, D_k^{i,j}, \ldots, D_k^{i,N_i}]$ may be calculated as follows.

[Equation 2]

$$cl_k^{i,j} = \frac{D_k^{i,j}}{TH_k^{i,j}}, \quad j = 1, \ldots, N_i$$

As described above, $pt_k^i$ is the time required to process component $DL_k$ of the deep learning model for $D_k^i$ at the k-th accelerator of accelerator node $S_i$, the i-th accelerator node. In detail, $pt_k^i$ may be calculated as in the following equation.

[Equation 3]

$$pt_k^i = ti_k^i + cl_k^i + to_k^i$$

According to Equation 3, when accelerator node $S_i$ processes the input data $D_k^i$ for component $DL_k$, the processing time is defined as the sum of the input data transfer time $ti_k^i$, the data processing time $cl_k^i$, and the output data transfer time $to_k^i$.

Assuming that the PCI Express bandwidth $BW_{pci}$ between the main memory of accelerator node $S_i$ and the global memory of each accelerator is considerably higher than the throughput of each accelerator, $pt_k^i$ of accelerator node $S_i$ for input data $D_k^i$ and component $DL_k$ can be approximated by $cl_k^i$, as in the following equation.

[Equation 4]

$$pt_k^i \approx cl_k^i$$

Because output data is transferred only after every accelerator has finished processing its share $[D_k^{i,1}, \ldots, D_k^{i,j}, \ldots, D_k^{i,N_i}]$, the data processing time $cl_k^i$ is determined by the largest of the per-accelerator processing times $cl_k^{i,j}$. According to some embodiments, the minimum execution time $pt_k^{*i}$ is achieved when the accelerators' processing times for their respective shares are all equal.

The minimum execution time $pt_k^{*i}$ of accelerator node $S_i$ for input data $D_k^i$ and component $DL_k$ may be derived as in the following equation.

[Equation 5]

$$pt_k^{*i} = \frac{D_k^i}{\sum_{j=1}^{N_i} TH_k^{i,j}}$$
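The equal-finish-time condition can be illustrated with a short sketch. If every accelerator must finish at the same time $t$, then $D_k^{i,j} = t \cdot TH_k^{i,j}$, and summing over $j$ gives $t = D_k^i / \sum_j TH_k^{i,j}$, so each share is proportional to its accelerator's throughput. The function names and the throughput figures below are hypothetical and chosen only for illustration.

```python
def split_by_throughput(data: float, throughputs: list[float]) -> list[float]:
    """Split `data` in proportion to throughput so that all
    accelerators finish at the same time."""
    total = sum(throughputs)
    return [data * th / total for th in throughputs]


def component_time(shares: list[float], throughputs: list[float]) -> float:
    # The component finishes only when the slowest accelerator finishes,
    # so the component time is the largest per-accelerator time.
    return max(d / th for d, th in zip(shares, throughputs))


# Hypothetical node with two accelerators (data units per second).
throughputs = [100.0, 300.0]
shares = split_by_throughput(800.0, throughputs)      # proportional split
balanced = component_time(shares, throughputs)        # equal finish times
even = component_time([400.0, 400.0], throughputs)    # naive even split
```

In this example the proportional split finishes in 2 time units, whereas an even split is held back by the slower accelerator and takes 4.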

FIG. 4 is a diagram showing the energy consumed when an accelerator node according to some embodiments processes the components included in a deep learning model.

The energy $E_i$ that accelerator node $S_i$ consumes in processing the plurality of components of the deep learning model is the sum of the CPU active energy, the CPU idle energy, and the GPU active energy (the GPU idle energy is relatively small and can be neglected), and may be calculated as in the following equation.

[Equation 6]

In Equation 6, $P_{act}^{i,0}$ is the active power of the CPU for performing pre-processing (i.e., input image batching and decoding) at accelerator node $S_i$, $P_{idl}^{i,0}$ is the idle CPU power for operations other than pre-processing at accelerator node $S_i$, and $P_{k,act}^{i,j \neq 0}$ is the active power of the j-th accelerator for processing component $DL_k$ at accelerator node $S_i$.

$pt_0^i$ is the CPU pre-processing time required to perform pre-processing at accelerator node $S_i$, $pt_k^i$ is the total processing time required to process component $DL_k$ at accelerator node $S_i$, and $cl_k^{i,j}$ is the processing time of the j-th accelerator required to process component $DL_k$ at accelerator node $S_i$.
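One way to combine the terms just defined is sketched below. It is an assumption about how the quantities fit together, not the source's Equation 6: CPU active power is weighted by the pre-processing time, CPU idle power by the total component processing time, and each accelerator's active power by its own processing time, that is, $E_i \approx P_{act}^{i,0}\, pt_0^i + P_{idl}^{i,0} \sum_k pt_k^i + \sum_k \sum_j P_{k,act}^{i,j}\, cl_k^{i,j}$. The function name and argument layout are hypothetical.

```python
def node_energy(
    p_cpu_active: float,               # P_act^{i,0}
    p_cpu_idle: float,                 # P_idl^{i,0}
    pt_pre: float,                     # pt_0^i
    pt_components: list[float],        # pt_k^i for k = 1..K
    p_acc_active: list[list[float]],   # P_{k,act}^{i,j}, indexed [k][j]
    cl: list[list[float]],             # cl_k^{i,j}, indexed [k][j]
) -> float:
    """Assumed energy model: CPU active + CPU idle + accelerator active.
    Accelerator idle energy is neglected, as the text suggests."""
    cpu_active = p_cpu_active * pt_pre
    cpu_idle = p_cpu_idle * sum(pt_components)
    accelerators = sum(
        power * time
        for powers, times in zip(p_acc_active, cl)
        for power, time in zip(powers, times)
    )
    return cpu_active + cpu_idle + accelerators
```

With power in watts and time in seconds the result is in joules. Under this assumed model, a node's energy grows with the data it is given because every time term scales with the allocated data.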

The goal of the scheduling device 10 of the present disclosure is to minimize energy cost while satisfying a service level objective when processing the deep learning model for the total input data D. The scheduling device 10 according to some embodiments defines the service level objective as a processing time limit (deadline), denoted L^Inf in the present disclosure.

The allocated data D_i, that is, the maximum amount of data that accelerator node S_i can process within the processing time limit L^Inf, can be calculated as follows.

[Equation 7]

In Equation 7, D_i = D_0^i, and the input D_k^i to component DL_k of the deep learning model can be expressed as D_k^i = σ_k * D^i, where σ_k is the unit data amount of component DL_k per input datum, which is fixed by the structure of the deep learning model.
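Equation 7 is not reproduced in this text. As a rough illustration of the idea only, the sketch below finds the largest amount of data whose total processing time still fits within the deadline, using a binary search. It assumes the total time is non-decreasing in the amount of data; `max_allocated_data`, `node_time`, and every constant in the example are hypothetical stand-ins rather than the patent's formula.

```python
from typing import Callable


def max_allocated_data(
    total_time: Callable[[int], float],  # assumed: time to process d inputs on this node
    deadline: float,                     # L^Inf
    upper: int,                          # search cap, e.g. the total input data D
) -> int:
    """Largest d in [0, upper] with total_time(d) <= deadline.

    Assumes total_time is non-decreasing in d. Returns 0 when not even
    a single input fits within the deadline.
    """
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if total_time(mid) <= deadline:
            lo = mid
        else:
            hi = mid - 1
    return lo


if __name__ == "__main__":
    sigma = (1, 4)  # hypothetical unit data amounts sigma_k per input

    def node_time(d: int) -> float:
        # Hypothetical linear model: 0.5 s of CPU preprocessing per input,
        # plus 0.125 s per unit of component data D_k = sigma_k * d.
        return 0.5 * d + sum(0.125 * s * d for s in sigma)

    print(max_allocated_data(node_time, deadline=10.0, upper=1000))  # 8
```

With a strictly linear time model the same value could be obtained in closed form; the search is used here only because it makes no assumption beyond monotonicity.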

To determine the accelerator node allocation strategy X = [x_1, …, x_i, …, x_M], a scheduling device according to some other embodiments may formulate an optimization problem with X as the decision variable and use the following objective function.

[Equation 8]

However, because solving this objective function may be NP-hard (non-deterministic polynomial-time hard), the scheduling device 10 according to some embodiments may use a heuristic algorithm that finds a suboptimal solution, thereby reducing the computational complexity of determining the accelerator node allocation strategy X = [x_1, …, x_i, …, x_M].

The scheduling device 10 using the heuristic algorithm may calculate the energy cost efficiency of every accelerator node as follows.

[Equation 9]

In Equation 9, the allocated data D_i is, as described above with reference to Equation 7, the maximum amount of data that accelerator node S_i can process within the processing time limit L^Inf. The ratio between the energy E_i that accelerator node S_i consumes to process the plurality of components of the deep learning model and the allocated data D_i corresponds to the energy cost efficiency of accelerator node S_i.
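Read literally, the ratio described in the preceding paragraph could be written as below. This is a sketch, not the patent's Equation 9: the symbol \eta_i is introduced here only for illustration, and the orientation of the ratio (energy over data, so that a smaller value means less energy per unit of data) is an assumption.

```latex
\eta_i = \frac{E_i}{D_i}, \qquad i = 1, \dots, M
```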

The scheduling device 10 according to some embodiments may sequentially select at least one accelerator node from among the accelerator nodes included in the heterogeneous accelerator cluster 12, starting from the accelerator node with the lowest energy cost efficiency value, under the condition that all of the input data can be processed within the predetermined processing time limit. The scheduling device 10 may then allocate input data only to the selected accelerator nodes to accelerate the processing of multiple requests for the deep learning model.
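The selection step described above can be sketched as a simple greedy loop. The sketch assumes that the energy cost efficiency of a node is its consumed energy divided by its maximum allocated data, and that the last selected node receives only the remaining data; neither detail is confirmed by the text. The names `select_nodes`, `gpu-a`, `fpga-b`, and `gpu-c` and all numbers are hypothetical.

```python
from typing import List, Tuple

# (node name, consumed energy E_i, maximum allocated data D_i within the deadline)
Node = Tuple[str, float, int]


def select_nodes(nodes: List[Node], total_data: int) -> List[Tuple[str, int]]:
    """Greedily pick nodes in ascending order of E_i / D_i until all data is placed.

    Returns (node name, allocated data) pairs. Raises ValueError when the
    cluster cannot process total_data within the deadline.
    """
    # Nodes that cannot process anything within the deadline are skipped.
    usable = [n for n in nodes if n[2] > 0]
    ranked = sorted(usable, key=lambda n: n[1] / n[2])

    allocation: List[Tuple[str, int]] = []
    remaining = total_data
    for name, _energy, max_data in ranked:
        if remaining <= 0:
            break
        share = min(max_data, remaining)  # assumption: last node gets the remainder
        allocation.append((name, share))
        remaining -= share

    if remaining > 0:
        raise ValueError("input data cannot be processed within the time limit")
    return allocation


if __name__ == "__main__":
    cluster = [
        ("gpu-a", 900.0, 300),   # 3.0 energy units per input
        ("fpga-b", 400.0, 200),  # 2.0 energy units per input
        ("gpu-c", 1000.0, 250),  # 4.0 energy units per input
    ]
    print(select_nodes(cluster, total_data=350))
    # [('fpga-b', 200), ('gpu-a', 150)]
```

Sorting dominates the cost, so the heuristic runs in O(M log M) for M nodes, which is the kind of reduction in complexity the description attributes to avoiding the NP-hard formulation.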

FIG. 5 is a block diagram illustrating the configuration of a scheduling device according to some embodiments.

Referring to FIG. 5, the scheduling device 10 according to some embodiments may include a memory 101 and a processor 103.

The memory 101 may store a program for controlling the operation of the scheduling device 10. The memory 101 may include at least one instruction for controlling the operation of the scheduling device 10. Programs stored in the memory 101 may be classified into a plurality of modules according to their functions.

The memory 101 according to some embodiments may store information about the deep learning model, the accelerator node allocation strategy, the input data, and the allocated data.

The memory 101 may include, for example, at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card-type memory (e.g., SD or XD memory), RAM (Random Access Memory), SRAM (Static Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), a magnetic memory, a magnetic disk, and an optical disk, but is not limited thereto.

The processor 103 may control the overall operation of the scheduling device 10 by executing the programs stored in the memory 101.

The processor 103 according to some embodiments may determine, for each accelerator node, the minimum processing time required to process the plurality of components included in the deep learning model.

The processor 103 according to some embodiments may determine a per-accelerator processing time for each component such that the plurality of accelerators included in each accelerator node all take the same time to process that component. The processor 103 may determine the per-accelerator processing time for each component as the minimum processing time for that component, and may determine the sum of the minimum processing times of all components as the minimum processing time required to process the plurality of components.
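To make the equal-time idea concrete, the following sketch splits each component's data across the accelerators of one node so that they all finish together, then sums the resulting per-component times. It assumes each accelerator processes a component at a constant rate (inputs per second) and ignores input and output transfer times, which the patent's own derivation may include. `balanced_split`, `min_processing_time`, and the sample rates are hypothetical.

```python
from typing import List, Sequence, Tuple


def balanced_split(data: float, rates: Sequence[float]) -> Tuple[List[float], float]:
    """Split `data` so that every accelerator finishes at the same time.

    With constant rates, share_j / rate_j is equal for all j exactly when
    share_j is proportional to rate_j. Returns (shares, common finish time).
    """
    total_rate = sum(rates)
    shares = [data * rate / total_rate for rate in rates]
    return shares, data / total_rate


def min_processing_time(
    data_per_component: Sequence[float],      # D_k^i for each component k
    rates_per_component: Sequence[Sequence[float]],  # rate of accelerator j on component k
) -> float:
    """Sum of the equalized per-component times (components run one after another)."""
    return sum(
        balanced_split(data, rates)[1]
        for data, rates in zip(data_per_component, rates_per_component)
    )


if __name__ == "__main__":
    shares, finish = balanced_split(120.0, [10.0, 30.0])
    print(shares, finish)  # [30.0, 90.0] 3.0
    print(min_processing_time([120.0, 80.0], [[10.0, 30.0], [20.0, 20.0]]))  # 5.0
```

Equalizing finish times is what makes the per-component time minimal under this model: if one accelerator finished earlier, moving some data to it from a slower one would shorten the overall time.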

The processor 103 according to some embodiments may determine, based on the minimum processing time, the energy consumed in processing the plurality of components at each accelerator node.

The processor 103 according to some embodiments may determine the energy consumption of the CPU included in each accelerator node based on the active power and idle power of that CPU, the preprocessing time of the CPU, and the minimum processing time required to process the plurality of components. The processor 103 may determine the energy consumption of the plurality of accelerators included in each accelerator node based on the active power and idle power of each accelerator and the per-accelerator processing time for each component. The processor 103 may determine the energy consumed in processing the plurality of components at each accelerator node by summing the energy consumption of the CPU and the energy consumption of the plurality of accelerators included in that accelerator node.

The processor 103 according to some embodiments may determine the maximum allocated data that each accelerator node can process based on the minimum processing time and the predetermined processing time limit, and may determine the energy cost efficiency of each accelerator node based on the consumed energy and the maximum allocated data.

The processor 103 according to some embodiments may compare the energy cost efficiencies of the accelerator nodes with each other and select, from among the accelerator nodes, at least one accelerator node to which input data is to be allocated.

The processor 103 according to some embodiments may select the at least one accelerator node to which input data is to be allocated by sequentially selecting accelerator nodes, starting from the accelerator node with the lowest energy cost efficiency value, such that all of the input data can be processed within the predetermined processing time limit.

The processor 103 according to some embodiments may, for example, perform artificial intelligence operations. The processor 103 may be, for example, any one of a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an NPU (Neural Processing Unit), an FPGA (Field Programmable Gate Array), and an ASIC (Application Specific Integrated Circuit), but is not limited thereto.

Some embodiments may also be implemented in the form of a recording medium containing computer-executable instructions, such as program modules executed by a computer. Computer-readable media can be any available media that can be accessed by a computer, and include both volatile and nonvolatile media and both removable and non-removable media.

Computer-readable media may also include computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The foregoing description of the present disclosure is illustrative, and a person of ordinary skill in the art to which the present disclosure pertains will understand that it can easily be modified into other specific forms without changing the technical spirit or essential features of the present disclosure. The embodiments described above should therefore be understood as illustrative in all respects and not limiting. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present disclosure is indicated by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present disclosure.

10: scheduling device
12: heterogeneous accelerator cluster

Claims

A scheduling method, performed by a scheduling device, for distributing input data to a plurality of accelerator nodes each including a deep learning model composed of a plurality of components, the method comprising:
determining a minimum processing time required to process the plurality of components at each accelerator node;
determining consumed energy consumed in processing the plurality of components at each accelerator node based on the minimum processing time;
determining maximum allocated data that can be processed by each accelerator node based on the minimum processing time and a predetermined processing time limit;
determining an energy cost efficiency of each accelerator node based on the consumed energy and the maximum allocated data; and
comparing the energy cost efficiencies of the accelerator nodes with each other and selecting, from among the accelerator nodes, at least one accelerator node to which the input data is to be allocated.

The scheduling method of claim 1,
wherein determining the minimum processing time required to process the plurality of components comprises:
determining a per-accelerator processing time for each component such that the plurality of accelerators included in each accelerator node take the same time to process that component;
determining the per-accelerator processing time for each component as the minimum processing time for that component; and
determining the minimum processing time by summing the minimum processing times of the respective components.

The scheduling method of claim 2,
wherein the per-accelerator processing time for each component
comprises at least one of a data processing time, an input data transmission time, and an output data transmission time.

The scheduling method of claim 2,
wherein determining the consumed energy consumed in processing the plurality of components comprises:
determining energy consumption of a CPU included in each accelerator node based on active power and idle power of the CPU, a preprocessing time of the CPU, and the minimum processing time; and
determining energy consumption of the plurality of accelerators based on active power and idle power of each accelerator of each accelerator node and the per-accelerator processing time.

The scheduling method of claim 1,
wherein selecting the at least one accelerator node to which the input data is to be allocated comprises
sequentially selecting accelerator nodes, starting from the accelerator node having the lowest energy cost efficiency among the accelerator nodes, such that all of the input data can be processed within the predetermined processing time limit.

A scheduling device for distributing input data to a plurality of accelerator nodes each including a deep learning model composed of a plurality of components, the scheduling device comprising:
a memory; and
at least one processor,
wherein the at least one processor is configured to:
determine a minimum processing time required to process the plurality of components at each accelerator node, determine consumed energy consumed in processing the plurality of components at each accelerator node based on the minimum processing time, determine maximum allocated data that can be processed by each accelerator node based on the minimum processing time and a predetermined processing time limit, determine an energy cost efficiency of each accelerator node based on the consumed energy and the maximum allocated data, and compare the energy cost efficiencies of the accelerator nodes with each other to select, from among the accelerator nodes, at least one accelerator node to which the input data is to be allocated.

The scheduling device of claim 6,
wherein the at least one processor
determines a per-accelerator processing time for each component such that the plurality of accelerators included in each accelerator node take the same time to process that component, determines the per-accelerator processing time for each component as the minimum processing time for that component, and determines the minimum processing time by summing the minimum processing times of the respective components.

The scheduling device of claim 7,
wherein the per-accelerator processing time for each component
comprises at least one of a data processing time, an input data transmission time, and an output data transmission time.

The scheduling device of claim 7,
wherein the at least one processor,
to determine the consumed energy consumed in processing the plurality of components, determines energy consumption of a CPU included in each accelerator node based on active power and idle power of the CPU, a preprocessing time of the CPU, and the minimum processing time, and determines energy consumption of the plurality of accelerators based on active power and idle power of each accelerator of each accelerator node and the per-accelerator processing time.

The scheduling device of claim 6,
wherein the at least one processor
selects the at least one accelerator node to which the input data is to be allocated by sequentially selecting accelerator nodes, starting from the accelerator node having the lowest energy cost efficiency among the accelerator nodes, such that all of the input data can be processed within the predetermined processing time limit.

A computer-readable recording medium storing a computer program,
the computer program comprising instructions for causing a processor to perform a scheduling method comprising:
determining a minimum processing time required to process a plurality of components at each of a plurality of accelerator nodes, each of which includes a deep learning model composed of the plurality of components;
determining consumed energy consumed in processing the plurality of components at each accelerator node based on the minimum processing time;
determining maximum allocated data that can be processed by each accelerator node based on the minimum processing time and a predetermined processing time limit;
determining an energy cost efficiency of each accelerator node based on the consumed energy and the maximum allocated data; and
comparing the energy cost efficiencies of the accelerator nodes with each other and selecting, from among the accelerator nodes, at least one accelerator node to which input data is to be allocated.

A computer program stored on a computer-readable recording medium,
the computer program comprising instructions for causing a processor to perform a scheduling method comprising:
determining a minimum processing time required to process a plurality of components at each of a plurality of accelerator nodes, each of which includes a deep learning model composed of the plurality of components;
determining consumed energy consumed in processing the plurality of components at each accelerator node based on the minimum processing time;
determining maximum allocated data that can be processed by each accelerator node based on the minimum processing time and a predetermined processing time limit;
determining an energy cost efficiency of each accelerator node based on the consumed energy and the maximum allocated data; and
comparing the energy cost efficiencies of the accelerator nodes with each other and selecting, from among the accelerator nodes, at least one accelerator node to which input data is to be allocated.