KR20220064338A

KR20220064338A - Method and apparatus for lightweight and parallelization of accelerator task scheduling

Info

Publication number: KR20220064338A
Application number: KR1020210154797A
Authority: KR
Inventors: 전병곤; 유경인; 권우석
Original assignee: 서울대학교산학협력단
Priority date: 2020-11-11
Filing date: 2021-11-11
Publication date: 2022-05-18

Abstract

Disclosed are a method for lightweighting and parallelizing accelerator operation scheduling and an electronic device performing the same. According to an embodiment, the method comprises the steps of: pre-running a deep learning model with sample input data having a predetermined data shape; and generating a scheduling result through the pre-run.

Description

Method and apparatus for lightweighting and parallelizing accelerator operation scheduling

The following embodiments relate to a method and apparatus for lightweighting and parallelizing accelerator operation scheduling.

As artificial intelligence (AI) technology advances, the need for dedicated AI hardware is growing. AI can, for example, perform inference and training through specific operations. Accordingly, various devices are being developed as dedicated hardware for implementing and running AI.

Dedicated hardware for AI may be implemented by, for example, a central processing unit (CPU) or a graphics processing unit (GPU), or by a repurposable field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).

A method for lightweighting and parallelizing accelerator operation scheduling according to an embodiment includes: transforming a deep learning model based on an operator-to-stream mapping algorithm; pre-running the transformed deep learning model with sample input data having a predetermined data shape; and generating a scheduling result through the pre-run.

The method according to an embodiment may further include: receiving input data; and performing a deep learning operation on the input data based on the scheduling result, without separately scheduling for the input data.

The pre-running may include: recording accelerator operation execution requests issued during the pre-run; and recording accelerator memory allocation or release requests issued during the pre-run.

The scheduling may include: generating an accelerator operation execution request record based on the recorded accelerator operation execution requests; and allocating memory to the accelerator based on the recorded accelerator memory allocation or release requests.

The deep learning model may be represented as a graph composed of nodes, each denoting an operator of the deep learning model, and edges, each denoting a relationship between operators.

The transforming of the deep learning model may include: converting the deep learning model into a minimum equivalent graph; generating a bipartite graph for the minimum equivalent graph; determining a maximum matching of the bipartite graph; and assigning the nodes to streams of the accelerator based on the maximum matching.
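
The four steps above can be read as a minimum path cover of a directed acyclic graph: each matched edge links an operator to the next operator on the same stream, and each resulting chain becomes one stream. The following is a minimal Python sketch under that reading, not the implementation of the embodiment. The function names (`transitive_reduction`, `maximum_matching`, `assign_streams`) and the example graph are assumptions made for illustration, and the matching uses a simple augmenting-path search rather than any particular algorithm named in the source.

```python
from collections import defaultdict


def transitive_reduction(edges):
    """Minimum equivalent graph of a DAG: drop each edge (u, v) for which
    some other path from u to v exists."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)

    def reachable_without(src, dst, skipped):
        stack, seen = [src], set()
        while stack:
            x = stack.pop()
            for y in succ.get(x, []):
                if (x, y) == skipped or y in seen:
                    continue
                if y == dst:
                    return True
                seen.add(y)
                stack.append(y)
        return False

    return [(u, v) for u, v in edges if not reachable_without(u, v, (u, v))]


def maximum_matching(nodes, edges):
    """Bipartite graph: one left copy and one right copy per node, and an edge
    (left u, right v) for every graph edge (u, v). Returns {v: u} for matched pairs."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    match_right = {}

    def try_augment(u, visited):
        for v in succ.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            if v not in match_right or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    for u in nodes:
        try_augment(u, set())
    return match_right


def assign_streams(nodes, edges):
    """Follow matched edges to form chains; each chain gets one stream id."""
    meg = transitive_reduction(edges)
    match_right = maximum_matching(nodes, meg)
    next_in_chain = {u: v for v, u in match_right.items()}
    stream_of, stream_id = {}, 0
    for n in nodes:
        if n in match_right:      # has a matched predecessor, so not a chain head
            continue
        cur = n
        while cur is not None:
            stream_of[cur] = stream_id
            cur = next_in_chain.get(cur)
        stream_id += 1
    return stream_of


# Hypothetical diamond-shaped model with one redundant edge ("a", "d").
ops = ["a", "b", "c", "d"]
deps = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("a", "d")]
streams = assign_streams(ops, deps)
print(streams)
```

In this example the independent operators `b` and `c` land on different streams, and two streams suffice for the whole graph.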

The deep learning model may include a static neural network.

An apparatus for lightweighting and parallelizing accelerator operation scheduling according to an embodiment includes a processor that transforms a deep learning model based on an operator-to-stream mapping algorithm, pre-runs the transformed deep learning model with sample input data having a predetermined data shape, and generates a scheduling result through the pre-run.

The processor may record accelerator operation execution requests issued during the pre-run, and may record accelerator memory allocation or release requests issued during the pre-run.

The processor may generate an accelerator operation execution request record based on the recorded accelerator operation execution requests, and may allocate memory to the accelerator based on the recorded accelerator memory allocation or release requests.

The processor may convert the deep learning model into a minimum equivalent graph, generate a bipartite graph for the minimum equivalent graph, determine a maximum matching of the bipartite graph, and assign the nodes to streams of the accelerator based on the maximum matching.

An electronic device according to an embodiment includes: a host processor that transforms a deep learning model based on an operator-to-stream mapping algorithm, pre-runs the transformed deep learning model with sample input data having a predetermined data shape, and generates a scheduling result through the pre-run; and an accelerator that executes the deep learning model according to the schedule determined by the scheduler.

The host processor may receive input data, and the accelerator may perform a deep learning operation on the input data based on the scheduling result, without separately scheduling for the input data.

FIG. 1 is a diagram for describing an electronic device according to an embodiment.
FIG. 2 is a block diagram of a host processor according to an embodiment.
FIG. 3 is a diagram for describing a method of operating a deep learning model converter and a pre-scheduler according to an embodiment.
FIG. 4 is a diagram illustrating an example of a method for lightweighting and parallelizing accelerator operation scheduling according to an embodiment.
FIGS. 5A and 5B are diagrams for describing an operator-to-stream mapping algorithm according to an embodiment.
FIGS. 6 and 7 are diagrams illustrating examples of an electronic device according to an embodiment.

The specific structural or functional descriptions disclosed herein are provided only to illustrate embodiments according to the technical concept; actual implementations may take various other forms and are not limited to the embodiments described herein.

Terms such as "first" and "second" may be used to describe various components, but these terms should be understood only as distinguishing one component from another. For example, a first component may be termed a second component, and similarly, a second component may be termed a first component.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may be present. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening component is present. Expressions describing relationships between components, such as "between" and "immediately between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined herein.

The embodiments may be implemented in various types of products, such as personal computers, laptop computers, tablet computers, smartphones, televisions, smart home appliances, intelligent vehicles, kiosks, and wearable devices. Hereinafter, embodiments are described in detail with reference to the accompanying drawings. Like reference numerals in the drawings denote like elements.

FIG. 1 is a diagram for describing an electronic device according to an embodiment.

Referring to FIG. 1, an electronic device 100 according to an embodiment may include a host processor 110, an off-chip memory 120, a memory controller 130, and an accelerator 140. The host processor 110, the off-chip memory 120, the memory controller 130, and the accelerator 140 may communicate with one another through a bus, a Network on a Chip (NoC), Peripheral Component Interconnect Express (PCIe), or the like.

The host processor 110 is a device that controls the operation of the components included in the electronic device 100 and may include, for example, a central processing unit (CPU). The host processor 110 receives one or more requests to process a neural network on the accelerator 140 and, in response, generates instructions executable on the accelerator 140. A request may be for neural-network-based data inference, for example to have the accelerator 140 execute a neural network and obtain an inference result for object recognition, pattern recognition, computer vision, speech recognition, machine translation, machine interpretation, and the like. The host processor 110 may transfer the inference target data and the parameters of the neural network to the accelerator 140. A request may also be for training the neural network, in which case the host processor 110 may transfer the training target data and the parameters of the neural network to the accelerator 140.

The off-chip memory 120 is a memory located outside the accelerator 140 and may be, for example, a dynamic random access memory (DRAM) used as the main memory of the electronic device 100. The off-chip memory 120 may store inference target data and/or parameters of the neural network to be executed on the accelerator 140, and the stored data may later be transferred to the accelerator 140 for inference. The off-chip memory 120 may also be used when the on-chip memory inside the accelerator 140 is not sufficient to execute the neural network on the accelerator 140.

The off-chip memory 120 has a larger capacity than the on-chip memory inside the accelerator 140, but when a neural network is executed, the cost for the accelerator 140 to access the off-chip memory 120 is higher than the cost of accessing its internal on-chip memory. The memory access cost may represent the power and/or time required to access the memory and read or write data.

The accelerator 140 is an AI accelerator that executes a neural network according to instructions from the host processor 110 to perform inference on input data, and may be a processor separate from the host processor 110. For example, the accelerator 140 may be a GPU, a neural processing unit (NPU), a tensor processing unit (TPU), a digital signal processor (DSP), or the like.

The accelerator 140 may process tasks that, given the characteristics of neural network operations, are handled more efficiently by a separate dedicated processor (that is, the accelerator 140) than by the general-purpose host processor 110. One or more processing elements (PEs) and the on-chip memory included in the accelerator 140 may be used for this purpose. The on-chip memory is a device including a global shared buffer and/or a local buffer inside the accelerator 140, and is distinct from the off-chip memory 120 located outside the accelerator 140. For example, the on-chip memory may include a scratchpad memory accessible through an address space, a static random access memory (SRAM), or the like.

A neural network includes a plurality of layers. In an embodiment, the neural network includes an input layer, a plurality of hidden layers, and an output layer. Each layer includes a plurality of nodes, also called artificial neurons. Each node represents a computational unit having one or more inputs and outputs, and the nodes may be interconnected. Weights may be set for the connections between nodes, and these weights may be adjusted or changed. A weight amplifies, reduces, or maintains the associated data value and thereby determines how much that value influences the final result. Each node in the output layer may receive weighted inputs from the nodes in the previous layer. The process by which weighted data is passed from one layer to the next may be referred to as propagation.
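
As a small illustration of propagation, the sketch below computes the weighted inputs that one layer passes to the next. The function name `propagate` and the numeric values are assumptions chosen for the example; activation functions and biases are left out to keep the focus on the weights.

```python
def propagate(inputs, weights):
    """Return, for each node of the next layer, the weighted sum of the
    previous layer's outputs. weights[j][i] is the weight of the connection
    from previous-layer node i to next-layer node j."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


previous_layer = [1.0, 2.0]
weights = [
    [0.5, -1.0],  # first node: reduces input 0, inverts input 1
    [2.0, 0.0],   # second node: amplifies input 0, ignores input 1
]
next_layer = propagate(previous_layer, weights)
print(next_layer)
```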

To perform deep learning training and inference on the accelerator 140 according to an embodiment, an accelerator operation scheduling process must precede each request for the accelerator 140 to execute an operation. Accelerator operation scheduling is the series of tasks needed to request that the accelerator 140 execute an operation. It may include selecting the type of accelerator operation according to the shape of the input data, allocating accelerator memory for the output data and the operation scratchpad, whose sizes are determined by the selected operation type and the input data shape, and preparing the arguments of the accelerator operation function.
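
The per-operation work described above can be sketched roughly as follows for a single element-wise operation. This is an illustrative assumption, not the scheduling code of any real framework: the variant names (`relu_vec4`, `relu_scalar`), the selection rule, and the toy allocator are hypothetical, and scratchpad allocation is not modeled.

```python
from math import prod


def schedule_elementwise(op_name, input_shape, allocate, dtype_bytes=4):
    """Perform the three scheduling tasks for one element-wise operation."""
    numel = prod(input_shape)
    # 1. Select the operation variant from the input shape (hypothetical rule).
    kernel = f"{op_name}_vec4" if numel % 4 == 0 else f"{op_name}_scalar"
    # 2. Derive the output shape and allocate accelerator memory for it.
    output_shape = tuple(input_shape)
    out_handle = allocate(numel * dtype_bytes)
    # 3. Prepare the arguments that the operation function will receive.
    args = (out_handle, numel)
    return kernel, output_shape, args


# Toy allocator that only records request sizes and returns a handle.
allocated = []


def allocate(nbytes):
    allocated.append(nbytes)
    return len(allocated) - 1


kernel, out_shape, args = schedule_elementwise("relu", (1, 3, 200, 200), allocate)
print(kernel, out_shape, args, allocated)
```

A conventional system repeats this kind of work for every operation on every run, which is the overhead the embodiment aims to avoid.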

Existing deep learning execution systems had to repeat the above scheduling process for every accelerator operation each time deep learning was executed, so the cost of accelerator operation scheduling accounted for a large share of the total execution time. In addition, existing systems schedule accelerator operations to run only one at a time, so idle accelerator resources cannot be fully utilized even when they are available.

As described in detail below, the method for lightweighting and parallelizing accelerator operation scheduling according to an embodiment may minimize the cost of accelerator scheduling. More specifically, the electronic device according to an embodiment schedules the operations of the deep learning model only once in advance and skips the scheduling process in subsequent iterations, which minimizes the scheduling cost. It also improves the scheduling method so that multiple accelerator operations can run at once, which reduces the deep learning execution time.

FIG. 2 is a block diagram of a host processor according to an embodiment.

Referring to FIG. 2, a host processor 200 according to an embodiment may include a deep learning model converter 210 and a pre-scheduler 220. However, the components of the host processor 200 are shown separately in FIG. 2 only to indicate that they can be functionally and logically separated; this does not mean that they must be physically separate components or implemented as separate code.

The host processor 200 according to an embodiment may receive a deep learning model. The deep learning model according to an embodiment may include a static neural network, which consists of the same accelerator operations that do not change as training and inference are repeated. More specifically, the deep learning model may include both a part that is a static neural network and a part that is not. The host processor 200 may generate a scheduling result by applying the method for lightweighting and parallelizing accelerator operation scheduling only to the part of the received deep learning model that corresponds to a static neural network. As described in detail below, at run time the part that is not a static neural network performs deep learning operations in the conventional manner, whereas the static neural network part, for which a scheduling result has been generated, may perform deep learning operations based on that scheduling result without any additional scheduling.

The deep learning model according to an embodiment may branch according to a property of the input (for example, the data shape) and include a different static neural network for each branch. For example, depending on the data shape, the input may be an image of size 3*200*200 or an image of size 3*250*250. Alternatively, the input may have a batch size of 1 (input data shape = 1*3*200*200) or a batch size of 4 (input data shape = 4*3*200*200).

The host processor 200 according to an embodiment may generate a scheduling result by applying the method for lightweighting and parallelizing accelerator operation scheduling to the static neural network corresponding to each case. Then, when a particular static neural network is selected at run time according to the property of the input, the deep learning operation may be performed based on the scheduling result for that static neural network, without any additional scheduling.

In this case, scheduling results need not be generated for the static neural networks of every possible case. The method for lightweighting and parallelizing accelerator operation scheduling may be applied only to a few frequently used static neural networks (that is, their scheduling results are generated in advance and reused), while the remaining static neural networks perform deep learning operations in the conventional manner.
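
One way to picture this selection is a lookup table keyed by input shape, with a fallback to conventional execution. The sketch below is an assumption about how such a dispatch could look; `make_runner` and the stub callables passed to it are hypothetical stand-ins for the pre-scheduler, the replay path, and the conventional path.

```python
def make_runner(frequent_shapes, pre_schedule, replay, run_conventional):
    # Scheduling results are generated once, ahead of time, for the chosen shapes only.
    schedules = {shape: pre_schedule(shape) for shape in frequent_shapes}

    def run(data, shape):
        schedule = schedules.get(shape)
        if schedule is not None:
            return replay(schedule, data)        # no scheduling at run time
        return run_conventional(data, shape)     # schedule every operation as usual

    return run


# Stub callables standing in for the real system (all hypothetical).
calls = []
run = make_runner(
    frequent_shapes=[(1, 3, 200, 200), (4, 3, 200, 200)],
    pre_schedule=lambda shape: {"shape": shape},
    replay=lambda schedule, data: calls.append("replay") or "replayed",
    run_conventional=lambda data, shape: calls.append("conventional") or "scheduled",
)
first = run(data=None, shape=(1, 3, 200, 200))   # shape with a stored scheduling result
second = run(data=None, shape=(1, 3, 250, 250))  # shape without one
print(first, second, calls)
```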

In the following, for convenience of description, accelerator operations according to an embodiment are described using GPU operations as an example. However, accelerator operations according to an embodiment are not limited to GPU operations.

The host processor 200 according to an embodiment may transform a user-defined deep learning model using the deep learning model converter 210, perform scheduling once using the pre-scheduler 220, and pass the scheduling result to the accelerator. The accelerator according to an embodiment may repeatedly perform deep learning training and inference based on the result provided by the pre-scheduler 220.

The operation of the deep learning model converter 210 and the pre-scheduler 220 is described in detail below with reference to FIG. 3.

FIG. 3 is a diagram for describing a method of operating a deep learning model converter and a pre-scheduler according to an embodiment.

Referring to FIG. 3, steps 310 to 330 may be performed by the host processor described above with reference to FIGS. 1 and 2. The operations of FIG. 3 may be performed in the order and manner shown, but the order of some operations may be changed, or some operations may be omitted, without departing from the spirit and scope of the illustrated embodiment. Many of the operations shown in FIG. 3 may be performed in parallel or concurrently.

In step 310, the deep learning model converter 210 according to an embodiment transforms the deep learning model based on an operator-to-stream mapping algorithm. The deep learning model converter 210 may receive a deep learning model written by a user as input. It identifies the relationships among the GPU operations that make up the given deep learning model and runs the operator-to-stream mapping algorithm, which assigns each GPU operation to an appropriate GPU stream.

The deep learning model converter 210 may generate an operator-to-stream mapping that always places GPU operations that have no dependency on each other, and can therefore run concurrently, on different GPU streams, while minimizing the number of synchronizations between GPU streams needed for a correct deep learning result. The operator-to-stream mapping algorithm according to an embodiment is described below with reference to FIGS. 5A and 5B.

The deep learning model converter 210 may transform the deep learning model using the operator-to-stream mapping generated in this way. The transformed deep learning model may be represented as a graph in which each operation corresponds to a node and the data flow between operations corresponds to an edge. So that the deep learning model executes correctly, the deep learning model converter 210 may insert into the graph, at the appropriate places, routines that place each GPU operation on the GPU stream assigned by the above algorithm and routines that request synchronization between GPU streams.
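
The inserted routines can be pictured as a launch plan in which every dependency that crosses streams is guarded by an event: the producer's stream records it and the consumer's stream waits on it. The sketch below is a simplified assumption of that idea in plain Python. The function `build_launch_plan`, the tuple format of the plan, and the example mapping are hypothetical; on a CUDA GPU the record and wait steps would correspond to calls such as `cudaEventRecord` and `cudaStreamWaitEvent`.

```python
def build_launch_plan(topo_order, edges, stream_of):
    """Emit launches in topological order, adding a wait before an operation
    for each predecessor on another stream, and a record after each operation
    that has a successor on another stream."""
    preds = {op: [] for op in topo_order}
    cross_stream_producers = set()
    for u, v in edges:
        preds[v].append(u)
        if stream_of[u] != stream_of[v]:
            cross_stream_producers.add(u)

    plan = []
    for op in topo_order:
        for p in preds[op]:
            if stream_of[p] != stream_of[op]:
                plan.append(("wait_event", stream_of[op], p))
        plan.append(("launch", stream_of[op], op))
        if op in cross_stream_producers:
            plan.append(("record_event", stream_of[op], op))
    return plan


# Hypothetical diamond-shaped model already mapped onto two streams.
order = ["a", "b", "c", "d"]
graph_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
mapping = {"a": 0, "b": 0, "c": 1, "d": 0}
plan = build_launch_plan(order, graph_edges, mapping)
for step in plan:
    print(step)
```

Dependencies within one stream need no event because operations on the same stream already run in issue order, so only the two cross-stream edges produce synchronization steps here.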

In step 320, the pre-scheduler 220 according to an embodiment may pre-run the transformed deep learning model with sample input data having a predetermined data shape.

More specifically, the pre-scheduler 220 according to an embodiment may take the transformed deep learning model as input and pre-run deep learning training or inference for the input data shape desired by the user. Like execution in other general deep learning execution systems, the pre-run according to an embodiment includes the GPU operation scheduling process.

When a GPU operation execution request occurs during the pre-run, the pre-scheduler 220 may record which GPU operation was requested. Likewise, when a GPU memory allocation or release request occurs during the pre-run, the pre-scheduler 220 may record it. Based on the recorded GPU memory allocation and release requests, the pre-scheduler 220 may determine the amount of GPU memory needed for the pre-run and allocate and reserve that amount of GPU memory separately. This secures the resource (GPU memory) needed to execute the GPU operations. The GPU operation execution request record and the reserved GPU memory may together be referred to as the "scheduling result".
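
The recording described above can be sketched as a small recorder that is notified of every request during the pre-run. This is a minimal sketch under stated assumptions: the class `PreRunRecorder`, its callback names, and the kernel names in the example are hypothetical, and the amount of memory to reserve is taken here to be the peak of simultaneously live allocations, which is one plausible reading of the text rather than something the source specifies.

```python
class PreRunRecorder:
    """Records operation execution requests and memory requests during a pre-run."""

    def __init__(self):
        self.launches = []     # recorded operation execution requests, in order
        self.sizes = {}        # currently live allocations: buffer name -> bytes
        self.live_bytes = 0
        self.peak_bytes = 0    # assumption: reserve the peak live footprint

    def on_alloc(self, name, nbytes):
        self.sizes[name] = nbytes
        self.live_bytes += nbytes
        self.peak_bytes = max(self.peak_bytes, self.live_bytes)

    def on_free(self, name):
        self.live_bytes -= self.sizes.pop(name)

    def on_launch(self, kernel, stream, args):
        self.launches.append((kernel, stream, tuple(args)))

    def scheduling_result(self):
        return {"launches": list(self.launches), "reserved_bytes": self.peak_bytes}


# Hypothetical pre-run of a two-operation model.
rec = PreRunRecorder()
rec.on_alloc("x", 100)
rec.on_alloc("y", 200)
rec.on_launch("conv", 0, ["x", "y"])
rec.on_free("x")
rec.on_alloc("z", 50)
rec.on_launch("relu", 0, ["y", "z"])
result = rec.scheduling_result()
print(result)
```

At run time, the recorded launches could then be replayed in order against the reserved memory without repeating kernel selection or allocation.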

In step 330, the pre-scheduler 220 according to an embodiment may generate a scheduling result through the pre-run. Because the pre-scheduler 220 pre-runs the model transformed by the deep learning model converter 210, it may generate a scheduling result that uses a plurality of GPU streams so that GPU operations can be executed concurrently.

FIG. 4 is a diagram illustrating an example of a method for lightweighting and parallelizing accelerator operation scheduling according to an embodiment. The description given with reference to FIGS. 1 to 3 applies equally to FIG. 4, and redundant description is omitted.

Referring to FIG. 4, the pre-scheduler 220 according to an embodiment may take the transformed deep learning model as input and pre-run deep learning training or inference for the input data format desired by the user. The pre-run according to an embodiment may include GPU operation scheduling, GPU memory allocation/release requests, and GPU operation execution requests, and these actions may be performed repeatedly.

Furthermore, whenever a request to execute a GPU operation occurs during the pre-run, the pre-scheduler 220 may record which GPU operation was requested, and whenever a GPU memory allocation/release request occurs during the pre-run, the pre-scheduler 220 may record it.

In the run-time stage according to an embodiment, the electronic device may receive as input the scheduling result output by the pre-scheduler. The electronic device may also receive the input data (for example, images, audio, or text) that the user intends to use for training or inference.

When performing deep learning training or inference on the received data, the electronic device according to an embodiment may request the GPU to execute operations directly, using the scheduling result generated by the GPU operation pre-scheduler, without a separate scheduling process.

The electronic device according to an embodiment completes scheduling only once in advance and reuses the previously generated scheduling result for subsequent iterations, so that it may request the GPU to execute operations without scheduling overhead.
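
The schedule-once, replay-many pattern can be sketched in plain Python as follows. The class `RecordedSchedule`, the function `pre_run`, the operation names, and the `submit` callback are all hypothetical stand-ins; a real system would submit GPU kernels rather than append tuples to a list.

```python
from typing import Callable, List, Tuple

Request = Tuple[str, int]  # (operation name, stream index)


class RecordedSchedule:
    """Holds the operation execution requests captured during one pre-run."""

    def __init__(self) -> None:
        self.requests: List[Request] = []

    def record(self, op: str, stream: int) -> None:
        self.requests.append((op, stream))

    def replay(self, submit: Callable[[str, int], None]) -> None:
        # No scheduling decisions are made here: requests are re-issued as recorded.
        for op, stream in self.requests:
            submit(op, stream)


def pre_run(schedule: RecordedSchedule) -> None:
    # Stand-in for the (expensive) scheduling pass performed once on sample input.
    for op, stream in [("conv", 0), ("relu", 0), ("pool", 1)]:
        schedule.record(op, stream)


schedule = RecordedSchedule()
pre_run(schedule)                      # scheduling happens once

submitted: List[Request] = []
for _ in range(3):                     # three later iterations reuse the result
    schedule.replay(lambda op, stream: submitted.append((op, stream)))

print(len(submitted))
```

The design point is that `replay` contains no decision logic, so its per-iteration cost is limited to re-issuing the recorded requests.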

In addition, by analyzing the relationships between GPU operations and placing GPU operations that can run concurrently on different GPU streams, those operations can execute concurrently on the GPU, making full use of GPU resources. Minimizing the number of synchronizations required between GPU streams allows GPU operation execution requests to proceed quickly without being delayed by inter-stream synchronization. Together, these effects may ultimately shorten the execution time of deep learning model training and inference.

FIGS. 5A and 5B are diagrams for explaining an operation-stream assignment algorithm according to an embodiment.

The deep learning model converter 210 according to an embodiment may perform an operation-stream assignment algorithm that identifies the relationships among the GPU operations constituting a given deep learning model and assigns each GPU operation to an appropriate GPU stream.

Referring to FIG. 5A, the deep learning model according to an embodiment may be expressed as a graph 510 composed of nodes, each representing an operator of the deep learning model, and edges, each representing a relationship between operators.

The deep learning model converter 210 may convert the deep learning model represented by the graph 510 into a minimum equivalent graph 520. The minimum equivalent graph 520 may refer to the smallest subgraph of the graph 510 that has the same reachability relation as the graph 510. The minimum equivalent graph 520 according to an embodiment is unique and may be constructed in polynomial time.
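
For a directed acyclic graph, the minimum equivalent graph coincides with the transitive reduction. The following is a minimal sketch under the assumption that the model graph is acyclic; the function name `minimum_equivalent_graph` and the three-node example are hypothetical, and the per-edge reachability check is chosen for readability rather than speed.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def minimum_equivalent_graph(nodes: Iterable[Hashable], edges: List[Edge]) -> Set[Edge]:
    """Return the edges of the transitive reduction of a DAG."""
    succ: Dict[Hashable, Set[Hashable]] = {v: set() for v in nodes}
    for u, v in edges:
        succ[u].add(v)

    def reachable_without(src: Hashable, skipped: Edge) -> Set[Hashable]:
        # Nodes reachable from `src` when the direct edge `skipped` is not used.
        stack = [w for w in succ[src] if (src, w) != skipped]
        seen: Set[Hashable] = set()
        while stack:
            x = stack.pop()
            if x in seen:
                continue
            seen.add(x)
            stack.extend(succ[x])
        return seen

    # An edge (u, v) is kept only if no longer path from u to v exists.
    return {(u, v) for u, v in edges if v not in reachable_without(u, (u, v))}


# Hypothetical example: the edge (1, 3) is implied by 1 -> 2 -> 3.
meg = minimum_equivalent_graph([1, 2, 3], [(1, 2), (2, 3), (1, 3)])
print(sorted(meg))
```

Dropping the implied edge does not change which operations must precede which, but it reduces the number of dependencies that later steps have to consider.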

The deep learning model converter 210 may generate a bipartite graph for the minimum equivalent graph. Furthermore, the deep learning model converter 210 may determine a maximum matching of the bipartite graph. The deep learning model converter 210 may determine the maximum matching of the bipartite graph based on the Ford-Fulkerson algorithm. However, the method of determining the maximum matching of the bipartite graph is not limited to this example.

The deep learning model converter 210 may assign nodes to streams of the accelerator based on the maximum matching. More specifically, the deep learning model converter 210 may generate a collection of node sets in which each node of the minimum equivalent graph 520 is a separate set. For example, based on the maximum matching, the deep learning model converter 210 may determine v₁, v₂, v₅ as one node set, v₃, v₆ as another node set, and v₄, v₇ as yet another node set. The deep learning model converter 210 may then assign v₁, v₂, v₅ to a first stream of the GPU, v₃, v₆ to a second stream of the GPU, and v₄, v₇ to a third stream of the GPU.
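
The matching-based assignment can be sketched as follows. The sketch assumes the usual path-cover construction: every node appears once on the left side and once on the right side of the bipartite graph, and each edge u → v of the minimum equivalent graph becomes an edge from the left copy of u to the right copy of v. The matching is found with simple augmenting paths (Kuhn's algorithm, an instance of the Ford-Fulkerson method on a unit-capacity network). The function name `assign_streams` and the seven-node edge list are hypothetical; FIG. 5A is not reproduced here, and the edges were chosen only so that the result matches the grouping in the example above.

```python
from typing import Dict, Hashable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def assign_streams(nodes: List[Hashable], edges: List[Edge]) -> List[List[Hashable]]:
    """Group nodes into chains (one chain per stream) using a maximum bipartite matching."""
    succ: Dict[Hashable, List[Hashable]] = {u: [] for u in nodes}
    for u, v in edges:
        succ[u].append(v)

    match_of_right: Dict[Hashable, Hashable] = {}  # right copy v -> matched left copy u

    def try_augment(u: Hashable, visited: Set[Hashable]) -> bool:
        # Look for an augmenting path that starts at the left copy of u.
        for v in succ[u]:
            if v in visited:
                continue
            visited.add(v)
            if v not in match_of_right or try_augment(match_of_right[v], visited):
                match_of_right[v] = u
                return True
        return False

    for u in nodes:
        try_augment(u, set())

    # A matched edge (u, v) means v directly follows u on the same stream.
    next_in_chain = {u: v for v, u in match_of_right.items()}
    streams: List[List[Hashable]] = []
    for v in nodes:
        if v in match_of_right:
            continue  # v has a predecessor in its chain, so it is not a chain head
        chain = [v]
        while chain[-1] in next_in_chain:
            chain.append(next_in_chain[chain[-1]])
        streams.append(chain)
    return streams


# Hypothetical minimum equivalent graph with nodes v1..v7.
nodes = [1, 2, 3, 4, 5, 6, 7]
edges = [(1, 2), (1, 3), (1, 4), (2, 5), (3, 6), (4, 7)]
streams = assign_streams(nodes, edges)
print(streams)
```

Because each matched edge merges two node sets into one chain, a larger matching yields fewer chains; with 7 nodes and a matching of size 4, the sketch produces 3 chains, one per stream.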

The operation-stream assignment algorithm described with reference to FIG. 5A may be expressed as shown in FIG. 5B.

FIGS. 6 and 7 are diagrams illustrating examples of an electronic device according to an embodiment.

Referring to FIG. 6, an electronic device according to an embodiment may be implemented as a server 600.

The server 600 is a device separate from the user terminals controlled by users and may communicate with one or more user terminals over a wired and/or wireless network. The server 600 may receive requests that a plurality of users transmit simultaneously through their respective terminals. Through the host processor 710 described above, the server 600 may transform a deep learning model defined by a user and deliver the scheduling result obtained through pre-scheduling to the accelerator.

The accelerator 720 may repeatedly perform deep learning training and inference based on the output provided by the host processor 710. The server 600 may then return the results of the deep learning operations to the corresponding user terminals. User terminals may include, for example, computing devices such as smartphones, tablets, laptops, and personal computers; wearable devices such as smart watches and smart glasses; home appliances such as smart speakers, smart TVs, and smart refrigerators; smart cars; smart kiosks; and Internet of Things (IoT) devices.

Referring to FIG. 7, an electronic device according to an embodiment may be implemented as a user terminal 700. Although FIG. 7 shows the user terminal 700 as a smartphone for convenience of description, any device controlled by a user may be used without limitation. The user terminal 700 may obtain requests directly from the user and deliver the scheduling result to the accelerator 720 through the host processor 710 described above. The accelerator 720 may repeatedly perform deep learning training and inference according to the scheduling result.

The embodiments described above may be implemented by hardware components, software components, and/or a combination of hardware and software components. For example, the apparatuses, methods, and components described in the embodiments may be implemented using a general-purpose or special-purpose computer, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination thereof, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored in a computer-readable recording medium.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of a system, structure, apparatus, circuit, or the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the following claims.

Claims

1. A method for lightweighting and parallelizing accelerator operation scheduling, the method comprising:
pre-running a deep learning model with sample input data having a predetermined data format; and
generating a scheduling result through the pre-run.

2. The method of claim 1, further comprising:
receiving input data; and
performing a deep learning operation on the input data based on the scheduling result, without separate scheduling for the input data.

3. The method of claim 1, wherein the pre-running comprises:
recording an accelerator operation execution request generated during the pre-run; and
recording an accelerator memory allocation or release request generated during the pre-run.

4. The method of claim 3, wherein the generating of the scheduling result comprises:
generating an accelerator operation execution request record based on the recorded accelerator operation execution request; and
allocating accelerator memory based on the recorded accelerator memory allocation or release request.

5. The method of claim 1, wherein the deep learning model can be expressed as a graph composed of nodes, each representing an operator of the deep learning model, and edges, each representing a relationship between operators.

6. The method of claim 1, further comprising:
transforming the deep learning model based on an operator-to-stream mapping algorithm.

7. The method of claim 6, wherein the transforming of the deep learning model comprises:
converting the deep learning model into a minimum equivalent graph;
generating a bipartite graph for the minimum equivalent graph;
determining a maximum matching of the bipartite graph; and
assigning nodes to streams of the accelerator based on the maximum matching.

8. The method of claim 1, wherein the deep learning model includes a static neural network.

9. A computer program stored in a medium, in combination with hardware, for executing the method of any one of claims 1 to 8.

10. An apparatus for lightweighting and parallelizing accelerator operation scheduling, the apparatus comprising:
a processor configured to pre-run a deep learning model with sample input data having a predetermined data format and to generate a scheduling result through the pre-run.

11. The apparatus of claim 10, wherein the processor records an accelerator operation execution request generated during the pre-run and records an accelerator memory allocation or release request generated during the pre-run.

12. The apparatus of claim 11, wherein the processor generates an accelerator operation execution request record based on the recorded accelerator operation execution request and allocates accelerator memory based on the recorded accelerator memory allocation or release request.

13. The apparatus of claim 10, wherein the deep learning model can be expressed as a graph composed of nodes, each representing an operator of the deep learning model, and edges, each representing a relationship between operators.

14. The apparatus of claim 10, wherein the processor transforms the deep learning model based on an operator-to-stream mapping algorithm.

15. The apparatus of claim 14, wherein the processor converts the deep learning model into a minimum equivalent graph, generates a bipartite graph for the minimum equivalent graph, determines a maximum matching of the bipartite graph, and assigns nodes to streams of the accelerator based on the maximum matching.

16. The apparatus of claim 10, wherein the deep learning model includes a static neural network.

17. An electronic device comprising:
a host processor configured to pre-run a deep learning model with sample input data having a predetermined data format and to generate a scheduling result through the pre-run; and
an accelerator configured to execute the deep learning model according to a schedule determined by the host processor.

18. The electronic device of claim 17, wherein the host processor receives input data, and the accelerator performs a deep learning operation on the input data based on the scheduling result, without separate scheduling for the input data.