KR102430837B1

KR102430837B1 - Method of dividing plurality of layers included in machine learning model and determining processor that performs divided layers, and device performing method

Info

Publication number: KR102430837B1
Application number: KR1020200106168A
Authority: KR
Inventors: 백웅기; 한명균; 현지훈; 박성범; 박진수
Original assignee: 울산과학기술원
Priority date: 2020-08-24
Filing date: 2020-08-24
Publication date: 2022-08-09
Also published as: KR20220025452A

Abstract

According to an embodiment of the present invention, a method of executing a machine learning model on a device including a plurality of processors may include: receiving a machine learning model including a plurality of layers; generating profiling data by profiling at least one of the execution time and the power consumption of each of the plurality of layers; determining the size of an input tensor input to each of the plurality of layers and the size of an output tensor output from each of the plurality of layers; calculating a first minimum execution cost of executing the plurality of layers by using the profiling data, the input tensor size, and the output tensor size of each of the plurality of layers; dividing the plurality of layers into one or more slices based on the first minimum execution cost; and executing the one or more slices by using one or more of the plurality of processors.

Description

Method of dividing a plurality of layers included in a machine learning model and determining a processor that performs the divided layers, and device performing the method

The present invention relates to a method of dividing a plurality of layers included in a machine learning model and determining a processor that executes the divided layers, and to a device that performs the method.

A heterogeneous embedded system is a system that includes two or more processors with different characteristics (e.g., performance, energy, communication overhead, functionality (executable mathematical operations), and memory capacity).

A heterogeneous embedded system may include a CPU (Central Processing Unit) comprising a big core that handles high-performance tasks at high power, a little core that handles low-performance tasks at low power, and a middle core that handles tasks of intermediate performance between the big and little cores; a GPU (Graphics Processing Unit) that mainly handles image-related processing; and an NPU (Neural Network Processing Unit) that mainly handles machine-learning-related processing.

As machine learning models have become widespread, the need to run machine learning on embedded systems has grown. Applications in many fields, such as object recognition in autonomous vehicles and handwriting recognition on smartphones, require machine learning inference in real time.

However, the performance of executing a machine learning model can vary greatly depending on which of the processors in a heterogeneous embedded system executes each layer of the model, so how to determine the processor that executes each layer becomes a problem.

An object of the present invention is to solve the above problem by providing a method of dividing a plurality of layers included in a machine learning model and determining a processor that executes the divided layers.

However, the problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present invention pertains.

The method may further include determining, based on the first minimum execution cost, which of the plurality of processors executes each of the plurality of layers.

Calculating the first minimum execution cost may include: determining, by using the profiling data, the input tensor size, and the output tensor size of a first number of consecutive layers among the plurality of layers, a first total execution cost incurred when each of the plurality of processors executes the first number of consecutive layers, and a second minimum execution cost for the remaining layers other than the first number of consecutive layers; determining, by using the input tensor size and the output tensor size of a second number of consecutive layers among the plurality of layers, a second total execution cost incurred when each of the plurality of processors executes the second number of consecutive layers, and a third minimum execution cost for the remaining layers other than the second number of consecutive layers; and determining, as the first minimum execution cost, the first sum, which is the smaller of a first sum of the first total execution cost and the second minimum execution cost and a second sum of the second total execution cost and the third minimum execution cost.

The first total execution cost may be determined by using the size of the input tensor of the layer executed first among the first number of consecutive layers, the size of the output tensor of the layer executed last among the first number of consecutive layers, and the execution cost of the first number of consecutive layers themselves.

The execution cost of the first number of consecutive layers themselves may be determined based on a criterion for optimizing processor performance.

When the criterion is execution performance, the execution cost of the first number of consecutive layers themselves is determined according to the execution performance of those layers; when the criterion is energy consumption, it may be determined according to both the execution performance and the total power consumption of those layers.

The execution performance and the total power consumption may be determined based on the operating frequency of the processor that executes each of the first number of consecutive layers.
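
The two optimization criteria can be illustrated with a minimal sketch. It assumes that execution performance is expressed as an execution time in seconds and that energy is the product of that time and the total power; the function names `layer_execution_cost` and `slice_execution_cost` and the criterion labels are hypothetical and do not appear in the specification.

```python
def layer_execution_cost(time_s, power_w, criterion):
    """Cost of running one layer on one processor (illustrative only)."""
    if criterion == "latency":
        # Execution-performance criterion: the cost is the execution time.
        return time_s
    if criterion == "energy":
        # Energy criterion: the cost is time multiplied by total power (joules).
        return time_s * power_w
    raise ValueError(f"unknown criterion: {criterion}")


def slice_execution_cost(layers, criterion):
    """Sum the per-layer costs of consecutive layers given as (time_s, power_w)."""
    return sum(layer_execution_cost(t, p, criterion) for t, p in layers)


# Two consecutive layers with assumed (time, power) measurements.
example_layers = [(0.002, 3.0), (0.001, 2.0)]
latency_cost = slice_execution_cost(example_layers, "latency")
energy_cost = slice_execution_cost(example_layers, "energy")
```

Under these assumptions the same pair of layers yields a cost in seconds for the latency criterion and a cost in joules for the energy criterion, so the two criteria can lead to different processor choices.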

According to another embodiment of the present invention, a device for executing a machine learning model including a plurality of layers includes: an input/output unit that receives the machine learning model; and a processor unit that includes a plurality of processors and controls the input/output unit. The processor unit may generate profiling data by profiling at least one of the execution time and the power consumption of each of the plurality of layers; determine the size of an input tensor input to each of the plurality of layers and the size of an output tensor output from each of the plurality of layers; calculate a first minimum execution cost of executing the plurality of layers by using the profiling data, the input tensor size, and the output tensor size of each of the plurality of layers; divide the plurality of layers into one or more slices based on the first minimum execution cost; and execute the one or more slices by using one or more of the plurality of processors.

According to an embodiment of the present invention, by providing a method of dividing the plurality of layers included in a machine learning model and determining the processor that executes the divided layers, the time and energy consumed in executing the machine learning model can be effectively reduced.

FIG. 1 is a block diagram of a machine learning model execution device that distributes slices of a machine learning model to processors according to an embodiment of the present invention.
FIG. 2 is a block diagram conceptually illustrating the functions of a layer distribution model according to an embodiment of the present invention.
FIG. 3 is a block diagram conceptually illustrating the functions of a cost calculator according to an embodiment of the present invention.
FIG. 4 illustrates a method of determining an execution cost according to an embodiment of the present invention.
FIG. 5 illustrates one step of determining a minimum execution cost according to an embodiment of the present invention.
FIG. 6 illustrates another step of determining a minimum execution cost according to an embodiment of the present invention.
FIG. 7 shows an example of the slices determined through the steps of FIGS. 5 and 6 and the processor that executes each slice.
FIG. 8 shows an example comparing the latency of running machine learning with the machine learning model execution device according to an embodiment of the present invention against that of running it with other conventional techniques.
FIG. 9 shows an example comparing the energy efficiency of running machine learning with the machine learning model execution device according to an embodiment of the present invention against that of running it with other conventional techniques.
FIG. 10 illustrates a method of dividing a plurality of layers into one or more slices and determining the processor that executes each slice according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the present invention pertains, and the present invention is defined only by the scope of the claims.

In describing the embodiments of the present invention, detailed descriptions of well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the present invention. The terms used below are defined in consideration of their functions in the embodiments of the present invention and may vary according to the intention or custom of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

FIG. 1 is a block diagram of a machine learning model execution device that distributes slices of a machine learning model to processors according to an embodiment of the present invention.

Referring to FIG. 1, the machine learning model execution device 100 may include a processor unit 110, an input/output unit 120, and a memory 130.

The processor unit 110 may control the overall operation of the machine learning model execution device 100. The processor unit 110 may include a plurality of processors including a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), and an NPU (Neural Network Processing Unit). The CPU may include a big core that handles high-performance tasks at high power, a little core that handles low-performance tasks at low power, and a middle core that handles tasks of intermediate performance between the big and little cores.

Unless otherwise stated, the functions of the processor unit 110 described in this specification are to be understood as being performed by one or more of the plurality of processors included in the processor unit 110.

The processor unit 110 may control the input/output unit 120 to receive a machine learning model or a program that uses a machine learning model. Alternatively, unlike what is shown in FIG. 1, the processor unit 110 may receive the machine learning model or the program through a transceiver 120 rather than the input/output unit 120. For convenience, this specification describes only the case in which a machine learning model is input (received), but the method described here applies equally when a program that uses a machine learning model is received.

The memory 130 may store the layer distribution model 200 and the information needed to execute the layer distribution model 200.

To execute the layer distribution model 200, the processor unit 110 may load the layer distribution model 200 and the information needed to execute it from the memory 130.

In this specification, the layer distribution model 200 may refer to software (computer program code) that divides the plurality of layers included in an input machine learning model into one or more slices and determines the processor responsible for each slice.

By executing the layer distribution model 200, the processor unit 110 may divide the plurality of layers included in the input machine learning model into one or more slices and determine, from among the plurality of processors, the processor responsible for each slice.

FIG. 2 is a block diagram conceptually illustrating the functions of a layer distribution model according to an embodiment of the present invention, FIG. 3 is a block diagram conceptually illustrating the functions of a cost calculator according to an embodiment of the present invention, and FIG. 4 illustrates a method of determining an execution cost according to an embodiment of the present invention.

Referring to FIG. 2, the layer distribution model 200 may include a profiler 210, a cost calculator 220, a slice determiner 230, and a slice executor 240.

The profiler 210, cost calculator 220, slice determiner 230, and slice executor 240 shown in FIG. 2 are a conceptual division of the functions of the layer distribution model 200, made so that those functions can be explained easily, and the model is not limited to this division. Depending on the embodiment, the profiler 210, the cost calculator 220, the slice determiner 230, and/or the slice executor 240 may be implemented as a series of instructions included in a single program.

The profiler 210 may profile the execution time and/or power consumption incurred when each of the plurality of processors executes each of the plurality of layers included in the received machine learning model, and may generate profiling data.

The cost calculator 220 may use the profiling data to determine the execution performance obtained when each of the plurality of processors executes each of the plurality of layers, and to predict the total power consumed in doing so.

If any of the plurality of processors supports DVFS (Dynamic Voltage and Frequency Scaling), the profiler 210 may profile the execution time and power consumption when that processor executes a layer at its minimum frequency, and the execution time and power consumption when it executes the layer at its maximum frequency.

In addition, if a particular processor cannot execute a particular layer because of memory limits, limits on the operations it can execute, or the like, the profiler 210 may set the execution cost of that processor executing that layer to infinity.
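
The effect of an infinite cost can be sketched as follows. The table contents, the processor labels, and the helper names `layer_cost` and `cheapest_processor` are hypothetical and serve only to illustrate the idea; a missing entry stands for a layer that a processor cannot execute.

```python
import math

# Hypothetical profiling results: (layer index, processor) -> time in ms.
# A missing entry means the processor cannot execute that layer.
measured_ms = {
    (0, "cpu_big"): 4.0, (0, "gpu"): 1.5, (0, "npu"): 0.8,
    (1, "cpu_big"): 2.0, (1, "gpu"): 1.0,  # assumed: NPU lacks this operation
}


def layer_cost(layer, processor):
    """Return the profiled cost, or infinity for an unsupported pairing."""
    return measured_ms.get((layer, processor), math.inf)


def cheapest_processor(layer, processors):
    """Pick the processor with the lowest cost for a single layer."""
    return min(processors, key=lambda p: layer_cost(layer, p))


processors = ["cpu_big", "gpu", "npu"]
choice_layer0 = cheapest_processor(0, processors)
choice_layer1 = cheapest_processor(1, processors)
```

Because any sum that contains an infinite term is itself infinite, a cost-minimizing search never selects an assignment that places a layer on a processor unable to execute it, and no separate feasibility check is needed.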

In addition, for use in determining communication costs, the profiler 210 may determine the size of the input tensor input to each of the plurality of layers and the size of the output tensor output from each of the plurality of layers.

The profiler 210 may transmit the input tensor sizes, the output tensor sizes, and the profiling data to the cost calculator 220.

The cost calculator 220 may receive the input tensor size and the output tensor size of each of the plurality of layers from the profiler 210.

Using the received input and output tensor sizes of each of the plurality of layers, the cost calculator 220 may calculate the communication cost incurred when each of the plurality of processors executes each of the plurality of layers. Using the profiling data received from the profiler 210, the cost calculator 220 may also calculate the execution performance and the total power consumption.

The cost calculator 220 may transmit the calculated communication cost, execution performance, and total power consumption to the slice determiner 230.

Referring further to FIG. 3, the cost calculator 220 may include a cost estimator 310, a performance predictor 320, and a power predictor 330.

The cost estimator 310, performance predictor 320, and power predictor 330 shown in FIG. 3 are a conceptual division of the functions of the cost calculator 220, made so that those functions can be explained easily, and the cost calculator is not limited to this division. Depending on the embodiment, the cost estimator 310, the performance predictor 320, and the power predictor 330 may be implemented as a series of instructions included in a single program.

The cost estimator 310 may use the size of the input tensor of each of the plurality of layers (e.g., in bytes) to determine the input communication cost incurred when each of the plurality of processors executes each of the plurality of layers, and may use the size of the output tensor of each of the plurality of layers (e.g., in bytes) to determine the corresponding output communication cost.

Here, the input communication cost is the resource (or amount of resource), such as time or energy, that a processor uses to receive the input data it needs to execute a layer, and the output communication cost is the resource (or amount of resource) that the processor uses to transmit the output data produced by executing the layer.

Depending on the embodiment, the cost estimator 310 may determine the input communication cost and the output communication cost based on a linear regression model. The input communication cost and the output communication cost may be determined based on Equation 1 below.

Here, c_in denotes the input communication cost, l denotes a layer (layer index), d denotes a processor (processor index), fd denotes the operating frequency of the d-th processor, c_out denotes the output communication cost, α_d,fd and β_d,fd denote linear regression coefficients that are used in the linear regression model and determined by linear regression for each processor (d) and each operating frequency (fd) of that processor, T_in,l denotes the size of the input tensor input to the l-th layer, and T_out,l denotes the size of the output tensor output from the l-th layer.

Accordingly, c_in,l,d,fd is the input communication cost incurred when the d-th processor, operating at frequency fd, executes the l-th layer, and c_out,l,d,fd is the output communication cost incurred when the d-th processor, operating at frequency fd, executes the l-th layer.
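
Given the linear regression model and the symbol definitions for Equation 1, the relation can plausibly be written as follows. This is a sketch under the assumption that α is the slope, β is the intercept, and the same coefficient pair applies to input and output transfers; the published equation may arrange the terms differently. The symbol f_d corresponds to fd in the text.

```latex
\begin{aligned}
c_{\mathrm{in},l,d,f_d}  &= \alpha_{d,f_d}\, T_{\mathrm{in},l}  + \beta_{d,f_d} \\
c_{\mathrm{out},l,d,f_d} &= \alpha_{d,f_d}\, T_{\mathrm{out},l} + \beta_{d,f_d}
\end{aligned}
```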

The performance predictor 320 may determine the execution performance obtained when each of the plurality of processors executes each of the plurality of layers.

Depending on the embodiment, the performance predictor 320 may determine the execution performance based on a linear regression model. The execution performance may be determined based on Equation 2 below.

Here, γ_l,d and ε_l,d denote linear regression coefficients that are used in the linear regression model and determined from the profiling data by linear regression for each processor (d) and each layer (l), and t_l,d,fd is the execution performance obtained when the d-th processor, operating at frequency fd, executes the l-th layer.
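
One way to picture such a model is the sketch below. It assumes, purely for illustration, that execution performance is an execution time of the form t = γ/fd + ε, and that γ and ε are solved from the two profiled points (minimum and maximum frequency) described for DVFS-capable processors. Neither this functional form nor the helper names `fit_time_model` and `predict_time` come from the specification, and the numbers are made up.

```python
def fit_time_model(f_min, t_min, f_max, t_max):
    """Solve t = gamma / f + eps through two profiled (frequency, time) points."""
    gamma = (t_min - t_max) / (1.0 / f_min - 1.0 / f_max)
    eps = t_max - gamma / f_max
    return gamma, eps


def predict_time(gamma, eps, f):
    """Predict the execution time of the layer at operating frequency f."""
    return gamma / f + eps


# Assumed profile of one layer on one processor: 10 ms at 0.5 GHz, 4 ms at 2 GHz.
gamma, eps = fit_time_model(f_min=0.5e9, t_min=10.0, f_max=2.0e9, t_max=4.0)
t_mid = predict_time(gamma, eps, 1.0e9)  # estimate at an unprofiled frequency
```

With only two profiled frequencies per layer and processor, a two-coefficient model is the most that can be fitted, which is one reason a linear form is a natural choice here.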

The power predictor 330 may predict the total power consumed when each of the plurality of processors executes each of the plurality of layers. The total power consumption may include dynamic power, which is consumed by the operation of the processor, and static power, which is consumed even when the processor is not operating.

The dynamic power may be determined from the processor's specification or by measurement, and the static power may be determined through offline profiling based on the processor and its operating frequency.

The power predictor 330 may determine the total power consumption based on Equation 3 below.

Here, V_fd denotes the operating voltage of the processor at operating frequency fd, and V_fd,max denotes the operating voltage of a DVFS-capable processor at its maximum frequency fd,max.

Accordingly, P_l,d,fd is the total power consumed when the d-th processor, operating at frequency fd, executes the l-th layer; P_dynamic,l,d,fd is the dynamic power determined from the profiling data produced by the profiler 210, that is, the dynamic power consumed when the d-th processor, operating at frequency fd, executes the l-th layer; and P_static,d,fd is the static power consumed when the d-th processor, operating at frequency fd, executes the l-th layer.
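
The decomposition of total power into dynamic and static parts gives the first relation below. The second relation is an assumption: it applies the standard CMOS scaling of dynamic power with the square of the voltage and with the frequency to the value profiled at the maximum frequency, which would explain why V_fd and V_fd,max are defined. The published Equation 3 may differ in form. The symbol f_d corresponds to fd in the text.

```latex
\begin{aligned}
P_{l,d,f_d} &= P_{\mathrm{dynamic},l,d,f_d} + P_{\mathrm{static},d,f_d} \\
P_{\mathrm{dynamic},l,d,f_d} &\approx P_{\mathrm{dynamic},l,d,f_{d,\max}}
  \left(\frac{V_{f_d}}{V_{f_{d,\max}}}\right)^{2} \frac{f_d}{f_{d,\max}}
\end{aligned}
```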

비용 추정부(310)는 결정한 입력 통신 비용(c_in,l,d,fd), 출력 통신 비용(c_out,l,d,fd), 수행 성능(t_l,d,fd) 및 총 소모 전력(P_l,d,fd)을 슬라이스 결정부(230)로 전송할 수 있다.The cost estimator 310 determines the input communication cost (c _in,l,d,fd ), the output communication cost (c _out,l,d,fd ), performance performance (t _l,d,fd ), and total power consumption (P _l,d,fd ) may be transmitted to the slice determiner 230 .

다시 도 2를 참조하면, 슬라이스 결정부(230)는, 비용 계산부(220)로부터 수신한 입력 통신 비용(c_in,l,d,fd), 출력 통신 비용(c_out,l,d,fd), 수행 성능(t_l,d,fd) 및 총 소모 전력(P_l,d,fd)을 이용하여, 복수의 레이어들을 하나 이상의 슬라이스(slice)로 분할하고, 복수의 프로세서들 중에서 각각의 슬라이스를 수행할 프로세서를 결정할 수 있다. 여기서, 슬라이스란 하나 이상의 연속된 레이어를 포함하는 레이어의 집합을 의미할 수 있다.Referring back to FIG. 2 , the slice determining unit 230 includes an input communication cost (c _in,l,d,fd ) and an output communication cost (c _out,l,d, fd ) received from the cost calculation unit 220 . ), performance (t _l,d,fd ), and total power consumption (P _l,d,fd ) to divide a plurality of layers into one or more slices, and each slice among a plurality of processors You can decide which processor to perform. Here, the slice may mean a set of layers including one or more consecutive layers.

The slice determiner 230 may assign the plurality of layers to the plurality of processors in units of slices so as to minimize the execution cost, which represents the resource (or amount of resource), such as time or energy, that a processor uses to execute a slice (or layer). The execution cost may be expressed as execution time (unit: s, ms), energy (unit: J, mJ), or the like.

Using Equations 4 and 5 below, the slice determiner 230 may assign to each of the plurality of processors the slice that the processor is to execute.

Referring further to FIG. 4, the slice determiner 230 may determine, based on Equation 4 below, the total execution cost of a slice including a plurality of consecutive layers, using the input communication cost, the output communication cost, and the execution cost of executing the consecutive layers themselves.

In addition, the slice determiner 230 may determine, based on Equation 5 below, the minimum execution cost of executing one or more slices.
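One form of Equations 4 and 5 that is consistent with the symbol definitions and with the description of Equation 5 in this section is sketched below. It is a reconstruction, not the published formulas: writing the slice's own execution cost as a plain sum of the per-layer costs e_k,δ, and the base case C_tot,0 = 0, are assumptions.

```latex
% Equation 4 (assumed form): total cost of the slice of layers m..n on processor \delta
C_{m,n,\delta} = c_{in,m,\delta} + \sum_{k=m}^{n} e_{k,\delta} + c_{out,n,\delta}

% Equation 5 (assumed form): minimum cost of executing layers 1..l
C_{tot,l} = \min_{0 \le k < l,\ \delta} \left( C_{tot,k} + C_{k+1,l,\delta} \right),
\qquad C_{tot,0} = 0
```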

Here, k may represent the index of a layer, and δ may represent the index of a processor. Accordingly, e_k,δ denotes the execution cost incurred when the δ-th processor executes the k-th layer itself; c_in,m,δ denotes the input communication cost when the δ-th processor executes the m-th layer; c_out,n,δ denotes the output communication cost when the δ-th processor executes the n-th layer; C_m,n,δ denotes the total execution cost when the δ-th processor executes the consecutive layers from the m-th to the n-th (that is, the slice including those layers); and C_tot,l denotes the minimum execution cost of executing a slice including l consecutive layers.

Here, the minimum execution cost (C_tot,l) means the execution cost incurred when, among the plurality of processors, the processor with the lowest execution cost for a slice including l consecutive layers executes that slice.

The execution cost that serves as the criterion by which the slice determiner 230 assigns a slice to a processor may be C_tot,l.

As shown in Equation 5, the minimum execution cost (C_tot,l) may mean the smallest value, over k and δ, of the sum of the minimum execution cost of the slices covering the consecutive layers up to the k-th layer and the execution cost incurred when the δ-th processor executes the consecutive layers from the (k+1)-th to the l-th.
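The recurrence described for Equation 5 lends itself to dynamic programming. The sketch below is a minimal illustration under stated assumptions, not the patented implementation: `slice_cost` assumes that a slice's total cost is the input communication cost of its first layer, plus the sum of the per-layer costs, plus the output communication cost of its last layer. The function names, the 0-based tables `e`, `c_in`, `c_out`, and all numeric values are hypothetical.

```python
from typing import List

Table = List[List[float]]  # table[layer][processor], both 0-based


def slice_cost(e: Table, c_in: Table, c_out: Table, m: int, n: int, d: int) -> float:
    """Assumed form of Equation 4: cost of layers m..n (inclusive) on processor d."""
    return c_in[m][d] + sum(e[k][d] for k in range(m, n + 1)) + c_out[n][d]


def min_total_cost(e: Table, c_in: Table, c_out: Table) -> List[float]:
    """Equation 5 as a DP: c_tot[l] is the minimum cost of the first l layers."""
    n_layers = len(e)
    n_procs = len(e[0])
    c_tot = [0.0] + [float("inf")] * n_layers  # c_tot[0] = 0 is an assumed base case
    for l in range(1, n_layers + 1):
        for k in range(l):            # the first k layers are already covered
            for d in range(n_procs):  # processor for the last slice (layers k+1..l)
                cand = c_tot[k] + slice_cost(e, c_in, c_out, k, l - 1, d)
                c_tot[l] = min(c_tot[l], cand)
    return c_tot


# Hypothetical costs for 3 layers and 2 processors (rows: layers, columns: processors).
e = [[4, 1], [4, 1], [1, 5]]
c_in = [[0, 2], [0, 2], [0, 2]]
c_out = [[0, 2], [0, 2], [0, 2]]
print(min_total_cost(e, c_in, c_out))  # prints [0.0, 4.0, 6.0, 7.0]
```

With these made-up numbers the cheapest plan runs the first two layers as one slice on processor 1 (cost 6, communication included) and the third layer on processor 0 (cost 1), for a total of 7.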

e_k,δ may be determined based on the execution performance (t_l,d,fd) and the total power consumption (P_l,d,fd), in accordance with Equation 6 below.
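Going by the description of its two cases that follows, Equation 6 can plausibly be written as below. This is a reconstruction; mapping the indices (l, d, fd) of the profiled quantities onto (k, δ, f_δ) is an assumption about notation.

```latex
e_{k,\delta} =
\begin{cases}
t_{k,\delta,f_\delta} & (1)\ \text{criterion: execution performance} \\
t_{k,\delta,f_\delta} \cdot P_{k,\delta,f_\delta} & (2)\ \text{criterion: energy consumption}
\end{cases}
```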

Part (1) of Equation 6 gives e_k,δ when the criterion for optimizing the processor unit 110 that executes the machine learning model is execution performance, and part (2) of Equation 6 gives e_k,δ when the criterion is energy consumption.

That is, when the criterion for optimizing the processor unit 110 that executes the machine learning model is execution performance, the execution cost (e_k,δ) of the δ-th processor executing the k-th layer is determined as the execution performance (t_l,d,fd). When the criterion is energy consumption, the execution cost (e_k,δ) may be determined as the product of the execution performance (t_l,d,fd) and the total power consumption (P_l,d,fd).
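The two criteria can be illustrated with a short sketch that also makes the units explicit. It assumes that the execution performance is a time in milliseconds and the total power is in watts, so their product is an energy in millijoules; the function name, the criterion labels, and the numbers are hypothetical.

```python
def layer_cost(t_ms: float, p_watt: float, criterion: str) -> float:
    """Per-layer execution cost e under the chosen optimization criterion."""
    if criterion == "latency":
        return t_ms            # cost is the execution time itself (ms)
    if criterion == "energy":
        return t_ms * p_watt   # ms * W = mJ
    raise ValueError(f"unknown criterion: {criterion}")


# Hypothetical profile of one layer on one processor: 12 ms at 2.5 W.
print(layer_cost(12.0, 2.5, "latency"))  # prints 12.0
print(layer_cost(12.0, 2.5, "energy"))   # prints 30.0
```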

As described above, the slice determiner 230 may divide the plurality of layers into one or more slices and assign each of the slices to one of the plurality of processors, based on the execution cost determined using the input communication cost (c_in,l,d,fd), output communication cost (c_out,l,d,fd), execution performance (t_l,d,fd), and total power consumption (P_l,d,fd).

More specifically, the slice determiner 230 may calculate the minimum execution cost while varying the number of consecutive layers constituting a slice and the processor executing the slice, and, based on the result, divide the plurality of layers into one or more slices and determine, from among the plurality of processors, the processor to execute each slice. The method by which the slice determiner 230 divides the layers into slices and determines the processor for each slice is described in more detail with reference to FIGS. 5 and 6.

The slice executor 240 may control each of the one or more slices generated according to the determination of the slice determiner 230 to be executed by the processor, among the plurality of processors, that was determined to execute that slice.

FIG. 5 shows one step of determining the minimum execution cost according to an embodiment of the present invention, FIG. 6 shows another step of determining the minimum execution cost according to an embodiment of the present invention, and FIG. 7 shows an example of the slices determined through the steps of FIGS. 5 and 6 and the processors that execute them.

Referring to FIGS. 5 and 6, the machine learning model input to the machine learning model performing apparatus 100 may include five layers (L1, L2, L3, L4, L5). Although FIGS. 5 and 6 describe a machine learning model with five layers for convenience of description, the present invention is not limited thereto. That is, the number of layers included in the machine learning model may vary.

The slice determiner 230 may calculate the minimum execution cost for a slice while varying the number of consecutive layers constituting the slice and the processor executing the slice, and, based on the result, divide the plurality of layers into one or more slices and determine, from among the plurality of processors, the processor to execute each slice.

First, referring to FIG. 5, assuming that the first slice S1a includes only the fifth layer L5, the slice determiner 230 may calculate the total execution cost incurred when each of the plurality of processors executes the fifth layer L5, which is the last of the five layers in execution order. Accordingly, the slice determiner 230 may determine the processor (δ) with the lowest execution cost for the first slice S1a, and the execution cost incurred when that processor executes the first slice S1a.

In addition, the slice determiner 230 may determine the minimum execution cost for a second slice S2a including the remaining layers (L1, L2, L3, L4) other than the fifth layer L5. Based on the total execution cost (C_l,l,δ) for the first slice S1a and the minimum execution cost (C_tot,l-1) for the second slice S2a, it may then determine the minimum execution cost of the entire machine learning model for the case where the first slice S1a includes only the fifth layer L5.

In the same manner, assuming that the first slice S1b includes the fourth layer L4 and the fifth layer L5, the slice determiner 230 may determine the processor (δ) to execute the first slice S1b, the total execution cost (C_l-1,l,δ) for the first slice S1b, and the minimum execution cost (C_tot,l-2) for a second slice S2b including the remaining layers (L1, L2, L3). Based on these, the slice determiner 230 may determine the minimum execution cost of the entire machine learning model for the case where the first slice S1b includes the fourth layer L4 and the fifth layer L5.

The slice determiner 230 may repeat the above process and, finally, assuming that the first slice S1c includes all of the layers (L1, L2, L3, L4, L5), determine the processor (δ) to execute the first slice S1c and the total execution cost (C_1,l,δ, covering the first through the l-th layer) for the first slice S1c. Based on these, the slice determiner 230 may determine the minimum execution cost of the entire machine learning model for the case where the first slice S1c includes all of the layers (L1, L2, L3, L4, L5). In this case, the minimum execution cost of the entire machine learning model may be equal to the total execution cost (C_1,l,δ) for the first slice S1c.

The slice determiner 230 may compare the minimum execution costs of the entire machine learning model obtained for the different numbers of consecutive layers included in the first slice (S1a, S1b, S1c; collectively S1) and, according to the comparison, determine the number of consecutive layers included in the first slice S1 and the processor (δ) that executes the first slice S1.

Referring further to FIG. 6, assume that the process described with reference to FIG. 5 has determined that the first slice S1 includes only the fifth layer L5. In this case, the slice determiner 230 may repeat the process described with reference to FIG. 5 for the other layers, excluding the fifth layer L5 assigned to the first slice S1.

That is, assuming that the third slice S3a includes the fourth layer L4, the slice determiner 230 may determine the processor (δ) to execute the third slice S3a, the total execution cost (C_l-1,l-1,δ) for the third slice S3a, and the minimum execution cost (C_tot,l-2) for a fourth slice S4a including the remaining layers (L1, L2, L3). Based on these, the slice determiner 230 may determine the minimum execution cost of the entire machine learning model for the case where the third slice S3a includes the fourth layer L4.

In the same manner, assuming that the third slice S3b includes the third layer L3 and the fourth layer L4, the slice determiner 230 may determine the processor (δ) to execute the third slice S3b, the total execution cost (C_l-2,l-1,δ) for the third slice S3b, and the minimum execution cost (C_tot,l-3) for a fourth slice S4b including the remaining layers (L1, L2). Based on these, the slice determiner 230 may determine the minimum execution cost of the entire machine learning model for the case where the third slice S3b includes the third layer L3 and the fourth layer L4.

The slice determiner 230 may repeat the above process and, finally, assuming that the third slice S3c includes all of the layers other than the first slice S1 (L1, L2, L3, L4), determine the processor (δ) to execute the third slice S3c and the total execution cost (C_1,l-1,δ, covering the first through the (l-1)-th layer) for the third slice S3c. Based on these, the slice determiner 230 may determine the minimum execution cost of the entire machine learning model for the case where the third slice S3c includes the remaining layers (L1, L2, L3, L4). In this case, the minimum execution cost of the entire machine learning model may be equal to the total execution cost (C_1,l-1,δ) for the third slice S3c.

The slice determiner 230 may compare the minimum execution costs of the entire machine learning model obtained for the different numbers of consecutive layers included in the third slice (S3a, S3b, S3c; collectively S3) and, according to the comparison, determine the number of consecutive layers included in the third slice S3 and the processor that executes the third slice S3.

The slice determiner 230 may repeat the method described with reference to FIGS. 5 and 6, that is, determining the number of consecutive layers included in a slice and the processor to execute that slice, until the slice that contains the first layer L1 has been determined. In this way, the slice determiner 230 may divide all of the layers into one or more slices and determine the processor to execute each slice.
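The back-to-front procedure of FIGS. 5 and 6 can be pictured as a walk over a table of best choices, starting at the last layer and stopping once the slice containing the first layer is fixed. The sketch below is a minimal illustration, not the patented implementation: the `choice` table, the function name, and the processor labels are hypothetical, and the table values were picked by hand to mirror the FIG. 7 example rather than computed from real costs.

```python
from typing import List, Optional, Tuple

Choice = Optional[Tuple[int, str]]


def recover_slices(choice: List[Choice]) -> List[Tuple[int, int, str]]:
    """Walk back from the last layer to the first.

    choice[l] = (k, proc) means: for the first l layers, the best last slice
    covers layers k+1..l (1-based) and runs on processor proc.
    choice[0] is unused.
    """
    slices = []
    l = len(choice) - 1      # start from the last layer
    while l > 0:
        k, proc = choice[l]
        slices.append((k + 1, l, proc))
        l = k                # continue with the layers before this slice
    slices.reverse()         # report slices in execution order
    return slices


# Hypothetical table for five layers, chosen to mirror the FIG. 7 example.
choice = [None, (0, "GPU"), (0, "GPU"), (2, "NPU"), (2, "NPU"), (4, "CPU big core")]
plan = recover_slices(choice)
print(plan)  # prints [(1, 2, 'GPU'), (3, 4, 'NPU'), (5, 5, 'CPU big core')]
```

Each tuple reads as (first layer, last layer, processor), so the printed plan places L1 and L2 on the GPU, L3 and L4 on the NPU, and L5 on the CPU big core.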

Referring further to FIG. 7, when a machine learning model (LM) that analyzes an image is input to the machine learning model performing apparatus 100, the slice determiner 230 may, according to the method described with reference to FIGS. 5 and 6, divide the layers (L1, L2, L3, L4, L5) included in the machine learning model (LM) into one or more slices and determine the processor that executes each slice.

For example, as shown in FIG. 7, the slice determiner 230 may determine that the first slice S1 includes the fifth layer L5 and is executed by a big core of the CPU, that the second slice S2 includes the third layer L3 and the fourth layer L4 and is executed by the NPU, and that the third slice S3 includes the first layer L1 and the second layer L2 and is executed by the GPU.

FIG. 8 shows an example comparing the latency of executing machine learning with the machine learning model performing apparatus according to an embodiment of the present invention against that of executing machine learning with other conventional techniques, and FIG. 9 shows an example comparing the energy efficiency of the two cases.

Referring to FIGS. 8 and 9, MOSAIC denotes execution of the machine learning model using the machine learning model performing apparatus according to an embodiment of the present invention. TF-BIG-P, TF-LITTLE-P, TF-GPU-P, and TF-NPU-P each denote execution using only a single processor (Big Core, Little Core, GPU, or NPU, respectively) fixed at its maximum frequency. TF-BIG-O, TF-LITTLE-O, TF-GPU-O, and TF-NPU-O each denote execution using only a single processor (Big Core, Little Core, GPU, or NPU, respectively) whose frequency is adjusted dynamically according to load. Exhaustive denotes execution with the optimal configuration found through extensive experiments.

Referring to FIG. 8, MOSAIC reduces inference latency by 70.3%, 86.1%, 39.1%, and 29.2% compared with TF-BIG-P, TF-LITTLE-P, TF-GPU-P, and TF-NPU-P, respectively. That is, MOSAIC can achieve a high level of latency performance by taking heterogeneity, communication, constraints, and the like into account.
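To put the reported reductions in perspective, a latency reduction of r% leaves (100 - r)% of the baseline latency, which corresponds to a speedup of 100 / (100 - r). The short sketch below applies that arithmetic to the four percentages quoted above; the function name is hypothetical, and the conversion is a reading aid rather than a result reported in the source.

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """A latency reduction of r% leaves (100 - r)% of the baseline latency."""
    return 100.0 / (100.0 - reduction_pct)


# Reductions reported for MOSAIC against each single-processor baseline (FIG. 8).
reductions = {"TF-BIG-P": 70.3, "TF-LITTLE-P": 86.1, "TF-GPU-P": 39.1, "TF-NPU-P": 29.2}
for name, r in reductions.items():
    print(f"{name}: {speedup_from_reduction(r):.2f}x")
```

The four reductions correspond to speedups of roughly 3.37x, 7.19x, 1.64x, and 1.41x.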

In addition, the latency of MOSAIC differs from that of Exhaustive by about 0.67% on average. Given that finding the optimal configuration with the Exhaustive approach requires considerable time and resources, MOSAIC is highly efficient for the performance it delivers.

Referring to FIG. 9, MOSAIC reduces inference energy by 91.0%, 80.5%, 83.4%, and 36.6% compared with TF-BIG-O, TF-LITTLE-O, TF-GPU-O, and TF-NPU-O, respectively. That is, MOSAIC can achieve a high level of energy efficiency by taking heterogeneity, communication, constraints, and the like into account.

In addition, MOSAIC differs from Exhaustive by about 0.53% on average in latency. Given that finding the optimal configuration with the Exhaustive approach requires considerable time and resources, MOSAIC is highly efficient for the performance it delivers.

FIG. 10 shows a method of dividing a plurality of layers into one or more slices and determining the processor that executes each slice, according to an embodiment of the present invention.

Referring to FIGS. 2 to 4 and 10, the profiler 210 may receive a machine learning model (S1000). For the case where each of the plurality of processors executes each of the plurality of layers included in the received model, the profiler 210 may profile the execution time and/or power consumption to generate profiling data, and may determine the size of the input tensor input to each of the plurality of layers and the size of the output tensor output from each of the plurality of layers (S1010).

The cost calculator 220 may use the profiling data, the input tensor size, and the output tensor size of each of the plurality of layers to calculate the communication cost, execution performance, and total power consumption incurred when each of the plurality of processors executes each of the plurality of layers (S1020).

The slice determiner 230 may use the input communication cost, output communication cost, execution performance, and total power consumption received from the cost calculator 220 to divide the plurality of layers into one or more slices and to determine the processor to execute each slice (S1030).

The slice executor 240 may execute each slice using the processor that was determined, according to the determination of the slice determiner 230, to execute that slice (S1040).

Combinations of the blocks of the block diagrams and the steps of the flowcharts attached to the present invention may be performed by computer program instructions. Because these computer program instructions may be loaded onto an encoding processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, the instructions executed through the encoding processor of the computer or other programmable data processing equipment create means for performing the functions described in each block of the block diagram or each step of the flowchart. Because these computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, the instructions stored in the computer-usable or computer-readable memory can produce an article of manufacture containing instruction means that perform the functions described in each block of the block diagram or each step of the flowchart. Because the computer program instructions may also be loaded onto a computer or other programmable data processing equipment, a series of operational steps may be performed on the computer or other programmable data processing equipment to create a computer-executed process, so that the instructions executed on the computer or other programmable data processing equipment provide steps for performing the functions described in each block of the block diagram and each step of the flowchart.

In addition, each block or each step may represent a module, segment, or portion of code that includes one or more executable instructions for performing the specified logical function(s). It should also be noted that in some alternative embodiments the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially simultaneously, or the blocks or steps may sometimes be performed in reverse order, depending on the function involved.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention.

100: machine learning model performing apparatus
110: processor unit
120: input/output device
130: memory
200: layer distribution model
210: profiler
220: cost calculator
230: slice determiner
240: slice executor

Claims

1. A method for performing a machine learning model in a device including a plurality of processors, the method comprising:
receiving a machine learning model including a plurality of layers;
generating profiling data by profiling at least one of the execution time and power consumption of each of the plurality of layers;
determining a size of an input tensor input to each of the plurality of layers and a size of an output tensor output from each of the plurality of layers;
calculating a first minimum execution cost of executing the plurality of layers by using the profiling data of each of the plurality of layers, the size of the input tensor, and the size of the output tensor;
dividing the plurality of layers into one or more slices based on the first minimum execution cost; and
performing the one or more slices using one or more of the plurality of processors,
wherein the first minimum execution cost is determined based on a first total execution cost and a second minimum execution cost, the first total execution cost being the cost incurred when each of the plurality of processors executes a first number of consecutive layers among the plurality of layers and being determined using the profiling data of the first number of consecutive layers, the size of the input tensor, and the size of the output tensor, and the second minimum execution cost being the minimum execution cost for the remaining layers other than the first number of consecutive layers.

2. The method of claim 1,
further comprising determining, based on the first minimum execution cost, a processor from among the plurality of processors to perform each of the plurality of layers.

The method of claim 1,
wherein calculating the first minimum execution cost comprises:
determining the first total execution cost and the second minimum execution cost;
determining a second total execution cost incurred when each of the plurality of processors executes a second number of consecutive layers among the plurality of layers, using the size of the input tensor and the size of the output tensor of the second number of consecutive layers, and a third minimum execution cost for the remaining layers other than the second number of consecutive layers; and
determining, as the first minimum execution cost, the lesser of a first sum of the first total execution cost and the second minimum execution cost and a second sum of the second total execution cost and the third minimum execution cost.
A method of performing a machine learning model.
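
For illustration only, and not as part of the claims: the recursion recited in claims 1 and 3 can be read as a dynamic program over the position at which the first slice ends. The sketch below is one possible reading. The function name, the argument layout (layer_cost[p][i] as the profiled cost of layer i on processor p), and the cost model in which tensor movement is proportional to tensor size via transfer_cost_per_unit are all assumptions made for the example; the claims do not specify them.

```python
from functools import lru_cache


def min_execution_cost(layer_cost, in_size, out_size, transfer_cost_per_unit=0.0):
    """Return (minimum cost, plan) for running all layers in order.

    layer_cost[p][i] is the profiled cost of layer i on processor p.
    plan is a tuple of (start, end, processor) slices covering layers[start:end].
    All names and the data layout are hypothetical.
    """
    n = len(in_size)
    processors = range(len(layer_cost))

    def slice_cost(start, end):
        # Assumed cost model: tensor movement at the slice boundaries is
        # proportional to the input size of the first layer and the output
        # size of the last layer in the slice.
        boundary = transfer_cost_per_unit * (in_size[start] + out_size[end - 1])
        # Cheapest processor for this run of consecutive layers.
        return min((sum(layer_cost[p][start:end]) + boundary, p) for p in processors)

    @lru_cache(maxsize=None)
    def best(start):
        # Minimum cost of executing layers[start:], with the slices that achieve it.
        if start == n:
            return 0.0, ()
        candidates = []
        for end in range(start + 1, n + 1):
            total, proc = slice_cost(start, end)   # total cost of the first slice
            rest, rest_plan = best(end)            # minimum cost of the remaining layers
            candidates.append((total + rest, ((start, end, proc),) + rest_plan))
        # The smallest sum becomes the minimum execution cost from this position.
        return min(candidates)

    return best(0)
```

Under these assumptions, a higher transfer cost tends to produce fewer and longer slices, because every slice boundary adds tensor movement, while a transfer cost of zero lets each layer run on whichever processor is cheapest for it.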

4. The method of claim 3,
wherein the first total execution cost is
determined using the size of the input tensor of the first-executed layer among the first number of consecutive layers, the size of the output tensor of the last-executed layer among the first number of consecutive layers, and the execution cost of the first number of consecutive layers themselves.
A method of performing a machine learning model.

5. The method of claim 4,
wherein the execution cost of the first number of consecutive layers themselves is determined based on a criterion for optimizing the performance of the processor.
A method of performing a machine learning model.

6. The method of claim 5,
wherein, when the criterion for optimizing the performance of the processor is execution performance, the execution cost of the first number of consecutive layers themselves is determined according to the execution performance of the first number of consecutive layers, and
when the criterion for optimizing the performance of the processor is energy consumption, the execution cost of the first number of consecutive layers themselves is determined according to the execution performance and the total power consumption of the first number of consecutive layers.
A method of performing a machine learning model.

7. The method of claim 6,
wherein the execution performance and the total power consumption are determined based on an operating frequency of a processor executing each of the first number of consecutive layers.
A method of performing a machine learning model.
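
For illustration only, and not as part of the claims: claims 5 to 7 make the cost of a run of consecutive layers depend on the optimization criterion and on the operating frequency of the processor. The sketch below assumes a hypothetical profiling table that maps each operating frequency to per-layer (time in seconds, power in watts) measurements. It further assumes that the cost under the time criterion is the summed execution time, and that the cost under the energy criterion is the summed product of time and power. The claims state only that these quantities are used, not how they are combined.

```python
def slice_self_cost(profile, criterion):
    """Return (cost, frequency) for the cheapest operating frequency.

    profile maps a frequency in MHz to a list of (time_s, power_w) tuples,
    one per layer in the slice. The layout and names are hypothetical.
    """
    def cost_at(samples):
        if criterion == "time":
            # Assumed: cost is the total execution time of the layers.
            return sum(t for t, _ in samples)
        if criterion == "energy":
            # Assumed: cost is energy, taken as time multiplied by power per layer.
            return sum(t * p for t, p in samples)
        raise ValueError(f"unknown criterion: {criterion!r}")

    # Evaluate every profiled frequency and keep the cheapest one.
    return min((cost_at(samples), freq) for freq, samples in profile.items())
```

With such a table, the two criteria can select different frequencies for the same slice: a higher frequency may shorten execution time while raising power enough to increase the energy consumed.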

A machine learning model execution device for performing a machine learning model including a plurality of layers, the device comprising:
an input/output device for receiving the machine learning model; and
a processor unit that includes a plurality of processors and controls the input/output device,
wherein the processor unit
generates profiling data by profiling at least one of the execution time and power consumption of each of the plurality of layers,
determines a size of an input tensor input to each of the plurality of layers and a size of an output tensor output from each of the plurality of layers,
calculates a first minimum execution cost when executing the plurality of layers using the profiling data of each of the plurality of layers, the size of the input tensor, and the size of the output tensor,
divides the plurality of layers into one or more slices based on the first minimum execution cost, and
performs the one or more slices by using one or more processors among the plurality of processors,
wherein the first minimum execution cost is
determined based on a first total execution cost and a second minimum execution cost, the first total execution cost being the cost incurred when each of the plurality of processors executes a first number of consecutive layers among the plurality of layers, determined using the profiling data, the size of the input tensor, and the size of the output tensor of the first number of consecutive layers, and the second minimum execution cost being the minimum execution cost for the remaining layers other than the first number of consecutive layers.
Machine learning model execution device.

9. The device of claim 8,
wherein the processor unit
determines, based on the first minimum execution cost, a processor from among the plurality of processors to perform each of the plurality of layers.
Machine learning model execution device.

10. The device of claim 8,
wherein the processor unit
determines the first total execution cost and the second minimum execution cost,
determines a second total execution cost incurred when each of the plurality of processors executes a second number of consecutive layers among the plurality of layers, using the size of the input tensor and the size of the output tensor of the second number of consecutive layers, and a third minimum execution cost for the remaining layers other than the second number of consecutive layers, and
determines, as the first minimum execution cost, the lesser of a first sum of the first total execution cost and the second minimum execution cost and a second sum of the second total execution cost and the third minimum execution cost.
Machine learning model execution device.

11. The device of claim 10,
wherein the first total execution cost is
determined using the size of the input tensor of the first-executed layer among the first number of consecutive layers, the size of the output tensor of the last-executed layer among the first number of consecutive layers, and the execution cost of the first number of consecutive layers themselves.
Machine learning model execution device.

12. The device of claim 11,
wherein the execution cost of the first number of consecutive layers themselves is determined based on a criterion for optimizing the performance of the processor.
Machine learning model execution device.

13. The device of claim 12,
wherein, when the criterion for optimizing the performance of the processor is execution performance, the execution cost of the first number of consecutive layers themselves is determined according to the execution performance of the first number of consecutive layers, and
when the criterion for optimizing the performance of the processor is energy consumption, the execution cost of the first number of consecutive layers themselves is determined according to the execution performance and the total power consumption of the first number of consecutive layers.
Machine learning model execution device.

14. The device of claim 13,
wherein the execution performance and the total power consumption are determined based on an operating frequency of a processor executing each of the first number of consecutive layers.
Machine learning model execution device.

A computer-readable recording medium storing a computer program,
wherein the computer program
comprises instructions for causing a processor to perform the method according to any one of claims 1 to 7.
Computer-readable recording medium.

A computer program stored in a computer-readable recording medium,
wherein the computer program
comprises instructions for causing a processor to perform the method according to any one of claims 1 to 7.
Computer program.