KR20230076075A

KR20230076075A - Method and apparatus for optimal layer-wise execution policy prediction based on supervised learning method in systolic array based npu environment for inference

Info

Publication number: KR20230076075A
Application number: KR1020220043236A
Authority: KR
Inventors: 박영준; 유용승; 이영현
Original assignee: 한양대학교 산학협력단
Priority date: 2021-11-23
Filing date: 2022-04-07
Publication date: 2023-05-31

Abstract

Disclosed are a method and device for predicting an optimal execution policy for each layer based on a supervised learning technique in an NPU environment for inference with a systolic array structure. A method for predicting execution policy for each layer includes the steps of: analyzing kernel parameters for each layer of a target deep neural network (DNN) application and the available resources of the target NPU; searching for an optimal factor for each situation or each layer through a model learned based on the layer information of the target DNN application and the target NPU information; and applying an optimal execution policy predicted based on the optimal factor to DNN application runtime.

Description

Method and Apparatus for Predicting Optimal Layer-by-Layer Execution Policy Based on Supervised Learning Technique in Neural Network Operation Device Environment for Inference of Systolic Array Structure

The present invention relates to a method for predicting an optimal layer-wise execution policy based on a supervised learning technique in a neural processing unit (NPU) environment for inference having a systolic array structure, and to a framework to which the method is applied.

Some prior art notes that, in deep-learning-based applications, the optimal execution policy for a layer may differ depending on the structure of the target hardware and the parameters of each layer constituting the application.

However, although parameters such as the dataflow and the clustering factor can be set differently for each layer of a deep learning application in a systolic-array-based NPU environment, no suitable scheme for applying the optimal parameters on a per-layer basis has been proposed so far.

An object of the present invention is to provide a supervised-learning-based method and apparatus for predicting an optimal layer-wise execution policy, capable of predicting and applying, for each layer of a neural network, the various execution policies presented in the prior art in accordance with the target environment of an inference NPU having a systolic array structure.

Another object of the present invention is to provide a layer-wise execution policy prediction method and apparatus that, when running a deep learning application in an NPU environment, can predict the optimal per-layer policy through supervised learning in various environments, without being limited to a single NPU or a single application.

Still another object of the present invention is to provide a layer-wise execution policy prediction method and application framework that can flexibly apply the optimal layer-wise execution policy in real time to a deep learning application in an NPU environment.

To solve the above technical problem, a supervised-learning-based layer-wise execution policy prediction method according to one aspect of the present invention is a method for predicting an optimal layer-wise execution policy based on supervised learning in an NPU environment for inference having a systolic array structure, the method comprising: analyzing the kernel parameters of each layer of a target deep neural network (DNN) application and the available resources of a target NPU; searching, through a model trained on the layer information of the target DNN application and the target NPU information, for an optimal factor, which is the factor exhibiting the best performance for each situation or each layer; and applying an optimal execution policy predicted on the basis of the optimal factor to the DNN application at runtime.

In one embodiment, the analyzing step may include obtaining kernel features including the layer configuration of the DNN kernels, the arrangement of the layers, and the neural network parameters.

In one embodiment, the analyzing step may include obtaining NPU features concerning the processing element (PE) resources of the target NPU, or the memory limits of its on-chip scratchpad memory and off-chip memory.

In one embodiment, the step of searching for the optimal factor for each layer may include deriving, for each DNN layer, a tiling factor, a clustering factor, and a dataflow.
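To make the three per-layer factors concrete, the following minimal sketch models one execution policy as a small record and enumerates the candidate combinations. The class name `ExecutionPolicy`, the helper `candidate_policies`, and the example factor values are illustrative assumptions, not names or values taken from the disclosure.

```python
from dataclasses import dataclass
from itertools import product

# Dataflow labels used in this document: output, weight, and input stationary.
DATAFLOWS = ("OS", "WS", "IS")


@dataclass(frozen=True)
class ExecutionPolicy:
    """Hypothetical record holding the three factors derived for one DNN layer."""
    tiling_factor: int
    clustering_factor: int
    dataflow: str


def candidate_policies(tiling_factors, clustering_factors, dataflows=DATAFLOWS):
    """Enumerate every combination of tiling factor, clustering factor, and dataflow."""
    return [
        ExecutionPolicy(t, c, d)
        for t, c, d in product(tiling_factors, clustering_factors, dataflows)
    ]


# Example factor ranges are made up for illustration.
policies = candidate_policies([1, 2, 4], [1, 2])
print(len(policies))  # 3 tilings * 2 clusterings * 3 dataflows = 18
```

Under these assumptions, the search step amounts to picking one element of this candidate set for each layer.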

In one embodiment, the step of applying to the DNN application runtime may include: extracting the matrix multiplication and convolution layers of the target DNN application; predicting an optimal policy set based on the target NPU and the layer features; performing compilation by applying the policy set derived for each layer; and running the DNN application through the compiled layers.
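The runtime steps can be sketched as a short pipeline. This is only an outline under stated assumptions: `Layer`, `apply_layerwise_policies`, and the stand-in `predict_policy` and `compile_layer` callables are hypothetical placeholders for the trained model and the NPU compiler, and the final step of actually running the compiled layers is left out.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical description of one layer of a DNN application."""
    name: str
    op: str       # e.g. "gemm", "conv", "relu"
    shape: tuple


TARGET_OPS = ("gemm", "conv")  # only matrix multiplication and convolution layers


def extract_target_layers(layers):
    """Step 1: keep the matrix multiplication and convolution layers."""
    return [layer for layer in layers if layer.op in TARGET_OPS]


def apply_layerwise_policies(layers, npu_features, predict_policy, compile_layer):
    """Steps 2 and 3: predict a policy set per layer, then compile with it."""
    compiled = []
    for layer in extract_target_layers(layers):
        policy = predict_policy(layer, npu_features)
        compiled.append(compile_layer(layer, policy))
    return compiled


# Stand-ins for the trained predictor and the compiler (assumptions).
def predict_policy(layer, npu_features):
    dataflow = "OS" if layer.op == "conv" else "WS"
    return {"tiling": 2, "clustering": 1, "dataflow": dataflow}


def compile_layer(layer, policy):
    return (layer.name, policy["dataflow"])


model = [
    Layer("conv1", "conv", (3, 3, 256, 256)),
    Layer("relu1", "relu", ()),
    Layer("fc1", "gemm", (2048, 256, 1600)),
]
npu = {"pe_rows": 32, "pe_cols": 32}
compiled = apply_layerwise_policies(model, npu, predict_policy, compile_layer)
print(compiled)  # [('conv1', 'OS'), ('fc1', 'WS')]
```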

In one embodiment, the supervised-learning-based layer-wise execution policy prediction method may further include, before the searching step, training a prediction model through a neural network composed of fully connected layers, based on data obtained through a brute-force search.

In one embodiment, the training step may include: classifying the operation features or kernel features and the NPU features used for training the prediction model; building an artificial neural network composed of fully connected layers; and training the prediction model through a DNN based on a synthetic data set and the classified feature set.
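As a rough illustration of training a fully connected network on labeled features, the sketch below fits a tiny one-hidden-layer classifier with plain stochastic gradient descent. Everything here is an assumption made for illustration: the two-element feature vectors, the three labels standing for dataflows, the network size, and the hyperparameters are invented and do not come from the disclosure.

```python
import math
import random


def train_mlp(samples, labels, n_classes, hidden=8, epochs=500, lr=0.2, seed=0):
    """Train a 1-hidden-layer fully connected classifier; return a predict function."""
    rng = random.Random(seed)
    n_in = len(samples[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(n_classes)]
    b2 = [0.0] * n_classes

    def forward(x):
        # Hidden layer with tanh, output layer with a numerically stable softmax.
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(w1, b1)]
        z = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(w2, b2)]
        top = max(z)
        e = [math.exp(v - top) for v in z]
        total = sum(e)
        return h, [v / total for v in e]

    for _ in range(epochs):
        for x, y in zip(samples, labels):
            h, p = forward(x)
            # Gradient of cross-entropy with respect to the output logits.
            dz = [p[k] - (1.0 if k == y else 0.0) for k in range(n_classes)]
            # Backpropagate to the hidden layer before updating w2.
            dh = [sum(dz[k] * w2[k][j] for k in range(n_classes)) * (1 - h[j] ** 2)
                  for j in range(hidden)]
            for k in range(n_classes):
                for j in range(hidden):
                    w2[k][j] -= lr * dz[k] * h[j]
                b2[k] -= lr * dz[k]
            for j in range(hidden):
                for i in range(n_in):
                    w1[j][i] -= lr * dh[j] * x[i]
                b1[j] -= lr * dh[j]

    def predict(x):
        probs = forward(x)[1]
        return probs.index(max(probs))

    return predict


# Made-up normalized features (e.g. layer size, PE count) and dataflow labels
# (0 = OS, 1 = WS, 2 = IS); in the described method the labels would come from
# the brute-force search.
samples = [(0.0, 0.0), (0.1, 0.1), (1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels = [0, 0, 1, 1, 2, 2]
predict = train_mlp(samples, labels, n_classes=3)
print([predict(x) for x in samples])
```

A practical implementation would more likely use an established deep learning library; the pure-Python version is kept only so the sketch is self-contained.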

In one embodiment, the supervised-learning-based layer-wise execution policy prediction method may further include, before the training step, performing a brute-force search to obtain data for training the model through supervised learning.

In one embodiment, the brute-force search may be performed over the range of a data set having all of the correct answers usable for training the prediction model.
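A brute-force search of this kind can be sketched as exhaustively scoring every factor combination for each layer and keeping the best one as the training label. The function names and the cost model `toy_cost` below are hypothetical stand-ins; a real flow would measure or simulate each configuration on the target NPU.

```python
from itertools import product


def brute_force_best(layer, npu, tilings, clusterings, dataflows, cost):
    """Score every (tiling, clustering, dataflow) combination and keep the cheapest."""
    return min(
        product(tilings, clusterings, dataflows),
        key=lambda policy: cost(layer, npu, *policy),
    )


def toy_cost(layer, npu, tiling, clustering, dataflow):
    """Made-up cost model used only to make the example runnable."""
    passes = -(-layer["m"] // (npu["pe_rows"] * tiling))  # ceiling division
    dataflow_penalty = {"OS": 0, "WS": 1, "IS": 2}[dataflow]
    return 10 * passes + tiling + clustering + dataflow_penalty


layers = [{"m": 64}, {"m": 8}]
npu = {"pe_rows": 8}

# Each entry pairs a layer description with its best policy (the "correct answer").
dataset = [
    (layer, brute_force_best(layer, npu, [1, 2, 4], [1, 2], ["OS", "WS", "IS"], toy_cost))
    for layer in layers
]
print(dataset[0][1])  # (4, 1, 'OS')
print(dataset[1][1])  # (1, 1, 'OS')
```

The resulting pairs play the role of labeled samples for supervised training.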

To solve the above technical problem, a supervised-learning-based layer-wise execution policy prediction apparatus according to another aspect of the present invention is an apparatus for predicting an optimal layer-wise execution policy based on supervised learning in an NPU environment for inference having a systolic array structure, the apparatus including: a processor; and a memory storing at least one instruction executed by the processor. The at least one instruction may configure the processor to perform: analyzing the kernel parameters of each layer of a target deep neural network (DNN) application and the available resources of a target NPU; searching, through a model trained on the layer information of the target DNN application and the target NPU information, for an optimal factor, which is the factor exhibiting the best performance for each situation or each layer; and applying an optimal execution policy predicted on the basis of the optimal factor to the DNN application at runtime.

In one embodiment, the processor may be configured to obtain, in the analyzing step, kernel features including the layer configuration of the DNN kernels, the arrangement of the layers, and the neural network parameters.

In one embodiment, the processor may be configured to obtain, in the analyzing step, NPU features concerning the processing element (PE) resources of the target NPU, or the memory limits of its on-chip scratchpad memory and off-chip memory.

In one embodiment, the processor may be configured to derive, in the step of searching for the optimal factor for each layer, a tiling factor, a clustering factor, and a dataflow for each DNN layer.

In one embodiment, the processor may be configured to perform, in the step of applying to the DNN application runtime: extracting the matrix multiplication and convolution layers of the target DNN application; predicting an optimal policy set based on the target NPU and the layer features; performing compilation by applying the policy set derived for each layer; and running the DNN application through the compiled layers.

In one embodiment, the processor of the apparatus may be configured to further perform, before the searching step, training a prediction model through a neural network composed of fully connected layers, based on data obtained through a brute-force search.

In one embodiment, the processor may be configured to perform, in the training step: classifying the operation features or kernel features and the NPU features used for training the prediction model; building an artificial neural network composed of fully connected layers; and training the prediction model through a DNN based on a synthetic data set and the classified feature set.

In one embodiment, the processor of the apparatus may be configured to further perform, before the training step, a brute-force search over the range of a data set having all of the correct answers usable for training the prediction model, in order to train the model through supervised learning.

According to the present invention, the optimal execution policy for a given neural processing unit (NPU) and deep neural network (DNN) application layer can be predicted in real time and applied according to the situation.

In addition, according to the present invention, the optimal layer-wise execution policy can be derived in various environments without being limited to a single NPU or a single application. That is, the invention can provide an effective solution that overcomes the limitation of the prior art, in which the execution policy of an NPU or DNN cannot be changed flexibly according to the target environment.

In addition, according to the present invention, a solution can be provided that is usable as an optimization technique for maximizing performance, such as throughput per unit time, in applications such as object recognition and real-time license plate recognition.

FIG. 1 is an exemplary diagram showing performance differences according to the execution policy in actual deep neural network (DNN) applications, in relation to optimization of a neural processing unit (NPU) environment according to an embodiment of the present invention.
FIGS. 2A and 2B are overall structural diagrams illustrating a technique for predicting an optimal layer-wise execution policy based on supervised learning, and the framework applying it, in an NPU environment having a systolic array (SA) structure according to an embodiment of the present invention.
FIG. 3 is a flowchart schematically showing the overall execution sequence of the framework of FIG. 2A.
FIG. 4 is a flowchart explaining the detailed procedure of the brute-force search process within the overall execution sequence of the framework of FIG. 3.
FIG. 5 is a flowchart explaining the detailed procedure of the prediction model training process within the overall execution sequence of the framework of FIG. 3.
FIG. 6 is a flowchart explaining the detailed procedure of the process of applying the result to a DNN application within the overall execution sequence of the framework of FIG. 3.
FIG. 7 is an exemplary diagram explaining a DNN kernel that can be employed in the framework of FIG. 2A.
FIG. 8 is an exemplary diagram explaining a target NPU that can be employed in the framework of FIG. 2A.
FIG. 9 is an exemplary diagram explaining an MLP model training process that can be employed in the framework of FIG. 2A and the optimal parameter configuration resulting from this process.
FIG. 10 is a schematic block diagram of the main components that can be employed in a framework applying a supervised-learning-based optimal layer-wise execution policy prediction technique in an SA-structured NPU environment according to another embodiment of the present invention.

Since the present invention can be modified in various ways and can have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to the specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope. Like reference numerals are used for like elements in describing each drawing.

Terms such as first, second, A, and B may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be termed a second component, and similarly, a second component may be termed a first component. The term 'and/or' includes a combination of a plurality of related listed items or any one of the plurality of related listed items.

In the embodiments of the present application, 'at least one of A and B' may mean 'at least one of A or B' or 'at least one of combinations of one or more of A and B'. Likewise, 'one or more of A and B' may mean 'one or more of A or B' or 'one or more of combinations of one or more of A and B'.

When a component is referred to as being 'connected' or 'coupled' to another component, it may be directly connected or coupled to the other component, but it should be understood that other components may exist in between. In contrast, when a component is referred to as being 'directly connected' or 'directly coupled' to another component, it should be understood that no other component exists in between.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as 'comprise' or 'have' are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art and, unless explicitly defined in this application, are not to be interpreted in an idealized or excessively formal sense.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary diagram showing performance differences according to the execution policy in actual deep neural network (DNN) applications, in relation to optimization of an NPU environment according to an embodiment of the present invention.

That is, FIG. 1 quantitatively shows the performance gain that can be expected when the NPU environment according to this embodiment is well optimized. FIG. 1 also shows, for two DNN applications, S-CNN (sentimental convolutional neural network) and Transformer, the performance differences among three execution policies: output stationary (OS), weight stationary (WS), and input stationary (IS). The results for the other two execution policies, WS and IS, are normalized to those of the OS execution policy.
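The normalization used in FIG. 1 can be illustrated with a short helper that divides each policy's measured value by the OS value. The function name and the numbers below are made up for illustration; they are not results from the disclosure.

```python
def normalize_to_baseline(values, baseline="OS"):
    """Express each policy's value relative to the baseline policy."""
    base = values[baseline]
    return {policy: value / base for policy, value in values.items()}


# Made-up measurements for one layer under the three execution policies.
measured = {"OS": 2.0, "WS": 4.0, "IS": 1.0}
print(normalize_to_baseline(measured))  # {'OS': 1.0, 'WS': 2.0, 'IS': 0.5}
```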

Output stationary (OS) is a structure designed to minimize the energy consumed in reading and writing partial sums by reusing outputs. Weight stationary (WS) is a structure that increases reuse by keeping the filter values, or weights, in place. That is, WS minimizes fetching weights from the register file, thereby maximizing convolutional and filter reuse and minimizing the energy consumed in reading weights. Input stationary (IS) is the opposite of WS: the input feature map (fmap) values are stored in the array, and the weights are passed to the array to perform the convolution operation.
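The difference between the three dataflows can be illustrated in software by the loop order of a matrix multiplication, that is, by which operand is held fixed in the innermost loop. This is only an analogy under the assumption that the layer has been lowered to a matrix product; it does not model the timing or wiring of a systolic array, and the function names are illustrative.

```python
def matmul_os(a, b):
    """Output stationary: one output accumulator stays in place while operands stream by."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0  # partial sum kept local until complete
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def matmul_ws(a, b):
    """Weight stationary: each weight is fetched once and reused for every input row."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for p in range(k):
        for j in range(n):
            weight = b[p][j]  # held fixed in the inner loop
            for i in range(m):
                c[i][j] += a[i][p] * weight
    return c


def matmul_is(a, b):
    """Input stationary: each input value is fetched once and reused for every weight."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            value = a[i][p]  # held fixed in the inner loop
            for j in range(n):
                c[i][j] += value * b[p][j]
    return c


inputs = [[1, 2], [3, 4]]
weights = [[5, 6], [7, 8]]
print(matmul_os(inputs, weights))  # [[19, 22], [43, 50]]
```

All three orders compute the same result; they differ only in which data is reused locally, which is what makes one or another preferable for a given layer shape and hardware.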

In this embodiment, the S-CNN model may correspond to one example of a DNN application consisting of 4 convolution layers, and the Transformer model may correspond to another example of a DNN application consisting of 10 convolution layers.

Here, the S-CNN model refers to a sentiment classification convolutional neural network (Sentimental CNN), a model that uses a convolutional neural network to classify information expressing human emotional states collected through social network services (SNS) and the like. The S-CNN model converts lexical expressions indicating positive and negative sentiment in large volumes of input data into data that the model can recognize, and is trained on that data.

The Transformer model does not use a recurrent neural network (RNN), but it may be configured to retain the encoder-decoder structure of the existing seq2seq (sequence-to-sequence) model, in which the encoder receives an input sequence and the decoder outputs an output sequence. In addition, whereas in the existing seq2seq structure a single RNN in each of the encoder and the decoder has t time steps, the Transformer model may have a structure composed of N encoder and decoder units (10 in this embodiment).

According to this embodiment, which of the three execution policies is optimal may differ from one convolution layer to another. In some cases, the performance difference between execution policies is close to 16 times or 30 times.

Accordingly, in this embodiment, among the several execution policies applicable to an inference NPU environment having a systolic array structure, the optimal execution policy for the target environment is effectively derived using deep learning and then applied, which is expected to maximize the performance of the NPU and the application in various situations.

Meanwhile, the prediction method and application framework of this embodiment can of course be applied not only to the three execution policies described above but to any execution policy applicable to an inference NPU environment having a systolic array structure, such as no local reuse (NLR), row stationary (RS), tiling, and clustering.

Here, NLR has a structure that reduces accesses to memory such as DRAM by transferring data directly from a global buffer to the processing element (PE) array.

Row stationary (RS) is a structure that maximizes reuse across several kinds of data, such as weights, pixels, and partial sums, whereas weight stationary or output stationary optimizes only weights and partial sums. That is, RS places as many input feature map (fmap) values as the operation needs in a specific part of local memory and computes the partial sum corresponding to an output feature map (output fmap) position; the same weights are used, that is, reused, throughout the RS operation, and at each step only one value is stored in the corresponding part of local memory. This structure allows computation with a small local memory and high data reuse, and as a result it is the most energy-efficient of the execution policies.

Tiling is a structure in which a portion of at least one of the classifier layers, convolution layers, and pooling layers is cut into tiles for computation, so that values fetched from memory are reused as much as possible.
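The idea of tiling can be sketched with a blocked matrix multiplication, in which each small block of operands is reused for a whole output tile before moving on. The function name and the use of a single square tile size are assumptions made for illustration; how a tiling factor maps onto the NPU is not specified here.

```python
def tiled_matmul(a, b, tile):
    """Multiply a (m x k) by b (k x n) one tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):          # tile over output rows
        for j0 in range(0, n, tile):      # tile over output columns
            for p0 in range(0, k, tile):  # tile over the reduction dimension
                # Work entirely inside one tile so fetched values are reused.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[1, 0], [0, 1], [1, 1]]
print(tiled_matmul(a, b, tile=2))  # [[4, 5], [10, 11]]
```

The result is independent of the tile size; only the order of memory accesses changes.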

클러스터링은 기본적으로 특정 기준에 따라 유사한 데이터 사례들을 하나의 세트로 그룹화하는 것을 말한다. 클러스터링은 본래의 데이터 공간을 저차원의 공간으로 매핑하면서 가능한 많은 정보를 보존하는 비지도학습 방식을 지칭할 수 있다. 본 실시예에서 클러스터링은 지도학습기법과 결합된 형태로 적용될 수 있다.Clustering basically refers to grouping similar data instances into a set according to certain criteria. Clustering may refer to an unsupervised learning method that preserves as much information as possible while mapping the original data space to a low-dimensional space. In this embodiment, clustering may be applied in a form combined with a supervised learning method.

아래의 설명에서, 실행 정책은 간략히 정책으로 지칭되거나, 팩터(factor)로 대체될 수 있다. 예를 들어, 전술한 OS 정책, WS 정책 및 IS 정책은 기재된 순서대로 OS 팩터, WS 팩터 및 IS 팩터로 지칭될 수 있다.In the description below, the execution policy may be referred to as a policy for short, or may be replaced with a factor. For example, the aforementioned OS policy, WS policy, and IS policy may be referred to as an OS factor, a WS factor, and an IS factor in the order in which they are described.

FIGS. 2A and 2B are overall structural diagrams illustrating a method for predicting an optimal layer-wise execution policy based on a supervised learning technique in an NPU environment with a systolic array (SA) structure, and a framework that applies it, according to an embodiment of the present invention. FIG. 3 is a flowchart schematically showing the overall execution sequence of the framework of FIG. 2A.

Referring to FIGS. 2A, 2B, and 3, the detailed operations of the layer-wise execution policy prediction method may proceed in the following order.

First, the kernel parameters of each layer of various DNN applications and the available resources of the target NPU are analyzed.

This step analyzes all per-layer kernel parameters available to DNN applications and stores the corresponding kernel parameter information, so that the per-layer kernel parameters applicable to the target NPU can be set in consideration of the target NPU's available resources.

In addition, to analyze the available resources of the target NPU, this step may be performed by having the layer-wise execution policy prediction device read specific information recorded in the memory of the target NPU connected to it, or by having the target NPU transfer the specific information to the prediction device according to a preset procedure when the two devices are connected.

The per-layer kernels include deep learning (DL) kernels, and the DL kernels may include a general matrix multiplication (GEMM) kernel, a convolution (CONV) kernel, and the like. The GEMM kernel may be implemented as the operation C = A*B + C, where A, B, and C denote matrices. For example, the GEMM kernel may consist of three main parts: data movers, transpose and buffers, and a systolic array. The GEMM kernel may have a size corresponding to a GEMM layer with a size such as 2048 × 256 × 1600. The CONV kernel extracts features by weighting each element of a convolution layer, and may have a size such as 3 × 5 × 5 for a convolution layer with a size such as 3 × 3 × 256 × 256.
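The GEMM form C = A*B + C can be illustrated with a plain reference loop. This is only a sketch of the arithmetic, assuming matrices stored as lists of rows; the function name `gemm` is illustrative, and the code does not model the data movers, buffers, or systolic array mentioned above.

```python
def gemm(a, b, c):
    """Return A*B + C for matrices given as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "A must be m x k"
    assert len(c) == m and all(len(row) == n for row in c), "C must be m x n"
    out = [row[:] for row in c]  # start from C so the product accumulates onto it
    for i in range(m):
        for j in range(n):
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]
    return out


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 0], [0, 1]]
result = gemm(a, b, c)  # the input C is left unmodified
```

An m × k by k × n product performs m·n·k multiply-accumulate steps, which is why the layer dimensions are natural inputs for a policy predictor.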

Next, a brute-force search is performed to train a model with the supervised learning technique (S40).

Supervised learning is a machine learning technique for classification problems: a model is trained on a large number of classification results obtained from prior knowledge and then classifies new data. In this embodiment, supervised learning is used as the approach for predicting the optimal execution policy for a given target environment. The NPU structure and the parameters of the DNN application are defined as features, and the available execution policies are defined as classification labels; the optimal labels derived for various features are turned into training data and used to train the model. The trained model then predicts the optimal execution policy in various NPU and DNN application environments.
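The feature and label definitions can be sketched as follows. This is a hypothetical encoding, not one specified by the patent: the candidate values, the helper names `make_features` and `label_index`, and the NPU figures (64 PEs, 512 KiB of scratchpad) are all assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical label space: every (tiling, clustering, dataflow) combination.
TILINGS = [(8, 8), (4, 16), (16, 8)]
CLUSTERINGS = [2, 4, 8]
DATAFLOWS = ["OS", "WS", "IS"]
LABELS = list(product(TILINGS, CLUSTERINGS, DATAFLOWS))


def make_features(layer_shape, num_pes, scratchpad_kib):
    """Concatenate layer parameters and NPU parameters into one feature vector."""
    return [float(v) for v in (*layer_shape, num_pes, scratchpad_kib)]


def label_index(tiling, clustering, dataflow):
    """Map an execution policy to its classification label (class index)."""
    return LABELS.index((tiling, clustering, dataflow))


# One training example: a GEMM layer shape on an assumed NPU, with its best policy.
x = make_features((2048, 256, 1600), num_pes=64, scratchpad_kib=512)
y = label_index((4, 16), 8, "WS")
```

Treating each complete policy combination as one class keeps the problem a single multi-class classification; predicting each factor with a separate output would be an equally plausible design.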

Brute-force search, also called exhaustive search, is the simplest approach: it finds the result by exploring every possible case or every possible path. In this embodiment, the brute-force search may be performed over the range of the data set holding all correct answers usable for training the prediction model.

Next, based on the data obtained through the brute-force search, the prediction model is trained through a neural network composed of fully-connected layers (FCL) (S50).

Prediction model training (S50) may include multilayer perceptron (MLP) model training 50. In this case, prediction model training (S50) may be configured to perform, based on the supervised learning technique, first learning 30 that infers a first policy set containing the execution policy of each convolution (CONV) layer, or second learning 40 that infers a second policy set containing the execution policy of each GEMM layer.

Next, through the model trained on the layer information of the target DNN application and the information of the target NPU 20, an optimal factor is searched for each situation or each layer (optimal factor search), the optimal execution policy is predicted, and the predicted optimal execution policy may be applied dynamically to the DNN application runtime (S60).

That is, in the application to DNN applications (S60), the trained model 80 predicts a policy set for each target layer (CONV #0, CONV #1, GEMM #0) on each of a plurality of NPUs (NPU #0, NPU #1, NPU #2), and the predicted policy sets may be applied to the DNNs to achieve optimal DNN execution.

Here, the target NPU 20 may include dynamic random access memory (DRAM) and a plurality of processing element (PE) clusters (PE cluster 0, PE cluster 1, PE cluster 2) connected to the DRAM. The target layers may include the remaining layers of real-world DNNs 70, such as the input layer, convolution layers, and pooling layers, excluding output layers that include softmax or a fully-connected layer (FCL).

FIG. 4 is a flowchart explaining the detailed procedure of the brute-force search process within the overall execution sequence of the framework of FIG. 3.

Referring to FIG. 4, in the brute-force search process, a synthetic data set composed of matrix multiplication operations and convolution operations is first defined (S41).

Next, all available policy sets applicable to the NPU environment are classified (S42).

Next, the NPU executes the operations of the synthetic data set under every available policy (S43).

Next, based on the kernel features or NPU features derived from the NPU runs, the synthetic data set is restructured into training data for the neural network (S44).
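Steps S41 to S44 can be sketched as a small labeling loop. The sketch assumes a stand-in cost function, `toy_cost`, in place of real NPU measurements; its formula is invented purely so that the example runs and is not a performance model from the patent. All function and variable names are illustrative.

```python
from itertools import product


def ceil_div(a, b):
    return -(-a // b)


def toy_cost(shape, policy, num_pes):
    """Stand-in for a measured NPU run; NOT a real performance model."""
    m, n, k = shape
    (tm, tn), clusters, dataflow = policy
    tiles = ceil_div(m, tm) * ceil_div(n, tn)
    pes_per_cluster = num_pes // clusters
    waves = ceil_div(tiles, clusters)  # rounds needed to cover all tiles
    penalty = {"OS": 1.0, "WS": 1.1, "IS": 1.2}[dataflow]
    return waves * (tm * tn * k / pes_per_cluster) * penalty


def brute_force_label(shapes, policies, num_pes, cost=toy_cost):
    """Try every policy on every operation and keep the cheapest as the label."""
    rows = []
    for shape in shapes:                                      # S41: synthetic operations
        costs = [cost(shape, p, num_pes) for p in policies]   # S43: run all policies
        best = min(range(len(policies)), key=costs.__getitem__)
        rows.append(((*shape, num_pes), best))                # S44: (features, label)
    return rows


# S42: enumerate the available policy sets (candidate values are assumptions).
policies = list(product([(8, 8), (4, 16)], [2, 4], ["OS", "WS", "IS"]))
rows = brute_force_label([(64, 64, 32), (16, 128, 8)], policies, num_pes=16)
```

In a real setup, `cost` would wrap an actual run or a cycle-accurate simulation of the target NPU; the exhaustive loop is what makes the resulting labels "correct answers" for supervised training.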

FIG. 5 is a flowchart explaining the detailed procedure of the prediction model training process within the overall execution sequence of the framework of FIG. 3.

Referring to FIG. 5, in the prediction model training process, the operation features (or kernel features) and the NPU features used for training the prediction model are first classified (S51).

Next, an artificial neural network composed of fully-connected layers is built (S52). The artificial neural network includes a deep neural network (DNN).

Next, the prediction model is trained through the DNN based on the synthetic data set and the classified feature set (S53).

The trained model may be verified with a validation data set (S54). The validation data set may include a data set with correct answers.
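The training steps above can be illustrated with a very small fully-connected network written in plain Python. This is a sketch under assumptions, not the patent's implementation: one tanh hidden layer, a softmax output, per-sample gradient descent, and arbitrary hyperparameters. The toy data set stands in for the (features, label) rows produced by the brute-force search.

```python
import math
import random


def init(d, h, c, rng):
    w = lambda r, k: [[rng.uniform(-0.5, 0.5) for _ in range(k)] for _ in range(r)]
    return {"W1": w(h, d), "b1": [0.0] * h, "W2": w(c, h), "b2": [0.0] * c}


def forward(p, x):
    """Return hidden activations and softmax class probabilities."""
    a1 = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
          for row, b in zip(p["W1"], p["b1"])]
    z2 = [sum(wi * ai for wi, ai in zip(row, a1)) + b
          for row, b in zip(p["W2"], p["b2"])]
    top = max(z2)  # subtract the max for numerical stability
    e = [math.exp(z - top) for z in z2]
    s = sum(e)
    return a1, [v / s for v in e]


def sgd_step(p, x, y, lr):
    a1, probs = forward(p, x)
    dz2 = [q - (1.0 if i == y else 0.0) for i, q in enumerate(probs)]
    # Back-propagate through W2 before W2 is updated.
    da1 = [sum(p["W2"][i][j] * dz2[i] for i in range(len(dz2))) for j in range(len(a1))]
    dz1 = [g * (1.0 - a * a) for g, a in zip(da1, a1)]
    for i, g in enumerate(dz2):
        p["b2"][i] -= lr * g
        for j, a in enumerate(a1):
            p["W2"][i][j] -= lr * g * a
    for j, g in enumerate(dz1):
        p["b1"][j] -= lr * g
        for k, xk in enumerate(x):
            p["W1"][j][k] -= lr * g * xk


def mean_loss(p, data):
    """Average cross-entropy; usable on a validation set as in S54."""
    return sum(-math.log(forward(p, x)[1][y]) for x, y in data) / len(data)


def predict(p, x):
    probs = forward(p, x)[1]
    return max(range(len(probs)), key=probs.__getitem__)


def train(data, d, h, c, epochs=200, lr=0.1, seed=0):
    p = init(d, h, c, random.Random(seed))   # S52: build the network
    for _ in range(epochs):                  # S53: train on labeled rows
        for x, y in data:
            sgd_step(p, x, y, lr)
    return p


# Toy labeled rows: two normalized features, two candidate policies.
data = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
model = train(data, d=2, h=4, c=2)
```

Real feature values such as layer dimensions span several orders of magnitude, so some normalization before training would likely be needed; the toy inputs here are already in the range 0 to 1.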

FIG. 6 is a flowchart explaining the detailed procedure of applying the result to a DNN application within the overall execution sequence of the framework of FIG. 3.

Referring to FIG. 6, in the process of applying to DNN applications, the matrix multiplication and convolution layers of the target DNN application are first extracted (S61).

Next, the optimal policy set is predicted based on the target NPU and the layer features (S62).

Next, compilation is performed by applying the policy set derived for each layer (S63). Compilation may refer to defining the parameters needed for the training process before the prediction model is trained. During compilation, the training parameters may be passed to a compile method, which may correspond to a set of code or instructions that defines the compile function.

Next, the DNN application is executed through the compiled layers (S64).
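The four application steps can be sketched as one function. Everything here is hypothetical: the layer dictionaries, the `predict_policy` callable standing in for the trained model, and the treatment of "compile" as simply binding a policy to a layer. No real compiler or NPU runtime API is implied.

```python
def apply_policies(model_layers, npu_features, predict_policy):
    """Miniature version of S61 to S63; returns layers paired with policies."""
    # S61: keep only matrix-multiplication and convolution layers.
    targets = [layer for layer in model_layers if layer["type"] in ("GEMM", "CONV")]
    compiled = []
    for layer in targets:
        # S62: predict a policy set from the NPU and layer features.
        policy = predict_policy(npu_features, layer["shape"])
        # S63: "compile" is modeled here as binding the policy to the layer.
        compiled.append({"name": layer["name"], "policy": policy})
    # S64: a real runtime would now execute the compiled layers in order.
    return compiled


layers = [
    {"name": "conv0", "type": "CONV", "shape": (3, 3, 256, 256)},
    {"name": "pool0", "type": "POOL", "shape": (2, 2)},
    {"name": "fc0", "type": "GEMM", "shape": (2048, 256, 1600)},
]
# Placeholder predictor: picks a dataflow from the shape rank only.
plan = apply_policies(layers, {"num_pes": 64},
                      lambda npu, shape: "OS" if len(shape) == 4 else "WS")
```

Because the predictor is passed in as a callable, the same pipeline works with any trained model that maps features to a policy.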

FIG. 7 is an exemplary diagram for explaining a DNN kernel that can be employed in the framework of FIG. 2A. FIG. 8 is an exemplary diagram for explaining a target NPU that can be employed in the framework of FIG. 2A. FIG. 9 is an exemplary diagram for explaining an MLP model training process that can be employed in the framework of FIG. 2A and the optimal parameter configuration resulting from that process.

That is, FIGS. 7 to 9 illustrate a method according to another embodiment of the present invention, in which an optimal layer-wise execution policy is predicted based on a supervised learning technique on an NPU with a systolic array structure.

As shown in FIG. 7, the layer-wise execution policy prediction method first analyzes various parameters that depend on the deep learning (DL) applications and the target NPU structure. The parameter analysis may be configured to acquire kernel features, including the layer configurations (layer configs) of the DNN kernels 12, the batch of the layers, and the neural network (NN) parameters (params).

In addition, as shown in FIG. 8, the parameter analysis may be configured to acquire NPU features regarding the processing element (PE) resources of the target NPU 22 and the memory limits of its on-chip scratchpad memory and off-chip memory.

Next, as shown in FIG. 9, the layer-wise execution policy prediction method performs a brute-force search in order to apply the supervised learning technique 52, and may train a model through a neural network built on the data obtained from the brute-force search. The trained model may then be used to infer the best configuration 82.

The inferred optimal parameter configuration 84 may include, for example, for the DNN layers CONV #0, CONV #1, and CONV #2 in that order, tiling factors of (8, 8), (4, 16), and (16, 8), clustering factors of 2, 8, and 4, and dataflows of OS, WS, and IS.
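The example configuration above maps naturally onto a small record type. The sketch below uses the values from the text; the class and variable names (`Dataflow`, `LayerPolicy`, `OPTIMAL_CONFIG`) are illustrative and do not come from the patent.

```python
from dataclasses import dataclass
from enum import Enum


class Dataflow(Enum):
    OS = "output stationary"
    WS = "weight stationary"
    IS = "input stationary"


@dataclass(frozen=True)
class LayerPolicy:
    tiling: tuple        # tiling factor, e.g. (8, 8)
    clustering: int      # clustering factor
    dataflow: Dataflow


# Per-layer policy set matching the example values in the description.
OPTIMAL_CONFIG = {
    "CONV #0": LayerPolicy((8, 8), 2, Dataflow.OS),
    "CONV #1": LayerPolicy((4, 16), 8, Dataflow.WS),
    "CONV #2": LayerPolicy((16, 8), 4, Dataflow.IS),
}
```

Making the record frozen keeps a predicted policy immutable once it has been handed to later stages.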

Once training is complete, the model can predict the optimal execution policy in a new environment that was not considered during training, and the predicted optimal execution policy can be applied to the target NPU application runtime.

As such, the optimal layer-wise policy set classified by the prediction method of this embodiment may include a tiling policy, a clustering policy, a dataflow, and the like.

The tiling policy, which may correspond to a tiling factor, is an optimization policy that divides an entire matrix multiplication or convolution operation into tiles of equal size and distributes them across the compute cores. The optimal value of the tiling parameter may vary with the size and shape of the matrices that make up the target operation.
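As a rough illustration of dividing an operation into equal tiles and spreading them over cores, the sketch below splits an output matrix into tiles and assigns them round-robin. The round-robin schedule and the requirement that the tile size divide the matrix evenly are assumptions made for the example; the patent does not specify either.

```python
def split_into_tiles(rows, cols, tile_rows, tile_cols):
    """List the (row_range, col_range) covered by each equal-sized tile."""
    if rows % tile_rows or cols % tile_cols:
        raise ValueError("tile size must divide the matrix evenly")
    return [(range(r, r + tile_rows), range(c, c + tile_cols))
            for r in range(0, rows, tile_rows)
            for c in range(0, cols, tile_cols)]


def assign_round_robin(tiles, num_cores):
    """Distribute tiles over cores in round-robin order (an assumed schedule)."""
    work = [[] for _ in range(num_cores)]
    for i, tile in enumerate(tiles):
        work[i % num_cores].append(tile)
    return work


tiles = split_into_tiles(16, 16, 8, 8)   # four 8 x 8 tiles
work = assign_round_robin(tiles, 2)      # two tiles per core
```

A different tile shape changes both the tile count and how evenly the work divides, which is one reason the best tiling factor depends on the matrix dimensions.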

The clustering policy, which may be referred to as a clustering factor, is an optimization policy that divides the compute cores of the target NPU into clusters of equal size so that operations can be divided and assigned to each cluster. The optimal value of the clustering parameter may vary with the target NPU's resources and the size and shape of the target operation.
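A minimal sketch of equal-size clustering follows. It assumes that the clustering factor means the number of clusters (the text does not say whether it is the cluster count or the cluster size) and that PEs are identified by consecutive integers; both function names are illustrative.

```python
def valid_clustering_factors(num_pes):
    """Cluster counts that split the PEs into equal-sized groups."""
    return [f for f in range(1, num_pes + 1) if num_pes % f == 0]


def cluster_pes(num_pes, factor):
    """Partition PE indices 0..num_pes-1 into `factor` equal clusters."""
    if num_pes % factor:
        raise ValueError("factor must divide the number of PEs")
    size = num_pes // factor
    return [list(range(i * size, (i + 1) * size)) for i in range(factor)]


factors = valid_clustering_factors(16)
clusters = cluster_pes(8, 2)
```

The list of valid factors is exactly the candidate set a brute-force search would enumerate for this policy on a given NPU.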

Dataflow is a policy on the order or method by which the input values needed for an operation are assigned to the target NPU. Dataflow policies include output stationary, weight stationary (filter values held fixed), and input stationary, and the optimal choice may vary with the target matrix multiplication or convolution operation.

FIG. 10 is a schematic block diagram of an apparatus according to yet another embodiment of the present invention, namely an apparatus that predicts an optimal layer-wise execution policy based on a supervised learning technique on an NPU with a systolic array structure (hereinafter simply the 'prediction device').

The prediction device of this embodiment may also include a framework that applies the predicted execution policy. Accordingly, the device of this embodiment may also be referred to as a prediction and application device.

Referring to FIG. 10, the prediction device 100 may include at least one processor 110 and a memory 120. The prediction device 100 may further include a transceiver 130. The prediction device 100 may optionally further include a storage device 140, an input interface device 150, an output interface device 160, and the like. The components of the prediction device 100 may be connected by a bus and communicate with one another.

The processor 110 may execute program commands stored in at least one of the memory 120 and the storage device 140. The program commands may include: a command to analyze the kernel parameters of each layer of various DNN applications and the available resources of the target NPU; a command to perform, for model training with the supervised learning technique, a brute-force search over the range of the data set holding all correct answers usable for training the prediction model; a command to train the prediction model through a neural network composed of fully-connected layers (FCL) based on the data obtained through the brute-force search; a command to search for an optimal factor for each situation or each layer through the model trained on the layer information of the target DNN application and the information of the target NPU; a command to predict the optimal execution policy; and a command to apply the predicted optimal execution policy dynamically to the DNN application runtime. The program commands may also include commands for performing each step of FIGS. 4 to 6.

The processor 110 may be a central processing unit (CPU), a graphics processing unit (GPU), or a dedicated processor on which at least one of the methods according to embodiments of the present invention is performed.

Each of the memory 120 and the storage device 140 may consist of at least one of a volatile storage medium and a non-volatile storage medium. For example, the memory 120 may consist of at least one of read-only memory (ROM) and random access memory (RAM).

The transceiver 130 may include a communication subsystem capable of communicating by wired or wireless means. The communication subsystem may be configured to support at least one communication scheme, such as an intranet, the Internet, wireless LAN, mobile communication, or satellite communication.

The input interface device 150 may include at least one input means selected from a keyboard, a microphone, a touchpad, a touchscreen, and the like, and an input signal processing unit that maps signals received through the at least one input means to pre-stored commands or otherwise processes them.

The output interface device 160 may include an output signal processing unit that maps signals output under the control of the processor 110 to a pre-stored signal form or level or otherwise processes them, and at least one output means that outputs signals or information in the form of vibration, light, or the like according to the signal from the output signal processing unit. The at least one output means may include at least one selected from output means such as a speaker, a display device, a printer, an optical output device, and a vibration output device.

The operations of the method according to embodiments of the present invention can be implemented as a computer-readable program or code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. The computer-readable recording medium may also be distributed over network-connected computer systems so that the computer-readable program or code is stored and executed in a distributed manner.

The computer-readable recording medium may also include hardware devices specially configured to store and execute program commands, such as ROM, RAM, and flash memory. The program commands may include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like.

Although some aspects of the present invention have been described in the context of an apparatus, they may also represent a description of the corresponding method, where a unit or apparatus corresponds to a method step or a feature of a method step. Similarly, aspects described in the context of a method may also be represented by a corresponding block, unit, or feature of a corresponding apparatus. Some or all of the method steps may be performed by (or using) a hardware device such as a microprocessor, a programmable computer, or an electronic circuit. In some embodiments, one or more of the most important method steps may be performed by such a device.

In embodiments, a programmable logic device, for example a field programmable gate array (FPGA), may be used to perform some or all of the functions of the methods described herein. Also, in embodiments, an application-specific semiconductor such as an FPGA, an application-specific integrated circuit (ASIC), application-specific standard parts (ASSP), or a system on chip (SoC) may operate together with a microprocessor to perform one of the methods described herein. In general, the methods are preferably performed by some hardware device.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

Claims

A method for predicting an optimal layer-wise execution policy based on a supervised learning technique in an inference neural processing unit (NPU) environment with a systolic array structure, the method comprising:
analyzing kernel parameters of each layer of a target deep neural network (DNN) application and available resources of a target NPU;
searching, through a model trained on layer information of the target DNN application and information of the target NPU, for an optimal factor, which is the factor that yields the best performance for each situation or each layer; and
applying an execution policy predicted based on the optimal factor to a DNN application runtime.

The method of claim 1,
wherein the analyzing step includes acquiring kernel features including a layer configuration of DNN kernels, a batch of layers, and neural network parameters.

The method of claim 1,
wherein the analyzing step includes acquiring NPU features regarding processing element (PE) resources of the target NPU, or memory limits of an on-chip scratchpad memory and an off-chip memory of the NPU.

The method of claim 1,
wherein the step of searching for the optimal factor includes, for each DNN layer, deriving a tiling factor, deriving a clustering factor, and deriving a dataflow.

The method of claim 1,
wherein the step of applying to the DNN application runtime includes:
extracting matrix multiplication and convolution layers of the target DNN application;
predicting an execution policy set based on the target NPU and layer features;
performing compilation by applying the execution policy set derived for each layer; and
executing the DNN application through the compiled layers.

The method of claim 1,
further comprising, before the searching step, training a prediction model through a neural network composed of fully-connected layers based on data obtained through a brute-force search.

The method of claim 6,
wherein the training step includes:
classifying operation features or kernel features and NPU features for training the prediction model;
building an artificial neural network composed of fully-connected layers; and
training the prediction model through a DNN based on a synthetic data set and the classified feature set.

The method of claim 7,
further comprising, before the training step, performing a brute-force search to train the model with the supervised learning technique.

The method of claim 8,
wherein the brute-force search is performed over the range of a data set holding all correct answers usable for training the prediction model.

An apparatus for predicting an optimal layer-wise execution policy based on a supervised learning technique in an inference neural processing unit (NPU) environment with a systolic array structure, the apparatus comprising:
at least one processor; and
a memory storing at least one instruction executed by the at least one processor,
wherein, according to the at least one instruction, the at least one processor:
analyzes kernel parameters of each layer of a target deep neural network (DNN) application and available resources of a target NPU;
searches, through a model trained on layer information of the target DNN application and information of the target NPU, for an optimal factor, which is the factor that yields the best performance for each situation or each layer; and
applies an execution policy predicted based on the optimal factor to a DNN application runtime.

The apparatus of claim 10,
wherein the processor, in analyzing the available resources, acquires kernel features including a layer configuration of DNN kernels, a batch of layers, and neural network parameters.

The apparatus of claim 11,
wherein the processor, in analyzing the available resources, acquires NPU features regarding processing element (PE) resources of the target NPU and memory limits of an on-chip scratchpad memory and an off-chip memory.

The apparatus of claim 11,
wherein the processor, in searching for the optimal factor for each layer, derives a tiling factor, a clustering factor, and a dataflow for each DNN layer.

The apparatus of claim 11,
wherein the processor, in applying to the DNN application runtime:
extracts matrix multiplication and convolution layers of the target DNN application;
predicts an execution policy set based on the target NPU and layer features;
performs compilation by applying the execution policy set derived for each layer; and
executes the DNN application through the compiled layers.

The apparatus of claim 11,
wherein the processor:
before searching for the optimal factor for each layer, trains a prediction model through a neural network composed of fully-connected layers based on data obtained through a brute-force search; and
in training the prediction model, classifies operation features or kernel features and NPU features for training the prediction model, builds an artificial neural network composed of fully-connected layers, and trains the prediction model through a DNN based on a synthetic data set and the classified feature set.

The apparatus of claim 15,
wherein the processor, before training the prediction model, performs a brute-force search, for model training with the supervised learning technique, over the range of a data set holding all correct answers usable for training the prediction model.