KR20220149414A

KR20220149414A - NPU implemented for artificial neural networks to process fusion of heterogeneous data received from heterogeneous sensors

Info

Publication number: KR20220149414A
Application number: KR1020220027949A
Authority: KR
Inventors: 김녹원
Original assignee: 주식회사 딥엑스
Priority date: 2021-04-30
Filing date: 2022-03-04
Publication date: 2022-11-08
Also published as: KR102506613B1

Abstract

According to one disclosure of the present invention, a neural processing unit (NPU) is provided. The NPU includes: a control unit configured to receive machine code of a compiled fusion-artificial neural network; an input unit (or sensor) configured to receive a plurality of input signals corresponding to the fusion-artificial neural network; an array of processing elements (PEs) configured to perform fusion-artificial neural network operations; a special function unit (SFU) configured to perform special functions of the fusion-artificial neural network operations; and an on-chip memory configured to store fusion-artificial neural network operation data. The control unit can include a scheduler configured to control the PE array, the SFU, and the on-chip memory so that all operations of the fusion-artificial neural network are processed in a preset order according to artificial neural network data locality information included in the machine code.

Description

NPU IMPLEMENTED FOR ARTIFICIAL NEURAL NETWORKS TO PROCESS FUSION OF HETEROGENEOUS DATA RECEIVED FROM HETEROGENEOUS SENSORS

The present disclosure relates to an artificial neural network.

Humans are equipped with intelligence capable of recognition, classification, inference, prediction, and control/decision making. Artificial intelligence (AI) refers to the artificial imitation of human intelligence.

The human brain consists of numerous nerve cells called neurons, and each neuron is connected to hundreds to thousands of other neurons through junctions called synapses. A model of the operating principle of biological neurons and of the connections between them, built to imitate human intelligence, is called an artificial neural network (ANN) model. In other words, an artificial neural network is a system in which nodes imitating neurons are connected in a layer structure.

Artificial neural network models are classified as single-layer or multi-layer neural networks according to the number of layers. A typical multi-layer neural network consists of an input layer, a hidden layer, and an output layer. (1) The input layer receives external data, and the number of neurons in the input layer equals the number of input variables. (2) The hidden layer is located between the input layer and the output layer; it receives signals from the input layer, extracts features, and passes them to the output layer. (3) The output layer receives signals from the hidden layer and outputs them to the outside. Input signals between neurons are each multiplied by a connection strength with a value between 0 and 1 and then summed; if this sum is greater than the neuron's threshold, the neuron is activated and the sum is converted into an output value through an activation function.

Meanwhile, an artificial neural network in which the number of hidden layers is increased in order to achieve higher artificial intelligence is called a deep neural network (DNN).

On the other hand, for autonomous driving, a vehicle may be equipped with various sensors and devices, for example LiDAR (Light Detection And Ranging), radar, a camera, GPS, an ultrasonic sensor, and an NPU. Because the data provided by such sensors is very large, processing it takes a considerable amount of time.

However, since autonomous driving requires this vast amount of data to be processed in near real time, artificial neural networks have recently emerged as a solution.

However, implementing a dedicated artificial neural network for each kind of heterogeneous sensor data is very inefficient.

Therefore, the inventor of this patent has studied a neural processing unit (NPU) for effectively processing, through a fusion neural network, the different data provided by heterogeneous sensors.

According to one disclosure of the present specification, a neural processing unit (NPU) is provided. The NPU may include: a control unit configured to receive machine code of a compiled fusion-artificial neural network; an input unit (or sensor) configured to receive a plurality of input signals corresponding to the fusion-artificial neural network; a processing element (PE) array configured to perform fusion-artificial neural network operations; a special function unit (SFU) configured to perform special functions of the fusion-artificial neural network operations; and an on-chip memory configured to store fusion-artificial neural network operation data. The control unit may include a scheduler configured to control the processing element array, the special function unit, and the on-chip memory so that all operations of the fusion-artificial neural network are processed in a preset order according to artificial neural network data locality information included in the machine code.

According to another disclosure of the present specification, a neural processing unit (NPU) is provided. The NPU may include: a control unit configured to receive machine code of a fusion-artificial neural network; a processing element (PE) array configured to perform operations of the fusion-artificial neural network based on the machine code; and a special function unit (SFU) configured to receive the convolution results processed by the PE array and compute the corresponding special function. The SFU may include a plurality of functional units. The SFU may selectively control at least some of the plurality of functional units according to artificial neural network data locality information included in the machine code.

According to yet another disclosure of the present specification, a system is provided. The system may include: at least one neural processing unit (NPU) that includes a control unit configured to receive machine code of a fusion-artificial neural network, an input unit configured to receive at least one input signal, a processing element array configured to perform a convolution operation, and an on-chip memory configured to store the result of the convolution operation; and a memory controller that includes a memory. The memory controller is configured to receive artificial neural network data locality information of the fusion-artificial neural network, from which successive memory operation requests of the at least one NPU can be predicted, and, based on that information, to cache in advance the next memory operation request that the corresponding NPU will issue.
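As a rough illustration of the pre-caching idea, the sketch below models a memory controller that is handed the expected sequence of memory requests and loads the data for the next one into a small cache before it is asked for. The class name `LocalityPrefetcher`, the dict-backed `main_memory`, and the string addresses are hypothetical and are not taken from the disclosure; real hardware would deal with addresses, burst sizes, and timing that are ignored here.

```python
class LocalityPrefetcher:
    """Toy memory controller that caches the next expected request in advance."""

    def __init__(self, main_memory, access_order):
        self.main_memory = main_memory    # hypothetical backing store: address -> data
        self.access_order = access_order  # locality information: expected request sequence
        self.position = 0                 # index of the next expected request
        self.cache = {}
        self.hits = 0
        self.misses = 0
        self._prefetch()

    def _prefetch(self):
        # Load the data for the next expected request before it arrives.
        self.cache.clear()
        if self.position < len(self.access_order):
            addr = self.access_order[self.position]
            self.cache[addr] = self.main_memory[addr]

    def read(self, addr):
        if addr in self.cache:
            self.hits += 1
            data = self.cache[addr]
        else:
            self.misses += 1
            data = self.main_memory[addr]
        # Advance along the known sequence only when the prediction was right.
        if (self.position < len(self.access_order)
                and self.access_order[self.position] == addr):
            self.position += 1
        self._prefetch()
        return data


memory = {"w1": [1, 2], "fmap1": [3, 4], "w2": [5, 6]}
order = ["w1", "fmap1", "w2"]
controller = LocalityPrefetcher(memory, order)
results = [controller.read(addr) for addr in order]
print(controller.hits, controller.misses)
```

When the NPU issues requests in exactly the order given by the locality information, every read in this toy model is served from the cache; a request outside the predicted order falls back to the backing store.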

Using the NPU presented in the disclosures of the present specification, the performance of a fusion artificial neural network that processes the different data provided by heterogeneous sensors can be improved.

According to the disclosures of the present specification, a fusion artificial neural network can effectively process the different data provided by heterogeneous sensors through a concatenation operation and a skip-connection operation. To this end, the NPU includes a special function unit (SFU) in which a plurality of functional units are connected as a pipeline, and power consumption can be reduced by selectively turning off functional units.
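The selectively enabled pipeline can be pictured with a small software model. The sketch below is only an assumption-laden illustration: the stage names (`bias_add`, `relu`, `pool2`), the `SpecialFunctionUnit` class, and the idea of passing an `enabled` set are all hypothetical, and a disabled stage is modeled simply as a bypass rather than as real power gating.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def max_pool_pairs(values):
    # 1-D max pooling over non-overlapping pairs.
    return [max(values[i:i + 2]) for i in range(0, len(values), 2)]


class SpecialFunctionUnit:
    """Toy SFU: an ordered pipeline whose stages can be switched off individually."""

    def __init__(self, stages):
        self.stages = stages  # ordered list of (name, function)

    def run(self, values, enabled):
        active = []
        for name, fn in self.stages:
            if name in enabled:
                values = fn(values)
                active.append(name)
            # A disabled stage is bypassed: data passes through unchanged.
        return values, active


sfu = SpecialFunctionUnit([
    ("bias_add", lambda v: [x + 1.0 for x in v]),
    ("relu", relu),
    ("pool2", max_pool_pairs),
])
out, active = sfu.run([-3.0, 0.5, 2.0, -1.0], enabled={"relu", "pool2"})
print(out, active)
```

The list of active stages returned by `run` stands in for the set of functional units that would remain powered for a given layer.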

According to one disclosure of the present specification, road signs can be detected effectively by turning a near-infrared (NIR) light source on and off and using a near-infrared sensor to detect the near-infrared light reflected from signs that have retro-reflective properties.

FIG. 1 is a schematic conceptual diagram illustrating a neural processing unit according to the present disclosure.
FIG. 2 is a schematic conceptual diagram illustrating one processing element of a processing element array that may be applied to the present disclosure.
FIG. 3 is an exemplary diagram showing a modified example of the NPU 100 shown in FIG. 1.
FIG. 4 is a schematic conceptual diagram illustrating an exemplary artificial neural network model.
FIG. 5 is a diagram for explaining the basic structure of a convolutional neural network.
FIG. 6 is a diagram for explaining the input data of a convolution layer and the kernel used for the convolution operation.
FIG. 7 is a diagram for explaining the operation of a convolutional neural network that generates an activation map using a kernel.
FIG. 8 is an overview diagram that presents the operation of a convolutional neural network in an easy-to-understand way.
FIG. 9A shows an example of an autonomous vehicle to which the disclosures of the present specification are applied.
FIG. 9B shows the autonomous driving levels defined by the Society of Automotive Engineers (SAE International).
FIG. 10 is an exemplary diagram illustrating a fusion algorithm.
FIG. 11A is an exemplary diagram illustrating an example of recognizing an object, and FIG. 11B is an exemplary diagram illustrating the structure of an SSD.
FIG. 12A shows an example of an artificial neural network that uses a radar mounted on a vehicle.
FIG. 12B shows an example of a fusion processing method that uses a radar and a camera.
FIG. 13 shows an example of a fusion artificial neural network that uses a LiDAR and a camera.
FIG. 14 is an exemplary diagram illustrating late fusion, early fusion, and deep fusion.
FIG. 15 is an exemplary diagram illustrating a system including an NPU architecture according to a first example.
FIG. 16A is an exemplary diagram illustrating a model of an artificial neural network that includes a skip-connection.
FIG. 16B is an exemplary diagram illustrating the data locality information of an artificial neural network that includes a skip-connection.
FIG. 17 is an exemplary diagram illustrating a system including an NPU architecture according to a second example.
FIG. 18 is an exemplary diagram illustrating a system including an NPU architecture according to a third example.
FIG. 19 is an exemplary diagram illustrating a system including an NPU architecture according to a fourth example.
FIG. 20 shows an example in which the fusion artificial neural network shown in FIG. 13 is divided into threads according to the fourth example shown in FIG. 19.
FIG. 21 is an exemplary diagram illustrating a system including an NPU architecture according to a fifth example.
FIG. 22 is an exemplary diagram illustrating a first example of the pipeline structure of the SFU shown in FIG. 21.
FIG. 23A is an exemplary diagram illustrating a second example of the pipeline structure of the SFU shown in FIG. 21.
FIG. 23B is an exemplary diagram illustrating a third example of the pipeline structure of the SFU shown in FIG. 21.
FIG. 24 is an exemplary diagram illustrating a system including an NPU architecture according to a sixth example.
FIG. 25 is an exemplary diagram illustrating an example of using a plurality of NPUs according to a seventh example.
FIG. 26 is an exemplary diagram illustrating an example of processing the fusion artificial neural network shown in FIG. 13 with the plurality of NPUs shown in FIG. 25.
FIGS. 27A to 27C show examples of using a fusion artificial neural network with a near-infrared sensor and a camera.
FIG. 28 shows an example of using a polarizer according to an eighth example.
FIGS. 29A and 29B are exemplary diagrams showing the performance of the polarizer.

The specific structural or step-by-step descriptions of the embodiments according to the concept of the present disclosure that are disclosed in this specification or application are given merely for the purpose of describing those embodiments.

Embodiments according to the concept of the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments described in this specification or application.

Since embodiments according to the concept of the present disclosure may be modified in various ways and may take various forms, specific embodiments are illustrated in the drawings and described in detail in this specification or application. However, this is not intended to limit the embodiments according to the concept of the present disclosure to a specific disclosed form, and they should be understood to include all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present disclosure.

Terms such as first and/or second may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another; for example, without departing from the scope of rights according to the concept of the present disclosure, a first element may be termed a second element, and similarly a second element may be termed a first element.

When an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to that other element, but it should be understood that other elements may exist in between. On the other hand, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that no other element exists in between. Other expressions describing the relationship between elements, such as "between" and "immediately between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

The terms used in this specification are used only to describe specific embodiments and are not intended to limit the present disclosure. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to specify that a stated feature, number, step, operation, element, part, or combination thereof exists, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present disclosure belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and should not be interpreted in an idealized or excessively formal sense unless explicitly so defined in this specification.

In describing the embodiments, descriptions of technical content that is well known in the technical field to which the present disclosure belongs and that is not directly related to the present disclosure are omitted. Omitting unnecessary description conveys the gist of the present disclosure more clearly without obscuring it.

<Definition of Terms>

To aid understanding of the disclosures presented in this specification, the terms used in this specification are briefly summarized below.

NPU: An abbreviation of neural processing unit, which may refer to a processor, separate from a central processing unit (CPU), that is specialized for the computation of an artificial neural network model.

ANN: An abbreviation of artificial neural network, which may refer to a network in which nodes are connected in a layer structure, imitating the way neurons in the human brain are connected through synapses, in order to imitate human intelligence.

For example, the artificial neural network model may be a model such as Bisenet, Shelfnet, Alexnet, Densenet, Efficientnet, EfficientDet, Googlenet, Mnasnet, Mobilenet, Resnet, Shufflenet, Squeezenet, VGG, Yolo, RNN, CNN, DBN, RBM, or LSTM. However, the present disclosure is not limited thereto, and new artificial neural network models that can run on the NPU 100 are continually being released.

Information on the structure of an artificial neural network: information that includes the number of layers, the number of nodes in each layer, the value of each node, the operation processing method, the weight matrix applied to each node, and the like.

Information on the data locality of an artificial neural network: information that includes the order of the data access requests issued to memory, determined based on the structure of the artificial neural network and of the neural processing unit that processes it.
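To make this definition more concrete, the following sketch derives an ordered list of memory requests from a simple chain of layers. It is a minimal illustration under stated assumptions: the `Layer` record, the `access_order` function, and the fixed "read weights, read input, write output" pattern are hypothetical, and the sketch ignores the NPU-side factors (such as PE array size or internal memory capacity) that the definition says also shape the real ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    input_name: str  # name of the feature map this layer consumes


def access_order(layers):
    """Return the ordered list of (operation, data) memory requests."""
    requests = []
    for layer in layers:
        requests.append(("read", f"{layer.name}.weights"))
        requests.append(("read", layer.input_name))
        requests.append(("write", f"{layer.name}.output"))
    return requests


layers = [Layer("conv1", "input"), Layer("conv2", "conv1.output")]
order = access_order(layers)
for request in order:
    print(request)
```

Because the sequence is fixed once the model and the hardware are known, it can be computed ahead of time (for example at compile time) rather than discovered at run time.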

DNN: An abbreviation of deep neural network, which may refer to an artificial neural network in which the number of hidden layers is increased in order to achieve higher artificial intelligence.

CNN: An abbreviation of convolutional neural network, a neural network that functions similarly to the way the visual cortex of the human brain processes images. Convolutional neural networks are known to be well suited to image processing and to make it easy to extract features from input data and identify patterns in those features.

Kernel: This may refer to a weight matrix applied to a CNN.
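As a small worked example of a kernel acting as a weight matrix, the sketch below slides a 2x2 kernel over a 3x3 input with stride 1 and no padding. The function name `conv2d_valid` and the sample values are chosen here for illustration only; as is common in CNN practice, the kernel is applied without flipping (strictly speaking, a cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv2d_valid(image, kernel)
print(feature_map)
```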

Off-chip memory: The memory capacity inside the NPU may be limited, so memory may be placed outside the chip to store large amounts of data. The off-chip memory may include one of memories such as ROM, SRAM, DRAM, Resistive RAM, Magneto-resistive RAM, Phase-change RAM, Ferroelectric RAM, Flash Memory, and HBM. The off-chip memory may consist of at least one memory unit, and may be configured as homogeneous memory units or heterogeneous memory units.

On-chip memory: The NPU may include on-chip memory. The on-chip memory may include volatile memory and/or non-volatile memory. For example, the on-chip memory may include one of memories such as ROM, SRAM, DRAM, Resistive RAM, Magneto-resistive RAM, Phase-change RAM, Ferroelectric RAM, Flash Memory, and HBM. The on-chip memory may consist of at least one memory unit, and may be configured as homogeneous memory units or heterogeneous memory units.

Hereinafter, the present disclosure is described in detail by describing preferred embodiments with reference to the accompanying drawings.

FIG. 1 is a schematic conceptual diagram illustrating a neural processing unit according to the present disclosure.

The neural processing unit (NPU) 100 shown in FIG. 1 is a processor specialized to perform operations for an artificial neural network.

An artificial neural network refers to a network of artificial neurons, each of which multiplies its incoming inputs or stimuli by their respective weights, sums the products, adds a bias, and passes the result through an activation function before transmitting it. An artificial neural network trained in this way can be used to output inference results from input data.
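The computation performed by a single artificial neuron can be sketched in a few lines. The function names `neuron` and `sigmoid` and the sample values are chosen here for illustration; the sigmoid is just one common choice of activation function and is not prescribed by the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation function."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


# 1.0 * 0.5 + 2.0 * (-0.25) + 0.0 = 0.0, and sigmoid(0.0) = 0.5
output = neuron([1.0, 2.0], [0.5, -0.25], bias=0.0)
print(output)
```

Swapping the `activation` argument (for example, for a rectifier such as `lambda s: max(0.0, s)`) changes the neuron's response without touching the weighted-sum step, which is the part a processing element array would typically accelerate.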

The NPU 100 may be a semiconductor device implemented as an electric/electronic circuit. The electric/electronic circuit may include a large number of electronic elements (e.g., transistors and capacitors). The NPU 100 may include a processing element (PE) array 110, an NPU internal memory 120, an NPU scheduler 130, and an NPU interface 140. Each of the processing element array 110, the NPU internal memory 120, the NPU scheduler 130, and the NPU interface 140 may be a semiconductor circuit in which numerous transistors are connected. Therefore, some of them may be difficult to identify and distinguish with the naked eye and may be identifiable only by their operation. For example, a given circuit may operate as the processing element array 110 or as the NPU scheduler 130. The NPU scheduler 130 may be configured to perform the function of a control unit that controls the artificial neural network inference operation of the NPU 100.

The NPU 100 may include the processing element array 110, the NPU internal memory 120 configured to store an artificial neural network model on which the processing element array 110 can run inference, and the NPU scheduler 130 configured to control the processing element array 110 and the NPU internal memory 120 based on the data locality information or structure information of the artificial neural network model. Here, the artificial neural network model may include its own data locality information or structure information. The artificial neural network model may refer to an AI recognition model trained to perform a specific inference function.

The processing element array 110 may perform operations for an artificial neural network.

The NPU interface 140 may communicate, through a system bus, with various components connected to the NPU 100, for example a memory.

The NPU scheduler 130 is configured to control the operations of the processing element array 110 for the inference computation of the neural processing unit 100, and the read and write order of the NPU internal memory 120.

The NPU scheduler 130 may be configured to control the processing element array 110 and the NPU internal memory 120 based on the data locality information or structure information of the artificial neural network model.

The NPU scheduler 130 may analyze the structure of the artificial neural network model to be run on the processing element array 110, or may be provided with information that has already been analyzed. For example, the artificial neural network data that a model may include comprises at least some of: the node data of each layer (i.e., feature maps), the arrangement data of the layers, the locality information or structure information, and the weight data (i.e., weight kernels) of each connection linking the nodes of the layers. The artificial neural network data may be stored in a memory provided inside the NPU scheduler 130 or in the NPU internal memory 120.

The NPU scheduler 130 may schedule the operation order of the artificial neural network model to be performed by the NPU 100 based on the data locality information or structure information of the artificial neural network model.

NPU 스케줄러(130)는 인공신경망모델의 데이터 지역성 정보 또는 구조에 대한 정보에 기초하여 인공신경망모델의 레이어의 특징맵 및 가중치 데이터가 저장된 메모리 어드레스 값을 획득할 수 있다. 예를 들면, NPU 스케줄러(130)는 메모리에 저장된 인공신경망모델의 레이어의 특징맵 및 가중치 데이터가 저장된 메모리 어드레스 값을 획득할 수 있다. 따라서 NPU 스케줄러(130)는 구동할 인공신경망모델의 레이어의 특징맵 및 가중치 데이터를 메모리(200)에서 가져와서 NPU 내부 메모리(120)에 저장할 수 있다. The NPU scheduler 130 may obtain a memory address value in which the feature map and weight data of the layer of the artificial neural network model are stored based on the data locality information or the structure information of the artificial neural network model. For example, the NPU scheduler 130 may obtain a memory address value in which the feature map and weight data of the layer of the artificial neural network model stored in the memory are stored. Therefore, the NPU scheduler 130 may bring the feature map and weight data of the layer of the artificial neural network model to be driven from the memory 200 and store it in the NPU internal memory 120 .

각각의 레이어의 특징맵은 대응되는 각각의 메모리 어드레스 값을 가질 수 있다. The feature map of each layer may have a corresponding memory address value.

각각의 가중치 데이터는 대응되는 각각의 메모리 어드레스 값을 가질 수 있다.Each weight data may have a corresponding respective memory address value.

The NPU scheduler 130 may schedule the operation order of the processing element array 110 based on the data locality information or structure information of the artificial neural network model, for example, the arrangement data, locality information, or structure information of the layers of the artificial neural network.

Because the NPU scheduler 130 schedules based on the data locality information or structure information of the artificial neural network model, it may operate differently from the scheduling concept of a general CPU. Scheduling in a general CPU takes fairness, efficiency, stability, response time, and the like into account in order to achieve the best efficiency. That is, it schedules tasks so that the most processing is performed within a given time, taking priority, operation time, and the like into account.

A conventional CPU uses an algorithm that schedules tasks in consideration of data such as the priority of each process and its operation processing time.

In contrast, the NPU scheduler 130 may control the NPU 100 according to a processing order of the NPU 100 that is determined based on the data locality information or structure information of the artificial neural network model.

Furthermore, the NPU scheduler 130 may drive the NPU 100 according to a processing order determined based on the data locality information or structure information of the artificial neural network model and/or the data locality information or structure information of the NPU 100 to be used.

However, the present disclosure is not limited to the data locality information or structure information of the NPU 100.

The NPU scheduler 130 may be configured to store the data locality information or structure information of the artificial neural network.

That is, the NPU scheduler 130 may determine the processing order even when it uses only the data locality information or structure information of the artificial neural network of the model.

Furthermore, the NPU scheduler 130 may determine the processing order of the NPU 100 in consideration of both the data locality information or structure information of the artificial neural network model and the data locality information or structure information of the NPU 100. The processing of the NPU 100 may also be optimized according to the determined processing order.

The processing element array 110 refers to a configuration in which a plurality of processing elements PE1 to PE12, configured to compute the feature maps and weight data of the artificial neural network, are arranged. Each processing element may include a multiply-and-accumulate (MAC) operator and/or an arithmetic logic unit (ALU) operator. However, examples according to the present disclosure are not limited thereto.

Although a plurality of processing elements is illustrated in FIG. 1 by way of example, a single processing element may instead contain, in place of the MAC, operators implemented as a plurality of multipliers and an adder tree arranged in parallel. In this case, the processing element array 110 may be referred to as at least one processing element including a plurality of operators.

The processing element array 110 is configured to include a plurality of processing elements PE1 to PE12. The processing elements PE1 to PE12 illustrated in FIG. 1 are merely an example for convenience of description, and the number of processing elements is not limited. The size or number of processing element arrays 110 may be determined by the number of processing elements PE1 to PE12. The processing element array 110 may be implemented in the form of an N x M matrix, where N and M are integers greater than zero. The processing element array 110 may include N x M processing elements. That is, there may be one or more processing elements.

The size of the processing element array 110 may be designed in consideration of the characteristics of the artificial neural network model that the NPU 100 runs.

The processing element array 110 is configured to perform functions such as addition, multiplication, and accumulation required for artificial neural network operations. In other words, the processing element array 110 may be configured to perform multiply-and-accumulate (MAC) operations.

Hereinafter, the first processing element PE1 of the processing element array 110 is described as an example.

FIG. 2 is a schematic conceptual diagram illustrating one processing element of a processing element array that may be applied to the present disclosure.

The NPU 100 according to an example of the present disclosure includes the processing element array 110, the NPU internal memory 120 configured to store an artificial neural network model that can be inferred by the processing element array 110, and the NPU scheduler 130 configured to control the processing element array 110 and the NPU internal memory 120 based on the data locality information or structure information of the artificial neural network model. The processing element array 110 may be configured to perform MAC operations and to quantize and output the MAC operation results. However, examples of the present disclosure are not limited thereto.

The NPU internal memory 120 may store all or part of the artificial neural network model, depending on the memory size and the data size of the model.

The first processing element PE1 may include a multiplier 111, an adder 112, an accumulator 113, and a bit quantization unit 114. However, examples according to the present disclosure are not limited thereto, and the processing element array 110 may be modified in consideration of the computational characteristics of the artificial neural network.

The multiplier 111 multiplies received (N)-bit data by (M)-bit data. The operation value of the multiplier 111 is output as (N+M)-bit data.

The multiplier 111 may be configured to receive one variable and one constant as inputs.

The accumulator 113 uses the adder 112 to accumulate the operation value of the multiplier 111 with the operation value of the accumulator 113 over (L) loops. Accordingly, the data at the output and input of the accumulator 113 may have a bit width of (N+M+log2(L)) bits, where L is an integer greater than 0.
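
The bit-width relationship above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the function name `accumulator_bit_width` is hypothetical, and rounding log2(L) up to the next integer is an assumption, since the text writes log2(L) without specifying how non-power-of-two loop counts are handled.

```python
def accumulator_bit_width(n_bits: int, m_bits: int, loops: int) -> int:
    """Bits needed to accumulate `loops` products of an N-bit and an M-bit operand.

    Each product is (N+M) bits wide; summing L of them can grow the result
    by up to ceil(log2(L)) further bits (rounding up is an assumption).
    """
    if n_bits <= 0 or m_bits <= 0 or loops <= 0:
        raise ValueError("n_bits, m_bits and loops must be integers greater than 0")
    # (loops - 1).bit_length() equals ceil(log2(loops)) for loops >= 1,
    # computed with integers only to avoid floating-point rounding.
    growth_bits = (loops - 1).bit_length()
    return n_bits + m_bits + growth_bits


# Example: 8-bit feature map values times 8-bit weights, accumulated 16 times.
width_16_loops = accumulator_bit_width(8, 8, 16)  # 8 + 8 + 4
```

With these example values the accumulator is 20 bits wide, which is the width the bit quantization unit 114 would then have to reduce.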

When the accumulation is finished, the accumulator 113 may receive an initialization reset signal and reset the data stored in the accumulator 113 to 0. However, examples according to the present disclosure are not limited thereto.

The bit quantization unit 114 may reduce the bit width of the data output from the accumulator 113. The bit quantization unit 114 may be controlled by the NPU scheduler 130. The quantized data may be output with a bit width of (X) bits, where X is an integer greater than 0. With this configuration, the processing element array 110 performs MAC operations and can quantize and output the MAC operation results. In particular, the power saving from such quantization grows as the number of loops (L) increases. Reducing power consumption also reduces heat generation, and reducing heat generation in turn lowers the likelihood of the NPU 100 malfunctioning because of high temperature.

The (X)-bit output data of the bit quantization unit 114 may become node data of the next layer or input data of a convolution. If the artificial neural network model has been quantized, the bit quantization unit 114 may be configured to receive the quantization information from the artificial neural network model. However, the present disclosure is not limited thereto, and the NPU scheduler 130 may instead be configured to analyze the artificial neural network model and extract the quantization information. Accordingly, the (X)-bit output data may be converted to the quantized bit width, so as to correspond to the quantized data size, and then output. The (X)-bit output data of the bit quantization unit 114 may be stored in the NPU internal memory 120 at the quantized bit width.
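
The data path of one processing element (multiply, accumulate, reset, quantize) can be sketched in software. The following Python sketch is a behavioral illustration only, not the circuit of FIG. 2: the function names `mac` and `quantize` are hypothetical, and the quantization rule (drop the low-order bits with an arithmetic right shift, then saturate to the signed X-bit range) is an assumption, because the text does not state how the bit quantization unit 114 reduces the bit width.

```python
def mac(inputs, weights):
    """Multiply each input by its weight and accumulate the products."""
    acc = 0  # the accumulator starts from 0, as after an initialization reset
    for x, w in zip(inputs, weights):
        acc += x * w  # one multiplier output added per loop
    return acc


def quantize(acc: int, acc_bits: int, out_bits: int) -> int:
    """Reduce a signed accumulator value from acc_bits to out_bits (assumed rule)."""
    shift = max(acc_bits - out_bits, 0)
    q = acc >> shift  # arithmetic shift: keeps the high-order bits
    lo = -(1 << (out_bits - 1))
    hi = (1 << (out_bits - 1)) - 1
    return max(lo, min(hi, q))  # saturate in case acc exceeded acc_bits


# Example: three products accumulated, then reduced from 16 bits to X = 8 bits.
acc_value = mac([10, 20, 30], [40, 50, 60])  # 400 + 1000 + 1800
node_value = quantize(acc_value, acc_bits=16, out_bits=8)
```

Here the accumulated value 3200 becomes 12 after the 8-bit shift, and that narrower value is what would be passed on as node data of the next layer.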

The processing element array 110 of the NPU 100 according to an example of the present disclosure includes the multiplier 111, the adder 112, the accumulator 113, and the bit quantization unit 114.

FIG. 3 is an exemplary diagram showing a modified example of the NPU 100 shown in FIG. 1.

The NPU 100 shown in FIG. 3 is substantially the same as the NPU 100 exemplarily shown in FIG. 1 except for the processing element array 110, so redundant description may be omitted below merely for convenience of description.

In addition to the plurality of processing elements PE1 to PE12, the processing element array 110 exemplarily illustrated in FIG. 3 may further include register files RF1 to RF12, each corresponding to one of the processing elements PE1 to PE12.

The processing elements PE1 to PE12 and the register files RF1 to RF12 illustrated in FIG. 3 are merely examples for convenience of description, and the number of processing elements and register files is not limited.

The size or number of processing element arrays 110 may be determined by the number of processing elements PE1 to PE12 and register files RF1 to RF12. The processing element array 110 and the register files RF1 to RF12 may be implemented in the form of an N x M matrix, where N and M are integers greater than zero.

The array size of the processing element array 110 may be designed in consideration of the characteristics of the artificial neural network model that the NPU 100 runs. In addition, the memory size of each register file may be determined in consideration of the data size of the artificial neural network model to be run, the required operating speed, the required power consumption, and the like.

The register files RF1 to RF12 of the NPU 100 are static memory units directly connected to the processing elements PE1 to PE12. The register files RF1 to RF12 may be composed of, for example, flip-flops and/or latches. The register files RF1 to RF12 may be configured to store the MAC operation values of the corresponding processing elements PE1 to PE12. The register files RF1 to RF12 may be configured to exchange weight data and/or node data with the NPU internal memory 120.

The register files RF1 to RF12 may also be configured to serve as temporary memory for the accumulator during MAC operations.

FIG. 4 is a schematic conceptual diagram illustrating an exemplary artificial neural network model.

Hereinafter, the operation of an exemplary artificial neural network model 110-10 that can be run on the NPU 100 is described.

The exemplary artificial neural network model 110-10 of FIG. 4 may be an artificial neural network trained on the NPU 100 shown in FIG. 1 or FIG. 3, or trained on a separate machine learning device. The artificial neural network model may be an artificial neural network trained to perform various inference functions, such as object recognition and speech recognition.

The artificial neural network model 110-10 may be a deep neural network (DNN).

However, the artificial neural network model 110-10 according to examples of the present disclosure is not limited to a deep neural network.

For example, the artificial neural network model may be a model trained to perform inference such as object detection, object segmentation, image/video reconstruction, image/video enhancement, object tracking, event recognition, event prediction, anomaly detection, density estimation, event search, and measurement.

For example, the artificial neural network model may be a model such as Bisenet, Shelfnet, Alexnet, Densenet, Efficientnet, EfficientDet, Googlenet, Mnasnet, Mobilenet, Resnet, Shufflenet, Squeezenet, VGG, Yolo, RNN, CNN, DBN, RBM, or LSTM. However, the present disclosure is not limited thereto, and new artificial neural network models that can run on an NPU continue to be released.

However, the present disclosure is not limited to the above-described models. The artificial neural network model 110-10 may also be an ensemble model based on at least two different models.

The artificial neural network model 110-10 may be stored in the NPU internal memory 120 of the NPU 100.

Hereinafter, with reference to FIG. 4, the inference process of the exemplary artificial neural network model 110-10 as performed by the NPU 100 is described.

The artificial neural network model 110-10 is an exemplary deep neural network model including an input layer 110-11, a first connection network 110-12, a first hidden layer 110-13, a second connection network 110-14, a second hidden layer 110-15, a third connection network 110-16, and an output layer 110-17. However, the present disclosure is not limited to the artificial neural network model shown in FIG. 4. The first hidden layer 110-13 and the second hidden layer 110-15 may be referred to collectively as a plurality of hidden layers.

The input layer 110-11 may include, for example, input nodes x1 and x2. That is, the input layer 110-11 may include information on two input values. The NPU scheduler 130 shown in FIG. 1 or FIG. 3 may set, in the NPU internal memory 120 shown in FIG. 1 or FIG. 3, a memory address at which the information on the input values from the input layer 110-11 is stored.

The first connection network 110-12 may include, for example, information on six weight values for connecting each node of the input layer 110-11 to each node of the first hidden layer 110-13. The NPU scheduler 130 shown in FIG. 1 or FIG. 3 may set, in the NPU internal memory 120, a memory address at which the information on the weight values of the first connection network 110-12 is stored. Each weight value is multiplied by an input node value, and the accumulated sum of the products is stored in the first hidden layer 110-13. Here, the nodes may be referred to as a feature map.

The first hidden layer 110-13 may include, for example, nodes a1, a2, and a3. That is, the first hidden layer 110-13 may include information on three node values. The NPU scheduler 130 shown in FIG. 1 or FIG. 3 may set, in the NPU internal memory 120, a memory address for storing the information on the node values of the first hidden layer 110-13.

The NPU scheduler 130 may be configured to schedule the operation order so that the first processing element PE1 performs the MAC operation of node a1 of the first hidden layer 110-13. The NPU scheduler 130 may be configured to schedule the operation order so that the second processing element PE2 performs the MAC operation of node a2 of the first hidden layer 110-13. The NPU scheduler 130 may be configured to schedule the operation order so that the third processing element PE3 performs the MAC operation of node a3 of the first hidden layer 110-13. Here, the NPU scheduler 130 may schedule the operation order in advance so that the three processing elements each perform their MAC operations simultaneously in parallel.

The second connection network 110-14 may include, for example, information on nine weight values for connecting each node of the first hidden layer 110-13 to each node of the second hidden layer 110-15. The NPU scheduler 130 shown in FIG. 1 or FIG. 3 may set, in the NPU internal memory 120, a memory address for storing the information on the weight values of the second connection network 110-14. Each weight value of the second connection network 110-14 is multiplied by the corresponding node value input from the first hidden layer 110-13, and the accumulated sum of the products is stored in the second hidden layer 110-15.

The second hidden layer 110-15 may include, for example, nodes b1, b2, and b3. That is, the second hidden layer 110-15 may include information on three node values. The NPU scheduler 130 may set, in the NPU internal memory 120, a memory address for storing the information on the node values of the second hidden layer 110-15.

The NPU scheduler 130 may be configured to schedule the operation order so that the fourth processing element PE4 performs the MAC operation of node b1 of the second hidden layer 110-15. The NPU scheduler 130 may be configured to schedule the operation order so that the fifth processing element PE5 performs the MAC operation of node b2 of the second hidden layer 110-15. The NPU scheduler 130 may be configured to schedule the operation order so that the sixth processing element PE6 performs the MAC operation of node b3 of the second hidden layer 110-15.

Here, the NPU scheduler 130 may schedule the operation order in advance so that the three processing elements each perform their MAC operations simultaneously in parallel.

Here, the NPU scheduler 130 may determine the scheduling so that the operation of the second hidden layer 110-15 is performed after the MAC operation of the first hidden layer 110-13 of the artificial neural network model.

That is, the NPU scheduler 130 may be configured to control the processing element array 110 and the NPU internal memory 120 based on the data locality information or structure information of the artificial neural network model.

The third connection network 110-16 may include, for example, information on six weight values connecting each node of the second hidden layer 110-15 to each node of the output layer 110-17. The NPU scheduler 130 may set, in the NPU internal memory 120, a memory address for storing the information on the weight values of the third connection network 110-16. Each weight value of the third connection network 110-16 is multiplied by the corresponding node value input from the second hidden layer 110-15, and the accumulated sum of the products is stored in the output layer 110-17.

The output layer 110-17 may include, for example, nodes y1 and y2. That is, the output layer 110-17 may include information on two node values. The NPU scheduler 130 may set, in the NPU internal memory 120, a memory address for storing the information on the node values of the output layer 110-17.

The NPU scheduler 130 may be configured to schedule the operation order so that the seventh processing element PE7 performs the MAC operation of node y1 of the output layer 110-17. The NPU scheduler 130 may be configured to schedule the operation order so that the eighth processing element PE8 performs the MAC operation of node y2 of the output layer 110-17.

Here, the NPU scheduler 130 may schedule the operation order in advance so that the two processing elements each perform their MAC operations simultaneously in parallel.

Here, the NPU scheduler 130 may determine the scheduling so that the operation of the output layer 110-17 is performed after the MAC operation of the second hidden layer 110-15 of the artificial neural network model.
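
The node-to-element assignment walked through above can be written out as a small table. The following Python sketch is illustrative only: `build_schedule` and the layer labels are hypothetical names, and the one-node-per-element assignment simply mirrors the example of FIG. 4 rather than any particular scheduler implementation.

```python
def build_schedule(layers):
    """Assign one processing element per node, layer by layer.

    layers: list of (layer_label, [node names]) in execution order.
    Returns a list of (step, layer_label, {PE name: node name}); the nodes
    within one step are meant to run in parallel, and the steps run in order.
    """
    schedule = []
    pe_index = 1  # processing elements are numbered PE1, PE2, ...
    for step, (layer_label, nodes) in enumerate(layers):
        assignments = {}
        for node in nodes:
            assignments[f"PE{pe_index}"] = node
            pe_index += 1
        schedule.append((step, layer_label, assignments))
    return schedule


# The three computed layers of the example model of FIG. 4.
example_layers = [
    ("first hidden layer 110-13", ["a1", "a2", "a3"]),
    ("second hidden layer 110-15", ["b1", "b2", "b3"]),
    ("output layer 110-17", ["y1", "y2"]),
]
example_schedule = build_schedule(example_layers)
```

For this input the sketch assigns PE1 to PE3 to nodes a1 to a3, PE4 to PE6 to nodes b1 to b3, and PE7 and PE8 to nodes y1 and y2, with each step starting only after the previous one.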

In summary, the NPU scheduler 130 may analyze the structure of the artificial neural network model to be run on the processing element array 110, or may be provided with the analyzed information. The information of the artificial neural network that the model may include comprises information on the node values of each layer, the arrangement data, locality information, or structure information of the layers, and information on the weight values of each connection network connecting the nodes of each layer.

Because the NPU scheduler 130 has been provided with the data locality information or structure information of the exemplary artificial neural network model 110-10, it can determine the operation order of the artificial neural network model 110-10 from input to output.
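
The input-to-output order of the example model can be traced with a few lines of code. The following Python sketch assumes the 2-3-3-2 structure of FIG. 4 and uses placeholder weights of 1 that are not taken from the disclosure; `layer_mac` and `infer` are hypothetical names. Activation functions are omitted because the passage describes only the MAC operations.

```python
def layer_mac(node_values, weights):
    """One layer: each output node is the accumulated sum of input * weight.

    weights[j][i] is the weight connecting input node i to output node j.
    """
    return [sum(x * w for x, w in zip(node_values, row)) for row in weights]


def infer(input_values, connection_networks):
    """Run the layers strictly in order, from the input layer to the output layer."""
    nodes = input_values
    for weights in connection_networks:
        nodes = layer_mac(nodes, weights)
    return nodes


# Placeholder weights (all 1) with the shapes described for FIG. 4.
first_network = [[1, 1] for _ in range(3)]      # 2 inputs -> 3 nodes: 6 weights
second_network = [[1, 1, 1] for _ in range(3)]  # 3 nodes -> 3 nodes: 9 weights
third_network = [[1, 1, 1] for _ in range(2)]   # 3 nodes -> 2 outputs: 6 weights
networks = [first_network, second_network, third_network]

weight_counts = [sum(len(row) for row in net) for net in networks]
outputs = infer([1, 2], networks)  # x1 = 1, x2 = 2
```

With x1 = 1 and x2 = 2, every first hidden node is 3, every second hidden node is 9, and both outputs are 27, which makes the layer-by-layer dependency easy to follow by hand.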

Accordingly, the NPU scheduler 130 may set, in the NPU internal memory 120, the memory addresses at which the MAC operation values of each layer are stored, taking the scheduling order into account.
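
One simple way to picture this address assignment is a table built in scheduling order. The following Python sketch is a rough illustration under stated assumptions: `assign_addresses` is a hypothetical name, sizes are counted in abstract words rather than bytes, and packing the regions back to back is an assumed layout, since the disclosure does not specify how addresses are chosen.

```python
def assign_addresses(regions, base_address=0):
    """Give each region a start address, packed contiguously in the given order.

    regions: list of (name, size_in_words) in scheduling order.
    Returns (address table, first free address after the last region).
    """
    table = {}
    next_address = base_address
    for name, size in regions:
        table[name] = next_address
        next_address += size
    return table, next_address


# Node values and weight values of the example model of FIG. 4, in operation order.
example_regions = [
    ("input layer 110-11", 2),
    ("first connection network 110-12", 6),
    ("first hidden layer 110-13", 3),
    ("second connection network 110-14", 9),
    ("second hidden layer 110-15", 3),
    ("third connection network 110-16", 6),
    ("output layer 110-17", 2),
]
address_table, end_address = assign_addresses(example_regions)
```

Because every address is known before inference starts, the read and write order of the memory can be fixed in advance along with the operation order.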

The NPU internal memory 120 may be configured to retain the weight data of the connection networks stored in the NPU internal memory 120 for as long as the inference operation of the NPU 100 continues. This has the effect of reducing memory read and write operations.

That is, the NPU internal memory 120 may be configured to reuse the MAC operation values stored in the NPU internal memory 120 for as long as the inference operation continues.

FIG. 5 is a diagram for explaining the basic structure of a convolutional neural network.

Referring to FIG. 5, a convolutional neural network may be a combination of one or more convolutional layers, pooling layers, and fully connected layers. A convolutional neural network has a structure suitable for training and inference on two-dimensional data, and can be trained through the backpropagation algorithm.

In the example of the present disclosure, the convolutional neural network has, for each channel, a kernel that extracts features of the input image of that channel. The kernel may be composed of a two-dimensional matrix and performs a convolution operation while traversing the input data. The size of the kernel may be determined arbitrarily, and the interval (stride) at which the kernel traverses the input data may also be determined arbitrarily. The result of convolving one kernel over the entire input data may be referred to as a feature map or an activation map. Hereinafter, a kernel may include one set of weight values or a plurality of sets of weight values. The number of kernels in each layer may be referred to as the number of channels.

Because the convolution operation is a linear operation that combines the input data with the kernel, an activation function for adding nonlinearity may be applied afterwards. When an activation function is applied to a feature map that is the result of a convolution operation, the result may be referred to as an activation map.

Specifically, referring to FIG. 5, the convolutional neural network includes at least one convolutional layer, at least one pooling layer, and at least one fully connected layer.

For example, a convolution may be defined by two main parameters: the size of the input data (typically a 1 x 1, 3 x 3, or 5 x 5 matrix) and the depth of the output feature map (the number of kernels). These main parameters may be computed by the convolution. The convolutions may start at a depth of 32, continue to a depth of 64, and end at a depth of 128 or 256. The convolution operation may refer to sliding a kernel of size 3 x 3 or 5 x 5 over the input image matrix, which is the input data, multiplying each weight of the kernel by the element of the input image matrix that it overlaps, and then summing all the products.

An activation function may be applied to the output feature map generated in this way, so that an activation map is finally output. In addition, the weights used in the current layer may be passed to the next layer through the convolution. The pooling layer may perform a pooling operation that reduces the size of the feature map by downsampling the output data (i.e., the activation map). For example, the pooling operation may include, but is not limited to, max pooling and/or average pooling.

The max pooling operation uses a kernel: the kernel slides over the feature map and outputs the maximum value in the region of the feature map that it overlaps. The average pooling operation likewise slides the kernel over the feature map and outputs the average value within the overlapped region. Because the pooling operation reduces the size of the feature map in this way, the number of weights of the feature map is also reduced.
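The two pooling operations described above can be sketched as follows. This is a minimal illustration in plain Python; the function name `pool2d`, the 2 x 2 window, the stride of 2, and the sample values are assumptions chosen for the example.

```python
def pool2d(fmap, size=2, stride=2, mode="max"):
    """Slide a size x size window over a 2-D list and reduce each window."""
    rows = (len(fmap) - size) // stride + 1
    cols = (len(fmap[0]) - size) // stride + 1
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Collect the feature-map values overlapped by the window.
            window = [fmap[r * stride + i][c * stride + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        out.append(row)
    return out


fmap = [[1, 3, 2, 4],
        [5, 6, 1, 2],
        [7, 2, 9, 0],
        [1, 8, 3, 4]]
max_pooled = pool2d(fmap, mode="max")      # [[6, 4], [8, 9]]
avg_pooled = pool2d(fmap, mode="average")  # [[3.75, 2.25], [4.5, 4.0]]
```

In both modes the 4 x 4 feature map shrinks to 2 x 2, which is the downsampling effect the text refers to.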

The fully connected layer may classify the data output through the pooling layer into a plurality of classes (i.e., estimated values), and output the classified classes and their scores. The data output through the pooling layer forms a three-dimensional feature map, and this three-dimensional feature map may be converted into a one-dimensional vector and input to the fully connected layer.
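A minimal sketch of this flatten-then-classify step is given below. The helper names `flatten` and `fully_connected`, the 2-channel 2 x 2 feature map, and the weight and bias values are all hypothetical.

```python
def flatten(fmap3d):
    """Convert a channels x rows x cols nested list into a 1-D list."""
    return [v for channel in fmap3d for row in channel for v in row]


def fully_connected(vector, weights, biases):
    """Return one score per class: dot(weights[k], vector) + biases[k]."""
    return [sum(w * x for w, x in zip(w_row, vector)) + b
            for w_row, b in zip(weights, biases)]


# Hypothetical 2-channel, 2 x 2 feature map and a 2-class layer
fmap3d = [[[1, 0], [2, 1]],
          [[0, 3], [1, 1]]]
vec = flatten(fmap3d)  # 8 values
weights = [[1, 0, 0, 0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0, 0, 0, 1]]
biases = [0, 0]
scores = fully_connected(vec, weights, biases)
best_class = scores.index(max(scores))
```

Each score is itself a sequence of multiply-and-accumulate steps, so the same MAC hardware that serves the convolutional layers can serve this layer.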

FIG. 6 is a diagram for explaining the input data of a convolutional layer and the kernel used for the convolution operation.

The input data 300 may be an image or video represented as a two-dimensional matrix composed of rows 310 of a specific size and columns 320 of a specific size. The input data 300 may be referred to as a feature map. The input data 300 may have a plurality of channels 330, where the channels 330 may represent the RGB color channels of the input data image.

Meanwhile, the kernel 340 may be a weight parameter used in the convolution for extracting the features of a certain portion of the input data 300 while scanning that portion. Like the input data image, the kernel 340 may be configured to have rows 350 of a specific size, columns 360 of a specific size, and a specific number of channels 370. In general, the rows 350 and columns 360 of the kernel 340 are set to the same size, and the number of channels 370 may be the same as the number of channels 330 of the input data image.

FIG. 7 is a diagram for explaining the operation of a convolutional neural network that generates a feature map using a kernel.

The kernel 410 may finally generate the feature map 430 by traversing the input data 420 at a specified interval and performing convolution. When the kernel 410 is applied to a portion of the input data 420, the convolution may be performed by multiplying the input data value at each position of that portion by the value at the corresponding position of the kernel 410, and then summing all the resulting products.

Through this convolution process, the values of the feature map are calculated: each time the kernel 410 moves to a new position on the input data 420, a convolution result is generated, and these results together constitute the feature map 430.

Each element value of the feature map may be converted into the activation map 430 through the activation function of the convolutional layer.

In FIG. 7, the input data 420 input to the convolutional layer is represented as a two-dimensional matrix of size 4 x 4, and the kernel 410 is represented as a two-dimensional matrix of size 3 x 3. However, the sizes of the input data 420 and the kernel 410 of the convolutional layer are not limited thereto, and may be changed in various ways according to the performance and requirements of the convolutional neural network that includes the convolutional layer.

As illustrated, when the input data 420 is input to the convolutional layer, the kernel 410 traverses the input data 420 at a predetermined interval (e.g., stride = 1) and may perform a MAC operation that multiplies the values of the input data 420 and the kernel 410 at the same positions and sums the products.

Specifically, the kernel 410 assigns the MAC operation value "15" calculated at a specific position 421 of the input data 420 to the corresponding element 431 of the feature map 430. The kernel 410 assigns the MAC operation value "16" calculated at the next position 422 of the input data 420 to the corresponding element 432 of the feature map 430. The kernel 410 assigns the MAC operation value "6" calculated at the next position 423 of the input data 420 to the corresponding element 433 of the feature map 430. Finally, the kernel 410 assigns the MAC operation value "15" calculated at the next position 424 of the input data 420 to the corresponding element 434 of the feature map 430.

When the kernel 410 has assigned to the feature map 430 all the MAC operation values calculated while traversing the input data 420 in this way, the feature map 430 of size 2 x 2 is complete.
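The traversal described above can be sketched in plain Python. The input and kernel values below are hypothetical: the text does not reproduce the numbers drawn in FIG. 7, so they were chosen only so that a 4 x 4 input and a 3 x 3 kernel at stride 1 yield the MAC values 15, 16, 6, and 15 quoted above. The function name `conv2d` is likewise an assumption.

```python
def conv2d(data, kernel, stride=1):
    """Convolution without padding, in the CNN sense (the kernel is not flipped)."""
    k = len(kernel)
    rows = (len(data) - k) // stride + 1
    cols = (len(data[0]) - k) // stride + 1
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0  # accumulator for one sequence of MAC operations
            for i in range(k):
                for j in range(k):
                    acc += data[r * stride + i][c * stride + j] * kernel[i][j]
            out[r][c] = acc
    return out


# Hypothetical values, not taken from the figure
data = [[1, 2, 3, 0],
        [0, 1, 2, 3],
        [3, 0, 1, 2],
        [2, 3, 0, 1]]
kernel = [[2, 0, 1],
          [0, 1, 2],
          [1, 0, 2]]
feature_map = conv2d(data, kernel, stride=1)  # [[15, 16], [6, 15]]
```

Each output element requires nine multiply-and-accumulate steps, so the 2 x 2 feature map costs 36 MAC operations in total.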

In this case, if the input data 420 is composed of, for example, three channels (R channel, G channel, and B channel), a feature map may be generated for each channel through convolution, in which the same kernel, or a different kernel for each channel, traverses the data of each channel of the input data 420 and performs the multiplications and summation.

For the MAC operations, the scheduler 130 may allocate the processing elements PE1 to PE12 that are to perform each MAC operation based on a preset operation order, and may set, in the NPU internal memory 120, the memory addresses at which the MAC operation values are stored, in consideration of the scheduling order.

FIG. 8 is an overall diagram that illustrates the operation of a convolutional neural network in an easy-to-understand manner.

Referring to FIG. 8, the input image is shown, by way of example, as a two-dimensional matrix of size 5 x 5. In addition, FIG. 8 shows, by way of example, that three nodes are used, namely channel 1, channel 2, and channel 3.

First, the convolution operation of layer 1 will be described.

The input image is convolved with kernel 1 for channel 1 at the first node of layer 1, and feature map 1 is output as a result. The input image is also convolved with kernel 2 for channel 2 at the second node of layer 1, and feature map 2 is output as a result. Likewise, the input image is convolved with kernel 3 for channel 3 at the third node, and feature map 3 is output as a result.

Next, the pooling operation of layer 2 will be described.

Feature map 1, feature map 2, and feature map 3 output from layer 1 are input to the three nodes of layer 2. Layer 2 may receive the feature maps output from layer 1 as input and perform pooling. Pooling may reduce the size of a matrix or emphasize specific values in it. Pooling methods include max pooling, average pooling, and min pooling. Max pooling is used to collect the maximum value within a specific region of the matrix, and average pooling may be used to obtain the average within a specific region.

In order to process each convolution, the processing elements PE1 to PE12 of the NPU 100 are configured to perform MAC operations.

In the example of FIG. 8, a feature map in the form of a 5 x 5 matrix is reduced to a 4 x 4 matrix by pooling.

Specifically, the first node of layer 2 receives feature map 1 for channel 1 as input, performs pooling, and outputs, for example, a 4 x 4 matrix. The second node of layer 2 receives feature map 2 for channel 2 as input, performs pooling, and outputs, for example, a 4 x 4 matrix. The third node of layer 2 receives feature map 3 for channel 3 as input, performs pooling, and outputs, for example, a 4 x 4 matrix.
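The matrix sizes quoted in these examples follow from the standard output-size relation for a window sliding over an input without padding. The sketch below assumes a 2 x 2 pooling window with stride 1 for FIG. 8; the text does not state the window size, but this is one setting that turns 5 x 5 into 4 x 4. The function name `output_size` is hypothetical.

```python
def output_size(n, k, stride=1):
    """Side length of the output when a k x k window slides over an n x n input (no padding)."""
    return (n - k) // stride + 1


conv_out = output_size(4, 3, stride=1)  # FIG. 7 example: 4 x 4 input, 3 x 3 kernel
pool_out = output_size(5, 2, stride=1)  # FIG. 8 example, assuming a 2 x 2 window
```

Here `conv_out` is 2 and `pool_out` is 4, matching the 2 x 2 feature map of FIG. 7 and the 4 x 4 pooled output of FIG. 8.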

Next, the convolution operation of layer 3 will be described.

The first node of layer 3 receives the output of the first node of layer 2 as input, performs convolution with kernel 4, and outputs the result. The second node of layer 3 receives the output of the second node of layer 2 as input, performs convolution with kernel 5 for channel 2, and outputs the result. Similarly, the third node of layer 3 receives the output of the third node of layer 2 as input, performs convolution with kernel 6 for channel 3, and outputs the result.

Convolution and pooling are repeated in this way, and the result may finally be input to a fully connected layer, as shown in FIG. 7.

The CNN described above is also widely used in the field of autonomous driving.

FIG. 9A shows an example of an autonomous vehicle to which the disclosures of the present specification are applied, and FIG. 9B shows the autonomous driving levels defined by the Society of Automotive Engineers (SAE International).

Referring to FIG. 9A, an autonomous vehicle may be equipped with LiDAR (Light Detection And Ranging), RADAR, a camera, GPS, an ultrasonic sensor, an NPU, and the like.

The inventor of this patent has studied an NPU that can assist autonomous driving by using deep learning.

For autonomous driving, the NPU must satisfy four key technical requirements.

1. Perception: Using sensors, the NPU should be able to sense, understand, and interpret the surrounding environment, including static and dynamic obstacles such as other vehicles, pedestrians, road signs, traffic signals, and road curbs.

2. Localization & Mapping: The NPU should be able to locate the vehicle, build a map of the vehicle's surroundings, and continuously track the vehicle's position with respect to that map.

3. Path planning: Using the outputs of the previous two tasks, the NPU should be able to select an optimal, safe, and feasible path for the vehicle to reach its destination while taking obstacles on the road into account.

4. Control: Based on the selected path, the NPU should be able to output the acceleration, torque, and steering angle values required for the vehicle to follow that path.

Meanwhile, autonomous driving technology requires an Advanced Driver Assistance System (ADAS) and/or Driver's Status Monitoring (DSM). ADAS and DSM include the following technologies, among others.

- Smart Cruise Control (SCC)

- Autonomous Emergency Braking (AEB)

- Smart Parking Assistance System (SPAS)

- Lane Departure Warning System (LDWS)

- Lane Keeping Assist System (LKAS)

- Drowsiness detection, alcohol detection, heat and cold detection, inattention detection, detection of infants left unattended, and the like

Various sensors are used in ADAS technology, and the outputs of these sensors can be used as input signals for deep learning.

- RGB CAMERA SENSOR (380nm~680nm)

- RGB CAMERA with Polarizer

- DEPTH CAMERA SENSOR

- NIR CAMERA SENSOR (850nm~940nm)

- THERMAL CAMERA SENSOR (9,000nm~14,000nm)

- RGB+IR HYBRID SENSOR (380nm~940nm)

- RADAR SENSOR

- LIDAR SENSOR

- ULTRASOUND SENSOR

Meanwhile, with reference to FIG. 9B, each of the autonomous driving levels defined by the Society of Automotive Engineers (SAE International) is described as follows.

Level 0, the no-automation level, corresponds to a manually driven vehicle without V2X (vehicle-to-everything) communication that provides Forward Collision-Avoidance Assist (FCA) and Blind-Spot Collision Warning (BCW), in which the system merely issues warnings and intervenes temporarily for safety while driving. Accordingly, at level 0, the driver must perform all vehicle control.

Level 1, the driver assistance level, corresponds to a manually driven vehicle in which the system performs either steering or acceleration/deceleration in a specific driving mode, and which provides Lane Following Assist (LFA), Smart Cruise Control (SCC), and the like. Accordingly, at level 1, the driver must remain aware of the speed and the like.

Level 2, the partial automation level, corresponds to an autonomous vehicle in which the system performs both steering and acceleration/deceleration in a specific driving mode, and which provides Highway Driving Assist (HDA) and the like. Accordingly, at level 2, the driver must remain aware of objects and the like.

Up to level 2, the system assists with part of the driving (assist), whereas from level 3 onward the system can perform the entire driving task (pilot). That is, the vehicle can change lanes on its own, overtake the vehicle ahead, and avoid obstacles.

At level 3, the conditional automation level, the system controls the vehicle while recognizing the driving environment, but may need to request that the driver take over driving control in an emergency. Accordingly, at level 3, the driver must remain aware of specific road conditions and the like.

Level 4, the high automation level, is the level at which the system performs the entire driving task as in level 3 and can also respond safely when a dangerous situation occurs. Accordingly, at level 4, the driver must remain aware of the weather, disasters, and accidents.

Level 5, the full automation level, is the level at which, unlike level 4, there is no restriction on the areas in which autonomous driving is possible. At level 5, driver awareness is unnecessary.

<Processing of different data signals provided from heterogeneous sensors>

To improve autonomous driving performance, there is a growing need for fusion algorithms that process the different data provided from heterogeneous sensors. Fusion algorithms are introduced below.

FIG. 10 is an exemplary diagram illustrating a fusion algorithm.

As shown in FIG. 10, a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network) may be used, by way of example, to process different data provided from heterogeneous sensors. The CNN may be used to detect an object within a single image, and the RNN may be used to predict an object by making use of the time dimension. In addition, R-CNN (Region-based CNN), SPP-Net (Spatial Pyramid Pooling network), YOLO (You Only Look Once), SSD (Single-Shot Multibox Detector), DSSD (Deconvolutional Single-Shot Multibox Detector), LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit), and the like may be used.

FIG. 11A is an exemplary diagram illustrating an example of recognizing objects, and FIG. 11B is an exemplary diagram illustrating the structure of an SSD.

As can be seen from FIG. 11A, a plurality of objects can be detected in an image by using an SSD artificial neural network model. Referring to FIG. 11B, the SSD may detect objects in the feature map of each stage. For example, the SSD may be combined with a backbone having a VGG structure or a MobileNet structure.

FIG. 12A shows an example of an artificial neural network that uses a radar mounted on a vehicle, and FIG. 12B shows an example of a fusion processing method that uses a radar and a camera.

To process the signal provided from the radar, the artificial neural network shown in FIG. 12A may include convolution, pooling, ResNet, and the like.

To process the signal provided from the radar together with the RGB signal provided from the camera, the fusion artificial neural network shown in FIG. 12B may be used.

FIG. 13 shows an example of a fusion artificial neural network that uses a LiDAR and a camera.

Referring to FIG. 13, an example is shown in which the RGB signal provided from the camera and the signal provided from the LiDAR are processed in parallel. During the parallel processing, the two branches may exchange information with each other through a transformer. This method may be the deep fusion method shown in FIG. 14.

Meanwhile, although not shown, the artificial neural network may include a concatenation operation and a skip-connection operation in order to process different data provided from heterogeneous sensors. The concatenation operation merges the output results of specific layers with each other, and the skip-connection operation passes the output result of a specific layer to another layer, skipping the subsequent layer.

These concatenation and skip-connection operations may increase both the control difficulty and the usage of the internal memory 120 of the NPU 100.
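The two operations, and the memory pressure they create, can be seen in a minimal sketch. The stand-in `layer` function, the element-wise addition used to merge the skipped output (a ResNet-style choice, assumed here), and the sample values are all hypothetical.

```python
def concatenate(a, b):
    """Concatenation: join the outputs of two layers along the channel axis."""
    return a + b  # each argument is a list of channels


def layer(x, scale):
    """Stand-in for a neural network layer: scales every value."""
    return [[v * scale for v in channel] for channel in x]


def add(a, b):
    """Element-wise sum of two tensors with the same shape."""
    return [[p + q for p, q in zip(ca, cb)] for ca, cb in zip(a, b)]


camera = [[1, 2], [3, 4]]  # 2 channels
radar = [[10, 20]]         # 1 channel
fused = concatenate(camera, radar)  # 3 channels

# Skip-connection: the output of layer A bypasses layer B and is merged after it.
a_out = layer(fused, 2)
b_out = layer(a_out, 3)         # a_out must stay in memory while B runs
skip_out = add(b_out, a_out)
```

In this sketch `a_out` cannot be released until `skip_out` has been computed, which illustrates why such operations raise both the amount of memory in use and the bookkeeping needed to manage it.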

So far, artificial neural networks for fusing and processing different data provided from heterogeneous sensors have been described; however, the approaches described above are not sufficient on their own to improve the performance of the artificial neural network. Therefore, an optimized artificial neural network and NPU structure are described below.

<Fusion artificial neural network and NPU structure optimized for processing different data from heterogeneous sensors>

First, the inventor of this patent has studied an NPU for processing different data from heterogeneous sensors.

The following should be considered in the design of the NPU:

i. An NPU structure suitable for processing heterogeneous data signals (e.g., RGB camera + radar) is required.

ii. NPU memory control suitable for processing heterogeneous input signals (e.g., RGB camera + radar) is required.

iii. An NPU structure suitable for multiple input channels (ADAS and DSM) is required.

iv. NPU memory control suitable for multiple input channels (ADAS and DSM) is required.

v. An NPU structure suitable for computing fusion artificial neural network models is required.

vi. A fast processing time of 16 ms or less is required for real-time applications.

vii. Low power consumption is required for battery-powered operation.

An NPU for implementing a fusion artificial neural network must support the following functions. The expected requirements are as follows:

i. CNN support: The NPU should be able to control a PE array and memory optimized for convolution.

ii. Depthwise-separable convolution should be processed efficiently. The NPU should have a structure that improves PE utilization and throughput.

iii. Batch mode support: A memory configuration is required so that multiple channels (cameras 1 to 6) and heterogeneous sensors can be processed simultaneously. (The size of the PE array and the size of the memory should be in an appropriate ratio.)

iv. Concatenation support: An NPU for a fusion artificial neural network should be able to process heterogeneous input data signals with a concatenation function.

v. Skip-connection support: An NPU for a fusion artificial neural network may include an SFU (Special Function Unit) capable of providing a skip-connection function.

vi. Support for deep learning image preprocessing: An NPU for a fusion artificial neural network should be able to preprocess different data signals.

vii. A compiler capable of efficiently compiling a fusion artificial neural network should be provided.
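Regarding item ii above, the reason depthwise-separable convolution calls for particular attention to PE utilization can be seen from a MAC count. The sketch below uses the standard counting formulas for the two layer types; the layer dimensions and function names are hypothetical and do not come from the disclosure.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs of a standard convolution producing an h x w x c_out output."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs of a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k   # one k x k kernel per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 56 x 56 output, 64 input channels, 128 output channels, 3 x 3 kernel
std = standard_conv_macs(56, 56, 64, 128, 3)
dws = depthwise_separable_macs(56, 56, 64, 128, 3)
ratio = dws / std  # equals 1/c_out + 1/k**2, about 0.119 for these values
```

The separable form needs roughly one eighth of the MACs in this example, but the depthwise stage accumulates over only k x k values per output, so a PE array sized for deep channel accumulation may sit partly idle unless the structure accounts for it.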

The inventor of this patent proposes an NPU having the following features.

i. The NPU may include a compiler that analyzes the ANN data locality information of an artificial neural network, such as a late fusion, early fusion, or deep fusion network.

ii. The NPU may be configured to control the PE array to process heterogeneous sensor data based on an ADC (artificial neural network data locality controller). That is, a fused artificial neural network is fused into various structures depending on the sensors, and the PE utilization rate can be improved by providing an NPU 100 that corresponds to the structure.

iii. The size of the on-chip memory 120 may be set appropriately to process heterogeneous sensor data based on the ANN data locality information. That is, analyzing the ANN data locality information can improve the memory bandwidth of an NPU that processes a fusion artificial neural network.

iv. The NPU may include an SFU (Special Function Unit) capable of efficiently processing bilinear interpolation, concatenation, skip-connection, and other operations required in a fusion artificial neural network.

FIG. 14 is an exemplary diagram illustrating late fusion, early fusion, and deep fusion.

Referring to FIG. 14, "F" denotes a fusion operation, and each block denotes a layer. As can be seen from FIG. 14, late fusion means performing the operations layer by layer and fusing the operation results in the final step. Early fusion means fusing different data at an early stage and then performing the operations layer by layer. Deep fusion means fusing different data, performing operations in different layers, fusing the operation results again, and then performing the operations layer by layer.
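The three fusion orderings described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the `layer` function is a hypothetical stand-in for a real network layer, `fuse` assumes that the fusion operation "F" is a channel-wise concatenation, and the branch structure of `deep_fusion` is one possible reading of FIG. 14.

```python
import numpy as np

def layer(x, scale):
    """Hypothetical stand-in for one network layer: a scaled ReLU."""
    return np.maximum(x * scale, 0.0)

def fuse(a, b):
    """Assumed fusion operation F: concatenation along the channel axis."""
    return np.concatenate([a, b], axis=-1)

def late_fusion(cam, lidar):
    # Each sensor branch runs its own layers; results are fused at the end.
    a = layer(layer(cam, 1.0), 0.5)
    b = layer(layer(lidar, 1.0), 0.5)
    return fuse(a, b)

def early_fusion(cam, lidar):
    # Inputs are fused first; the fused tensor then passes through the layers.
    x = fuse(cam, lidar)
    return layer(layer(x, 1.0), 0.5)

def deep_fusion(cam, lidar):
    # Fuse, process in separate layers, fuse again, then continue layer by layer.
    x = fuse(cam, lidar)
    a = layer(x, 1.0)
    b = layer(x, 0.5)
    return layer(fuse(a, b), 1.0)

cam = np.ones((4, 4, 3))    # e.g. an RGB feature map (H, W, C)
lidar = np.ones((4, 4, 1))  # e.g. a single-channel depth map
```

All three functions accept the same pair of inputs; only the position of `fuse` relative to the layers changes, which is what alters the data locality the compiler has to plan for.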

FIG. 15 is an exemplary diagram illustrating a system including an NPU architecture according to a first example.

As shown in FIG. 15, the NPU 100 may include a PE array 110 for a fusion artificial neural network, an on-chip memory 120, an NPU scheduler 130, and a special function unit (SFU) 160. In describing FIG. 15, redundant descriptions may be omitted merely for convenience of description.

The PE array 110 for a fusion artificial neural network may refer to a PE array 110 configured to process the convolutions of a multi-layer artificial neural network model having at least one fusion layer. That is, the fusion layer may be configured to output a feature map in which data from heterogeneous sensors are fused. In more detail, the SFU 160 of the NPU 100 may be configured to receive inputs from multiple sensors and to provide a function of fusing the respective sensor inputs. The PE array 110 for the fusion artificial neural network may be configured to receive the fused data from the SFU 160 and to process convolution on it.

The NPU 100 may receive different data from M heterogeneous sensors 311 and 312. The heterogeneous sensors may include a camera, a radar, a LiDAR, an ultrasonic sensor, a thermal imaging camera, and the like.

The NPU 100 may obtain fusion artificial neural network (ANN) data locality information from the compiler 200.

At least one layer of the fusion artificial neural network may be a layer in which input data from a plurality of sensors are fused.

The NPU 100 may be configured to provide a concatenation function in at least one layer for the fusion of heterogeneous sensor input data. For the feature maps of the heterogeneous sensors in a concatenated layer to be concatenated with each other, the size of at least one axis may be made the same. For example, to concatenate heterogeneous sensor data along the X-axis, the X-axis sizes of the respective heterogeneous sensor data may be the same. For example, to concatenate heterogeneous sensor data along the Y-axis, the Y-axis sizes of the respective heterogeneous sensor data may be the same. For example, to concatenate heterogeneous sensor data along the Z-axis, the Z-axis sizes of the respective heterogeneous sensor data may be the same.

In order to receive and process different data from the heterogeneous sensors 311 and 312, the NPU scheduler 130 may process the inference of a fusion artificial neural network model.

The NPU scheduler 130 may be included in the control unit, as illustrated.

The NPU scheduler 130 may obtain and analyze the data locality information of the fusion artificial neural network from the compiler 200 and may control the operation of the on-chip memory 120.

This is described in detail as follows. The compiler 200 may generate the data locality information of the fusion artificial neural network to be processed by the NPU 100.

The NPU scheduler 130 may generate a list of the special functions required for the fusion artificial neural network. A special function may mean any of the various functions, other than convolution, that are required for artificial neural network operations.

By making good use of the fusion artificial neural network data locality information, the increase in memory accesses that frequently occurs in fusion artificial neural networks, caused by operations such as non-maximum suppression (NMS), skip-connection, bottleneck, and bilinear interpolation, can be controlled efficiently.

When the fusion artificial neural network data locality information is used, the size and the storage period of the data that must be kept (for example, the first output feature map) until the first output feature map information, which is computed earlier, is fused with the second output feature map information, which is processed later, are known at the compilation stage. The memory map for the on-chip memory 120 can therefore be set efficiently in advance.

The SFU 160 may perform the skip-connection and concatenation required for the fusion artificial neural network. In more detail, concatenation may be used to fuse heterogeneous sensor data. For concatenation, the size of each set of sensor data may be readjusted. For example, the NPU 100 may be configured to process the concatenation of the fusion artificial neural network by providing functions such as resizing and interpolation.
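As an illustration of resizing followed by concatenation, the sketch below brings a smaller feature map to the spatial size of a larger one and then joins the two along the channel axis. It assumes nearest-neighbor resizing and an (H, W, C) layout; the function names `resize_nearest` and `fuse_by_concat` are hypothetical and are not taken from the disclosure, which does not specify the resizing method.

```python
import numpy as np

def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbor resize of an (H, W, C) feature map."""
    h, w, _ = fmap.shape
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    return fmap[rows][:, cols]

def fuse_by_concat(fmap_a, fmap_b):
    """Resize fmap_b to fmap_a's spatial size, then concatenate on channels."""
    h, w, _ = fmap_a.shape
    resized = resize_nearest(fmap_b, h, w)
    return np.concatenate([fmap_a, resized], axis=2)

camera = np.zeros((8, 8, 16))  # assumed camera feature map
radar = np.ones((4, 4, 8))     # assumed lower-resolution radar feature map
fused = fuse_by_concat(camera, radar)
```

Here the two inputs disagree in height and width, so the resize step is what makes the channel-wise concatenation legal; a bilinear interpolation unit could be substituted for `resize_nearest` without changing the rest of the flow.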

The on-chip memory 120 of the NPU 100 may selectively retain specific data of the PE array 110 or the SFU 160 for a specific period based on the artificial neural network data locality information. Whether the data is selectively retained may be controlled by the control unit 130.

In addition, the PE array 110 may be configured to have a number of threads corresponding to the number of heterogeneous sensors. That is, the PE array 110 of an NPU 100 configured to receive data from two sensors may be configured to have two threads. If one thread consists of N x M processing elements, two threads may consist of N x M x 2 processing elements. For example, each thread of the PE array 110 may be configured to process the feature map of a respective heterogeneous sensor.

The NPU 100 may output the operation result of the fusion artificial neural network through an output unit.

The NPU architecture according to the first example described above may be modified in various ways.

FIG. 16A is an exemplary diagram illustrating an artificial neural network model that includes a skip-connection, and FIG. 16B is an exemplary diagram illustrating the data locality information of the artificial neural network that includes the skip-connection.

As shown in FIG. 16A, in order to compute five layers that include a skip-connection operation, the compiler 200 may, for example, generate artificial neural network data locality information having a sequence of 16 steps, as shown in FIG. 16B.

The NPU 100 requests data operations from the on-chip memory 120 in the order given by the artificial neural network data locality information.

In the case of the skip-connection operation, the output feature map (OFMAP) of the first layer may be added to the output feature map (OFMAP) of the fourth layer.

For such a skip-connection operation, the output feature map of the first layer must be retained until the fifth-layer operation. Other data, however, may be deleted after they have been used, in order to make use of the memory space.

Data to be computed later, according to the order of the artificial neural network data locality information, may be stored in the freed memory area. Accordingly, the necessary data can be brought into the on-chip memory 120 sequentially in the order of the artificial neural network data locality information, and data that will not be reused can be deleted. The operating efficiency of the on-chip memory 120 can therefore be improved even if its capacity is small.

Therefore, the NPU 100 may selectively retain or delete specific data in the on-chip memory 120 for a certain period based on the artificial neural network data locality information.

This mechanism may be applied not only to the skip-connection operation but also to various other operations such as concatenation, non-maximum suppression (NMS), and bilinear interpolation.

For example, for efficient control of the on-chip memory 120, the NPU 100 may, after performing the convolution operation of the second layer, cause the data of the first layer to be deleted except for the output feature map (OFMAP) of the first layer. As another example, after performing the operation of the third layer, the NPU 100 may cause the data of the second layer to be deleted while the OFMAP of the first layer is retained. As another example, after performing the operation of the fourth layer, the NPU 100 may cause the data of the third layer to be deleted while the OFMAP of the first layer is retained. In addition, after performing the operation of the fifth layer, the NPU 100 may cause the data of the fourth layer, together with the retained OFMAP of the first layer, to be deleted.
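The retention rule above amounts to a last-use analysis over the layer sequence. The following is a minimal sketch that tracks only output feature maps and assumes the five-layer example in which the first layer's OFMAP is held until the fifth layer; `plan_retention` and the buffer names are hypothetical, and the actual 16-step schedule of FIG. 16B is not reproduced.

```python
def plan_retention(num_layers, skip_edges):
    """Return, for each layer, the OFMAP buffers that may be freed once it finishes.

    skip_edges maps a producer layer to the layer that last consumes its
    output feature map (OFMAP). Without a skip edge, layer i's OFMAP is
    last consumed by layer i + 1.
    """
    last_use = {}
    for layer in range(1, num_layers + 1):
        last_use[layer] = skip_edges.get(layer, min(layer + 1, num_layers))

    free_after = {layer: [] for layer in range(1, num_layers + 1)}
    for producer, consumer in last_use.items():
        if producer != consumer:  # the final layer's OFMAP is the network output
            free_after[consumer].append(f"ofmap{producer}")
    return free_after

# Layer 1's OFMAP must survive until the layer-5 operation (skip-connection).
plan = plan_retention(5, {1: 5})
```

With this input, `plan` says that nothing is freed after layer 2, `ofmap2` is freed after layer 3, `ofmap3` after layer 4, and both `ofmap1` and `ofmap4` after layer 5. Because the plan is computed before execution, the on-chip memory map can be fixed at compile time.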

The artificial neural network data locality information means the data processing order that is generated by the compiler 200 and executed by the NPU 100, taking into account the conditions listed below.

1. The structure of the ANN model (a fusion artificial neural network, such as ResNet, YOLO, or SSD, designed to receive heterogeneous sensor data).

2. The structure of the processor (for example, the architecture of a CPU, GPU, or NPU).

In the case of an NPU, this includes the number of PEs, the structure of the PEs (for example, input stationary, output stationary, or weight stationary), and the structure of an SFU configured to operate in close cooperation with the PE array.

3. The size of the on-chip memory 120 (for example, when the cache is smaller than the data, a tiling algorithm needs to be applied).

4. The data size of each layer of the fusion artificial neural network model to be processed.

5. The processing policy. That is, the order in which the NPU 100 issues read requests is determined, for example whether the input feature map (IFMAP) or the kernel is requested first. This may vary depending on the processor or the compiler.

FIG. 17 is an exemplary diagram illustrating a system including an NPU architecture according to a second example.

As can be seen from FIG. 17, the NPU 100 may include a PE array 110 for a fusion artificial neural network, an on-chip memory 120, an NPU scheduler 130, and a special function unit (SFU) 160. In describing FIG. 17, redundant descriptions may be omitted merely for convenience of description.

The NPU 100 may receive different data from M heterogeneous sensors 311 and 312. The heterogeneous sensors may include a camera, a radar, a LiDAR, an ultrasonic sensor, a thermal imaging camera, and the like.

The NPU 100 may output N results (for example, heterogeneous inference results) through N output units. The heterogeneous data output from the NPU 100 may be classification, semantic segmentation, object detection, prediction, or the like.

FIG. 18 is an exemplary diagram illustrating a system including an NPU architecture according to a third example.

As can be seen from FIG. 18, the NPU 100 may include a PE array 110 for a fusion artificial neural network, an on-chip memory 120, an NPU scheduler 130, and a special function unit (SFU) 160. In describing FIG. 18, redundant descriptions may be omitted merely for convenience of description.

The NPU 100 may receive the data required for artificial neural network operations from the off-chip memory 500 through an artificial neural network data locality controller (ADC) 400.

The ADC 400 may manage data in advance based on the artificial neural network data locality information provided by the compiler 200.

Specifically, the ADC 400 may control the operation of the off-chip memory 500 either by receiving the artificial neural network data locality information of the fusion artificial neural network from the compiler 200 and analyzing it, or by receiving already-analyzed information from the compiler 200.

According to the fusion artificial neural network data locality information, the ADC 400 may read data stored in the off-chip memory 500 and cache it in advance in an internal buffer memory. The off-chip memory 500 may store all of the weight kernels of the fusion artificial neural network, while the on-chip memory 120 may store only at least some of those weight kernels, namely the ones required according to the artificial neural network data locality information. The memory capacity of the off-chip memory 500 may be greater than the memory capacity of the on-chip memory 120.

Based on the artificial neural network data locality information, the ADC 400 may be configured to prepare the data required by the NPU 100 from the off-chip memory 500 in advance, either in cooperation with the NPU 100 or independently of it, so as to reduce the latency of the inference operation of the NPU 100 or to improve its operating speed.
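Because the access order is known ahead of time, a controller can fetch the next items before they are requested. The sketch below is a simplified simulation of that idea, not the ADC's actual design: `simulate_fetches`, the `lookahead` parameter, and the dictionary standing in for the off-chip memory are all assumptions introduced for illustration.

```python
def simulate_fetches(schedule, off_chip, lookahead):
    """Count buffer hits when upcoming items are prefetched in schedule order.

    schedule  : access order known at compile time (list of keys)
    off_chip  : dict standing in for the off-chip memory
    lookahead : how many upcoming items the controller fetches ahead of use
    """
    buffer = {}
    hits = 0
    for step, key in enumerate(schedule):
        if key in buffer:
            hits += 1          # data is already waiting in the buffer
            buffer.pop(key)
        else:
            _ = off_chip[key]  # demand fetch: the NPU would stall here
        # Prefetch the next items named by the schedule.
        for upcoming in schedule[step + 1 : step + 1 + lookahead]:
            buffer.setdefault(upcoming, off_chip[upcoming])
    return hits

off_chip = {f"kernel{i}": bytes(4) for i in range(1, 5)}
schedule = ["kernel1", "kernel2", "kernel3", "kernel4"]
hits_with_prefetch = simulate_fetches(schedule, off_chip, lookahead=1)
hits_without = simulate_fetches(schedule, off_chip, lookahead=0)
```

In this toy run only the very first access misses when one item is prefetched ahead, whereas every access misses without prefetching, which is the latency effect the paragraph above describes.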

The NPU 100 may output N results (for example, heterogeneous inference results) through N output units.

FIG. 19 is an exemplary diagram illustrating a system including an NPU architecture according to a fourth example, and FIG. 20 shows an example in which the fusion artificial neural network shown in FIG. 13 is divided into threads according to the fourth example shown in FIG. 19.

As can be seen from FIG. 19, the NPU 100 may include a PE array 110 for a fusion artificial neural network, an on-chip memory 120, an NPU scheduler 130, and a special function unit (SFU) 160.

The NPU 100 may output N pieces of heterogeneous data (for example, heterogeneous inference results). The heterogeneous data output from the NPU 100 may be classification, semantic segmentation, object detection, prediction, or the like.

The PE array 110 may process multiple threads. As shown in FIG. 20, RGB image data obtained from a camera may be processed by thread #1, the transformation may be processed by thread #2, and data obtained from a LiDAR may be processed by thread #3.

To this end, the compiler 200 may analyze the artificial neural network model and divide it into threads based on its parallel operation flows.
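One simple way to derive threads from parallel operation flows is to start a new thread at every input branch and let each later layer inherit the thread of its first input. The sketch below illustrates only this heuristic; it is an assumption, not the partitioning method of the compiler 200, and the graph, layer names, and `assign_threads` function are hypothetical. It does not handle a layer whose output fans out to several branches.

```python
def assign_threads(graph):
    """Assign a thread id to each layer of a model graph.

    graph maps a layer name to the list of its input layers and must be
    given in topological order (inputs before consumers).
    """
    thread_of = {}
    next_id = 1
    for node, inputs in graph.items():
        if not inputs:
            # A layer with no predecessor starts a new parallel flow.
            thread_of[node] = next_id
            next_id += 1
        else:
            # Otherwise continue on the thread of the first input.
            thread_of[node] = thread_of[inputs[0]]
    return thread_of

graph = {
    "camera_conv1": [],
    "lidar_conv1": [],
    "camera_conv2": ["camera_conv1"],
    "lidar_conv2": ["lidar_conv1"],
    "fusion": ["camera_conv2", "lidar_conv2"],
    "head": ["fusion"],
}
threads = assign_threads(graph)
```

In this example the camera branch and the LiDAR branch receive different thread ids, and the fusion layer and everything after it continue on the camera branch's thread.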

The PE array 110 of the NPU 100 may improve computational efficiency by using multiple threads to process those layers of the fusion artificial neural network that can be computed in parallel.

The PE array 110 of the NPU 100 may include preset threads.

The NPU 100 may control each thread in the PE array 110 so that it can communicate with the on-chip memory 120.

The NPU 100 may selectively allocate space within the on-chip memory 120 to each thread.

The NPU 100 may allocate an appropriate amount of the on-chip memory 120 to each thread. The allocation of the on-chip memory 120 may be determined by the control unit based on the artificial neural network data locality information of the fusion artificial neural network.

The NPU 100 may set up threads in the PE array 110 based on the fusion artificial neural network.

FIG. 21 is an exemplary diagram illustrating a system including an NPU architecture according to a fifth example, and FIG. 22 is an exemplary diagram illustrating a first example of the pipeline structure of the SFU shown in FIG. 21.

As can be seen from FIG. 21, the NPU 100 may include a PE array 110 for a fusion artificial neural network, an on-chip memory 120, an NPU scheduler 130, and a special function unit (SFU) 160.

The NPU 100 may output N pieces of heterogeneous data (for example, heterogeneous inference results). The heterogeneous data output from the NPU 100 may be classification, semantic segmentation, object detection, prediction, or the like.

As shown in FIG. 22, the SFU 160 includes several functional units. Each functional unit may be operated selectively. Each functional unit may be selectively turned on or turned off. That is, each functional unit is configurable.

In other words, the SFU 160 may include the various functional units required for fusion artificial neural network inference.

For example, the functional units of the SFU 160 may include a functional unit for the skip-connection operation, a functional unit for the activation function operation, a functional unit for the pooling operation, a functional unit for the quantization operation, a functional unit for the non-maximum suppression (NMS) operation, a functional unit for the integer to floating-point conversion (INT to FP32) operation, a functional unit for the batch-normalization operation, a functional unit for the interpolation operation, a functional unit for the concatenation operation, and a functional unit for the bias operation.

The functional units of the SFU 160 may be selectively turned on or turned off according to the artificial neural network data locality information. The artificial neural network data locality information may include control information related to the turn-on or turn-off of the corresponding functional unit when the operation for a specific layer is performed.
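The per-layer on/off control can be pictured as a bit mask with one bit per functional unit. The sketch below is an illustrative model only: the `SfuUnit` flag names mirror the units listed above, but the encoding, the `LAYER_SFU_CONFIG` table, and its layer names are assumptions and do not describe the actual format of the data locality information.

```python
from enum import Flag, auto

class SfuUnit(Flag):
    """One bit per SFU functional unit named in the description."""
    SKIP_CONNECTION = auto()
    ACTIVATION = auto()
    POOLING = auto()
    QUANTIZATION = auto()
    NMS = auto()
    INT_TO_FP32 = auto()
    BATCH_NORM = auto()
    INTERPOLATION = auto()
    CONCATENATION = auto()
    BIAS = auto()

# Hypothetical per-layer control information (which units are turned on).
LAYER_SFU_CONFIG = {
    "conv1": SfuUnit.BIAS | SfuUnit.ACTIVATION,
    "fusion": SfuUnit.CONCATENATION | SfuUnit.SKIP_CONNECTION,
}

def units_to_turn_off(layer):
    """Names of the units that can be gated while `layer` is processed."""
    enabled = LAYER_SFU_CONFIG.get(layer, SfuUnit(0))
    return [unit.name for unit in SfuUnit if unit not in enabled]

gated_for_fusion = units_to_turn_off("fusion")
```

For the hypothetical `fusion` layer, eight of the ten units come back as candidates for gating, which is where the power saving discussed for the SFU would originate.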

FIG. 23A is an exemplary diagram illustrating one example of the SFU shown in FIG. 21, and FIG. 23B is an exemplary diagram illustrating another example of the SFU shown in FIG. 21.

As shown in FIGS. 23A and 23B, the activated units among the functional units of the SFU 160 may be turned on.

Specifically, as shown in FIG. 23A, the SFU 160 may selectively activate the skip-connection operation and the concatenation operation. By way of example, each activated functional unit may be indicated by hatching.

For example, the SFU 160 may concatenate heterogeneous sensor data for a fusion operation. For example, for the skip-connection operation of the SFU 160, the control unit may control the on-chip memory 120 and the SFU 160.

Specifically, as shown in FIG. 23B, the quantization operation and the bias operation may be selectively activated. For example, in order to reduce the size of the feature map data output from the PE array 110, the quantization functional unit of the SFU 160 may receive the feature map output from the PE array 110 and quantize it to a specific bit width. The quantized feature map may then be stored in the on-chip memory 120. This series of operations may be performed sequentially under the control unit, and the NPU scheduler 130 may be configured to control the order of the operations.
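The size reduction obtained by quantizing a feature map before it is stored can be shown with a short NumPy sketch. It assumes symmetric per-tensor quantization from 32-bit floating point to 8-bit integers; the disclosure only says "a specific bit width", so the scheme and the function name `quantize_int8` are illustrative assumptions.

```python
import numpy as np

def quantize_int8(fmap):
    """Symmetric per-tensor quantization of a float feature map to int8."""
    qmax = 127
    max_abs = float(np.max(np.abs(fmap)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(fmap / scale), -qmax, qmax).astype(np.int8)
    return q, scale

# Assumed float32 feature map leaving the PE array.
fmap = np.linspace(-1.0, 1.0, 16, dtype=np.float32).reshape(4, 4)
q, scale = quantize_int8(fmap)

# Approximate reconstruction, as a later layer might use it.
restored = q.astype(np.float32) * scale
```

The int8 tensor occupies one quarter of the bytes of the float32 original, at the cost of a rounding error of at most about half a quantization step per element.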

When some functional units of the SFU 160 are selectively turned off in this way, the power consumption of the NPU 100 can be reduced. Power gating may be used to turn off some functional units. Alternatively, clock gating may be performed to turn off some functional units.

FIG. 24 is an exemplary diagram illustrating a system including an NPU architecture according to a sixth example.

As shown in FIG. 24, an NPU batch mode may be applied.

The NPU 100 may include a PE array 110 for a fusion artificial neural network, an on-chip memory 120, an NPU scheduler 130, and a special function unit (SFU) 160.

The batch mode disclosed in this example means a mode configured to achieve low power by processing a plurality of identical sensors sequentially with one artificial neural network model, so that the weights of that one model are reused as many times as there are identical sensors.

For batch mode operation, the control unit of the NPU 100 may be configured to control the NPU scheduler 130 so that the weights stored in the on-chip memory are reused as many times as the number of sensors input to each batch channel. That is, by way of example, the NPU 100 may be configured to operate in batch mode with M sensors. In this case, the batch mode operation of the NPU 100 may be configured to run a fusion artificial neural network model.
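The benefit of weight reuse can be made concrete by counting how often each layer's weights have to be loaded. The sketch below compares a layer-by-layer batch schedule with a sensor-by-sensor schedule. It is an assumption-laden illustration: the fully connected layers, the ReLU, and the function names `run_batch_mode` and `run_per_sensor` are hypothetical, and the sensor-by-sensor case assumes the weights must be reloaded for every sensor.

```python
import numpy as np

def run_batch_mode(weights, sensor_inputs):
    """Layer-major order: each layer's weights are loaded once for all sensors."""
    weight_loads = 0
    activations = list(sensor_inputs)
    for w in weights:
        weight_loads += 1  # one load of this layer's weights into on-chip memory
        activations = [np.maximum(x @ w, 0.0) for x in activations]
    return activations, weight_loads

def run_per_sensor(weights, sensor_inputs):
    """Sensor-major order: the weights are reloaded for every sensor."""
    weight_loads = 0
    outputs = []
    for x in sensor_inputs:
        for w in weights:
            weight_loads += 1
            x = np.maximum(x @ w, 0.0)
        outputs.append(x)
    return outputs, weight_loads

rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 8)) for _ in range(3)]  # 3 layers
cameras = [rng.standard_normal((1, 8)) for _ in range(6)]  # M = 6 identical sensors

batch_out, batch_loads = run_batch_mode(weights, cameras)
single_out, single_loads = run_per_sensor(weights, cameras)
```

Both schedules produce the same outputs, but with three layers and six sensors the batch schedule performs 3 weight loads instead of 18, so each loaded weight is reused M times.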

퓨전-인공신경망의 동작을 위해서 NPU(100)는 퓨전을 위한 복수의 배치 채널들(BATCH CH#1, BATCH CH#2)을 가지도록 구성될 수 있다. 각각의 배치 채널은 동일한 복수의 센서들을 포함하도록 구성될 수 있다. 제1 배치 채널(BATCH CH#1)은 복수의 제1 센서들로 구성될 수 있다. 이때, 제1 센서들은 M개일 수 있다. 제K 배치 채널(BATCH CH#K)은 복수의 제2 센서들로 구성될 수 있다. 이때, 제2 센서들은 M개일 수 있다.For the operation of the fusion-artificial neural network, the NPU 100 may be configured to have a plurality of deployment channels (BATCH CH#1, BATCH CH#2) for fusion. Each deployment channel may be configured to include the same plurality of sensors. The first batch channel BATCH CH#1 may include a plurality of first sensors. In this case, the number of first sensors may be M. The K-th batch channel BATCH CH#K may include a plurality of second sensors. In this case, the number of second sensors may be M.

The NPU 100 may process inputs from the sensors 311 and 312 through the first batch channel while reusing the corresponding weights in the on-chip memory 120. Likewise, the NPU 100 may process inputs from the sensors 321 and 322 through the second batch channel while reusing the corresponding weights in the on-chip memory 120.

In this way, the NPU 100 may receive inputs from multiple sensors through the plurality of batch channels, reuse the weights, and process the fusion-artificial neural network in batch mode. The sensor of at least one of the plurality of batch channels may differ from the sensor of at least one other channel.
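The weight reuse described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the `OnChipMemory` class, the `run_batch_channel` function, the layer names, and the toy dot-product "network" are all hypothetical, chosen only to show that each batch channel fetches its weights once and then reuses them for every sensor in the channel.

```python
# Illustrative sketch only: all names here are hypothetical, not from the patent.

class OnChipMemory:
    """Holds one set of weights at a time and counts off-chip fetches."""

    def __init__(self, off_chip_weights):
        self.off_chip_weights = off_chip_weights  # stands in for off-chip memory
        self.current_layer = None
        self.weights = None
        self.fetch_count = 0

    def load(self, layer):
        # Fetch from off-chip memory only when the weights are not resident.
        if layer != self.current_layer:
            self.weights = self.off_chip_weights[layer]
            self.current_layer = layer
            self.fetch_count += 1
        return self.weights


def run_batch_channel(memory, layer, sensor_inputs):
    """Process the M sensor inputs of one batch channel sequentially."""
    weights = memory.load(layer)  # fetched once per channel
    outputs = []
    for x in sensor_inputs:       # the same weights are reused for each sensor
        outputs.append(sum(w * v for w, v in zip(weights, x)))
    return outputs


memory = OnChipMemory({"camera_net": [1, 2, 3], "lidar_net": [2, 0, 1]})
ch1 = run_batch_channel(memory, "camera_net", [[1, 0, 0], [0, 1, 0], [1, 1, 1]])
ch2 = run_batch_channel(memory, "lidar_net", [[1, 1, 1], [0, 0, 2], [3, 0, 0]])
```

In this toy run, six sensor inputs across two channels are processed with only two weight fetches, which is the effect the batch mode aims for.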

The on-chip memory 120 in the NPU 100 may be configured to have storage space corresponding to the plurality of batch channels.

The NPU scheduler 130 in the NPU 100 may operate the PE array 110 according to the batch mode.

The SFU 160 in the NPU 100 may provide a special function for processing at least one fusion operation.

The NPU 100 may deliver each output through the plurality of batch channels.

At least one of the plurality of batch channels may correspond to inference data of the fusion artificial neural network.

FIG. 25 is an exemplary diagram illustrating the use of a plurality of NPUs according to the seventh example, and FIG. 26 is an exemplary diagram illustrating processing of the fusion artificial neural network shown in FIG. 13 by the plurality of NPUs shown in FIG. 25.

As shown in FIG. 25, a plurality of NPUs, for example M NPUs, may be used for autonomous driving. Among the M NPUs, the first NPU 100-1 may process data provided from, for example, sensor #1 311, and the M-th NPU 100-M may process data provided from, for example, sensor #M 312. The plurality of NPUs (e.g., 100-1, 100-2) may access the off-chip memory 500 through the ADC/DMA (direct memory access) 400.

The plurality of NPUs (e.g., 100-1, 100-2) may obtain data locality information of the fusion artificial neural network from the compiler 200.

Each NPU may process the fusion artificial neural network and may transfer operations for fusion to other NPUs through the ADC/DMA 400.

The ADC/DMA 400 may obtain data locality information for the fusion artificial neural network from the compiler 200.

The compiler 200 may divide the artificial neural network data locality information into data locality information #1 through data locality information #M, so that the operations that should be processed in parallel, among the operations defined by the artificial neural network data locality information, can be processed by the respective NPUs.
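As a rough illustration of this split, the following Python sketch partitions one ordered operation list into per-NPU lists while preserving the original order within each NPU. The operation names, the `branch` tags, and the `branch_to_npu` mapping are assumptions made for the example; the patent does not specify the format of the data locality information.

```python
# Illustrative sketch only: the data format and all names are hypothetical.

def split_locality(operations, branch_to_npu):
    """Split an ordered operation list into per-NPU ordered lists."""
    per_npu = {npu: [] for npu in sorted(set(branch_to_npu.values()))}
    for op in operations:
        # Each operation goes to the NPU assigned to its branch; the relative
        # order of operations within an NPU is preserved.
        per_npu[branch_to_npu[op["branch"]]].append(op["name"])
    return per_npu


# Whole-network locality information: the order in which operations occur.
operations = [
    {"name": "cam_conv1", "branch": "camera"},
    {"name": "lidar_conv1", "branch": "lidar"},
    {"name": "cam_conv2", "branch": "camera"},
    {"name": "lidar_conv2", "branch": "lidar"},
    {"name": "fusion_concat", "branch": "fusion"},
]

# Assumed assignment: NPU #1 takes the camera branch; NPU #2 takes the
# LiDAR branch and the fusion step.
branch_to_npu = {"camera": 1, "lidar": 2, "fusion": 2}

locality = split_locality(operations, branch_to_npu)
```

Here `locality[1]` plays the role of data locality information #1 and `locality[2]` that of data locality information #2; the camera and LiDAR branches can then run in parallel on separate NPUs.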

The off-chip memory 500 may store data that can be shared by the plurality of NPUs and may deliver the data to each NPU.

As shown in FIG. 26, NPU #1 may be responsible for a first artificial neural network that processes data provided from the camera, and NPU #2 may be responsible for a second artificial neural network that processes data provided from the LiDAR. In addition, NPU #2 may be responsible for the transformation required for fusion of the first artificial neural network and the second artificial neural network.

FIGS. 27A to 27C show an example application of a fusion artificial neural network using a near-infrared sensor and a camera.

As shown in FIG. 27A, a vehicle's ordinary headlights are generally installed to emit visible light at an angle below the horizon. The inventor of the present patent, however, proposes additionally installing a light source that emits near-infrared (NIR) light in the forward direction and installing an NIR sensor in the vehicle.

An ordinary camera generally captures RGB images at wavelengths of 380 nm to 680 nm. In contrast, the NIR sensor can capture images at wavelengths of 850 nm to 940 nm.

Adding the NIR light source and the NIR sensor in this way makes it possible to obtain high-quality images at night without obstructing the view of drivers of oncoming vehicles.

The NIR sensor may be synchronized with the corresponding NIR light source and driven by pulse width modulation (PWM). Accordingly, power consumption and the signal-to-noise ratio (SNR) can be improved.

Meanwhile, the NIR light source may be turned on or off every frame. As shown in FIG. 27B, when the NIR light source is turned on and off, signs having retroreflective characteristics can be distinguished within the overall image. FIG. 27C shows the characteristics of retroreflection.

By turning the NIR light source on and off in this way, signs having retroreflective characteristics can be identified. More specifically, when the NIR light source and the NIR sensor are adjacent to each other, the light from the NIR light source reflected by a retroreflector may be detected as 300 times or more brighter than the light reflected by an ordinary object. Retroreflective objects can therefore be detected by toggling the light source on and off.
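One simple way to picture the on/off detection is frame differencing: a pixel that is much brighter when the NIR light source is on than when it is off is likely a retroreflective surface. The Python sketch below is only an illustration under that assumption; the `retroreflective_mask` function, the pixel values, and the threshold are hypothetical, and in the disclosure the discrimination is learned by the fusion artificial neural network rather than set by a fixed threshold.

```python
# Illustrative sketch only: function name, pixel values, and threshold are hypothetical.

def retroreflective_mask(frame_on, frame_off, threshold):
    """Mark pixels whose brightness rises sharply when the NIR source is on."""
    return [
        [1 if (on - off) >= threshold else 0 for on, off in zip(row_on, row_off)]
        for row_on, row_off in zip(frame_on, frame_off)
    ]


# NIR source off: only ambient light. The 200 is a self-luminous source that
# appears in both frames and therefore cancels in the difference.
frame_off = [
    [10, 12, 11],
    [200, 10, 9],
]

# NIR source on: two pixels return strongly, as a retroreflective sign would.
frame_on = [
    [12, 250, 13],
    [201, 12, 240],
]

mask = retroreflective_mask(frame_on, frame_off, threshold=100)
```

The bright pixel that is present in both frames is not flagged, while the two pixels that respond only to the NIR illumination are.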

The NIR sensor can detect NIR reflected light but does not detect ordinary light such as traffic signal lights, so the fusion artificial neural network can be trained to distinguish illumination from NIR reflected light.

As described above, combining RGB images and NIR images can enable autonomous driving at night.

This application can be extended in other ways.

For example, an NIR light source may be additionally installed in the vehicle headlamps, together with a camera that includes an image sensor capable of detecting both visible light at 380 nm to 680 nm and near-infrared light at 850 nm to 940 nm. The fusion artificial neural network can distinguish vehicles approaching from the front and rear, traffic lights, obstacles, road surface conditions, and pedestrians in the images.

As another example, an NIR light source and an NIR sensor may be installed inside the vehicle to capture the cabin at night. For example, a plurality of NIR light sources may be installed at different, optimally chosen positions to capture the state of the driver and passengers. This makes it possible to monitor, for example, the health status of the driver and passengers.

FIG. 28 shows an example of using a polarizer according to the eighth example, and FIGS. 29A and 29B are exemplary diagrams showing the performance of the polarizer.

As shown in FIG. 28, a polarizer is additionally connected to the image sensor #1 311, and the output from the polarizer is input to the NPU #1 100.

Adding a polarizer to the image sensor #1 311 can reduce sunlight reflections. As shown in FIGS. 29A and 29B, using a polarizer can filter out light reflected from vehicle paint, glass, water, direct light, and the like. However, using a polarizer may darken the image by about 25%. Accordingly, the artificial neural network run by the NPU 100 may be trained to compensate for the brightness lost to the polarizer.

In various examples of the present disclosure, the PE array 110 may be configured as an inference-only PE array in order to improve AI computation speed and minimize power consumption. The inference-only PE array may be configured to exclude the training function of an artificial neural network. That is, the inference-only PE array may be configured to exclude floating-point arithmetic units. Separate training-dedicated hardware may therefore be provided for training the artificial neural network. For example, the PE array 110 according to various examples of the present disclosure may be configured as an inference-only PE array that processes only 8-bit integers. With this configuration, the PE array 110 can significantly reduce power consumption compared with floating-point processing. In this case, the SFU 160 may be configured to use a functional unit for integer-to-floating-point conversion (INT to FP32) for certain special functions that require floating-point arithmetic.

That is, according to some examples, the PE array 110 may be configured to perform only integer operations, while the SFU 160 may be configured to perform floating-point operations.

That is, according to some examples, for efficient use of the on-chip memory 120, the control unit of the NPU 100 may ensure that all data stored from the SFU 160 into the on-chip memory 120 are integers.
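The division of labor between an integer-only PE array and a floating-point-capable SFU can be sketched as follows. This is a hedged illustration, not the disclosed hardware: the function names, the scale factors, and the choice of a sigmoid as the special function are assumptions, and Python floats (64-bit) stand in for FP32.

```python
# Illustrative sketch only: names, scales, and the sigmoid choice are hypothetical.
import math


def pe_mac_int8(weights, activations):
    """Integer-only multiply-accumulate, as an inference-only PE might do."""
    assert all(-128 <= v <= 127 for v in weights + activations)
    return sum(w * a for w, a in zip(weights, activations))  # integer accumulator


def sfu_sigmoid_int8(acc, in_scale, out_scale):
    """Convert to floating point, apply a special function, requantize to int8."""
    x = acc * in_scale                 # integer -> float (stand-in for INT to FP32)
    y = 1.0 / (1.0 + math.exp(-x))     # the part that needs floating point
    q = round(y / out_scale)           # float -> integer before storing on chip
    return max(-128, min(127, q))      # clamp to the int8 range


acc = pe_mac_int8([2, -1, 3], [4, 5, -2])
stored = sfu_sigmoid_int8(acc, in_scale=0.5, out_scale=1.0 / 127)
```

Only `stored`, an 8-bit integer, would be written back to the on-chip memory in this sketch; the floating-point values exist only inside the SFU step.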

<Brief summary of the disclosures of the present specification>

According to one disclosure of the present specification, a neural processing unit (NPU) is provided. The NPU may include: a control unit configured to receive machine code of a compiled fusion-artificial neural network; an input unit (or sensor) configured to receive a plurality of input signals corresponding to the fusion-artificial neural network; a processing element (PE) array configured to perform operations of the fusion-artificial neural network; a special function unit (SFU) configured to perform a special function of the fusion-artificial neural network operations; and an on-chip memory configured to store operation data of the fusion-artificial neural network. The control unit may include a scheduler configured to control the PE array, the SFU, and the on-chip memory so that all operations of the fusion-artificial neural network are processed in a preset order according to artificial neural network data locality information included in the machine code.

The input unit may be configured to receive signals sensed by at least one of a camera, a polarization camera, a 3D camera, a near-infrared camera, a thermal imaging camera, a radar, a LiDAR, and an ultrasonic sensor.

The fusion-artificial neural network may be a fusion-artificial neural network trained to perform inference for smart cruise control, an automatic emergency braking system, a parking steering assistance system, a lane departure warning system, a lane keeping assist system, drowsiness detection, alcohol detection, heat and cold detection, inattention detection, and the like.

The input unit may be configured to receive heterogeneous data.

The SFU may perform at least one of a skip-connection function and a concatenation function for artificial neural network fusion.

Based on the artificial neural network data locality information, the scheduler may control the on-chip memory so that specific data stored in the on-chip memory is protected until a specific operation step of the fusion-artificial neural network.
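A minimal sketch of this kind of protection, assuming the data locality information can be read as an ordered list of operations with their inputs and outputs: each tensor is kept in on-chip memory until the last step that reads it and is released afterwards. The schedule format and function names below are hypothetical and are not taken from the patent.

```python
# Illustrative sketch only: the schedule format and names are hypothetical.

def last_use_steps(schedule):
    """Map each tensor to the index of the last operation that reads it."""
    last = {}
    for step, (_, inputs, _) in enumerate(schedule):
        for tensor in inputs:
            last[tensor] = step
    return last


def simulate_on_chip_memory(schedule):
    """Return the tensors resident in on-chip memory after each step."""
    last = last_use_steps(schedule)
    resident = set()
    history = []
    for step, (_, inputs, output) in enumerate(schedule):
        resident.update(inputs)
        resident.add(output)
        # Release only tensors whose final consumer has just run; everything
        # else stays protected for a later step.
        resident -= {t for t in resident if last.get(t) == step}
        history.append(sorted(resident))
    return history


# (operation, inputs, output); "a" feeds a skip connection three steps later.
schedule = [
    ("conv1", ["image"], "a"),
    ("conv2", ["a"], "b"),
    ("conv3", ["b"], "c"),
    ("skip_add", ["a", "c"], "d"),
]

history = simulate_on_chip_memory(schedule)
```

In this run, tensor `a` stays resident through `conv2` and `conv3` because the schedule shows that `skip_add` still needs it, whereas `b` is released as soon as `conv3` has consumed it.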

The NPU may further include an output unit configured to output at least one inference result of the fusion-artificial neural network, which is trained to process, by means of the PE array, at least one inference operation among classification, semantic segmentation, object detection, and prediction.

The PE array may further include a plurality of threads. The control unit may control the plurality of threads to process parallel sections of the fusion-artificial neural network based on the artificial neural network data locality.

According to another disclosure of the present specification, a neural processing unit (NPU) is provided. The NPU may include: a control unit configured to receive machine code of a fusion-artificial neural network; a processing element (PE) array configured to perform operations of the fusion-artificial neural network based on the machine code; and a special function unit (SFU) configured to receive convolution results processed by the PE array and compute a corresponding special function. The SFU may include a plurality of functional units. The SFU may selectively control at least some of the plurality of functional units according to artificial neural network data locality information included in the machine code.

The plurality of functional units may be connected in a pipeline structure.

The plurality of functional units may be selectively activated by the control unit.

The plurality of functional units may be selectively deactivated by the control unit.

Each of the plurality of functional units may be selectively clock-gated and/or power-gated by the control unit at each specific operation step.
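The selectively activated pipeline can be pictured with the short Python sketch below. It is only an analogy in software: the `FunctionalUnit` class, the three example units, and the `run_sfu` function are hypothetical, and skipping a unit here merely stands in for clock gating or power gating in hardware.

```python
# Illustrative sketch only: unit names and functions are hypothetical.

class FunctionalUnit:
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn
        self.active_steps = 0  # how many steps this unit was actually used


def run_sfu(units, enabled, value):
    """Pass a value through the pipeline, bypassing units that are not enabled."""
    for unit in units:
        if unit.name in enabled:
            value = unit.fn(value)
            unit.active_steps += 1
        # A disabled unit is bypassed; in hardware it could be gated off.
    return value


units = [
    FunctionalUnit("scale", lambda x: x * 2),
    FunctionalUnit("bias", lambda x: x + 1),
    FunctionalUnit("relu", lambda x: max(0, x)),
]

# Step 1 of an assumed schedule needs only bias and ReLU.
out1 = run_sfu(units, {"bias", "relu"}, -5)
# Step 2 needs only scale and bias.
out2 = run_sfu(units, {"scale", "bias"}, 3)
usage = {u.name: u.active_steps for u in units}
```

Which units are enabled at each step would, under the disclosure, follow from the data locality information in the machine code; here the enabled sets are simply passed in by hand.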

The NPU may further include: an input unit configured to receive a plurality of input signals corresponding to the fusion-artificial neural network; and an on-chip memory configured to store operation data of the fusion-artificial neural network.

The NPU may further include: a batch input unit configured to receive, in batch mode, a plurality of input signals corresponding to the fusion-artificial neural network; an on-chip memory configured to store operation data of the fusion-artificial neural network in batch mode; and an output unit configured to output an inference result obtained by processing, in batch mode by the PE array, at least one inference operation among classification, semantic segmentation, object detection, and prediction.

According to yet another disclosure of the present specification, a system is provided. The system may include: at least one neural processing unit (NPU) including a control unit configured to receive machine code of a fusion-artificial neural network, an input unit configured to receive at least one input signal, a processing element array configured to perform convolution operations, and an on-chip memory configured to store results of the convolution operations; and a memory controller including a memory. The memory controller is configured to receive artificial neural network data locality information of the fusion-artificial neural network, from which successive memory operation requests of the at least one NPU can be predicted, and the memory is configured to cache in advance, based on the artificial neural network data locality information, the next memory operation request that the corresponding NPU will issue.
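The idea of caching the next request in advance can be sketched in Python as below. This is an illustrative model under simple assumptions, not the disclosed memory controller: the `PrefetchingMemoryController` class, the one-entry cache, and the address names are hypothetical, and the data locality information is reduced to a known list of addresses in the order the NPU will request them.

```python
# Illustrative sketch only: class name, cache size, and addresses are hypothetical.

class PrefetchingMemoryController:
    def __init__(self, backing_store, access_order):
        self.store = backing_store      # stands in for main memory
        self.order = list(access_order)  # request order known from locality info
        self.position = 0
        self.cache = {}
        self.hits = 0
        self.misses = 0
        self._prefetch()

    def _prefetch(self):
        # Cache the data for the request expected next, if there is one.
        if self.position < len(self.order):
            addr = self.order[self.position]
            self.cache = {addr: self.store[addr]}
        else:
            self.cache = {}

    def read(self, addr):
        if addr in self.cache:
            self.hits += 1
            data = self.cache[addr]
        else:
            self.misses += 1
            data = self.store[addr]
        # Advance only when the request matches the predicted one.
        if self.position < len(self.order) and self.order[self.position] == addr:
            self.position += 1
        self._prefetch()
        return data


store = {"w1": 10, "w2": 20, "w3": 30}
controller = PrefetchingMemoryController(store, ["w1", "w2", "w3"])
values = [controller.read(a) for a in ["w1", "w2", "w3"]]

# A request that departs from the predicted order falls back to a normal read.
other = PrefetchingMemoryController(store, ["w1", "w2", "w3"])
late_value = other.read("w3")
```

When the NPU follows the predicted order, every read is served from the prefetched entry; an out-of-order read still returns correct data but counts as a miss.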

There may be a plurality of NPUs, and the machine code of the fusion-artificial neural network input to the control unit of each NPU may be processed in parallel by the plurality of NPUs.

There may be a plurality of NPUs, and the memory controller may directly control parallel processing by the plurality of NPUs.

There may be a plurality of NPUs, and the machine code may be compiled for parallel processing by the plurality of NPUs.

The system may further include an infrared light source and a visible light source. The input unit may be configured to receive a visible light image and an infrared image. The machine code may be configured to fuse the visible light image and the infrared image.

The infrared light source may be configured to be PWM driven, and the infrared image may be synchronized with the infrared light source.

The irradiation angle of the infrared light source and the irradiation angle of the visible light source may partially overlap each other.

The examples of the present disclosure presented in this specification and the drawings are merely specific examples provided to explain the technical content of the present disclosure and to aid understanding, and are not intended to limit the scope of the present invention. It will be apparent to those of ordinary skill in the art to which the present disclosure pertains that, in addition to the examples presented here, other modifications based on the technical idea of the invention can be implemented.

100: neural processing unit (NPU)
110: processing element (PE)
120: NPU internal memory
130: NPU scheduler
140: NPU interface
150: kernel generator

Claims

1. A neural processing unit comprising:
a control unit configured to receive machine code of a compiled fusion-artificial neural network;
an input unit configured to receive a plurality of input signals corresponding to the fusion-artificial neural network;
a processing element array configured to perform operations of the fusion-artificial neural network;
a special function unit configured to perform a special function of the fusion-artificial neural network operations; and
an on-chip memory configured to store operation data of the fusion-artificial neural network,
wherein the control unit comprises a scheduler configured to control the processing element array, the special function unit, and the on-chip memory so that all operations of the fusion-artificial neural network are processed in a preset order according to artificial neural network data locality information included in the machine code.

2. The neural processing unit of claim 1,
wherein the input unit is configured to receive signals sensed by at least one of a camera, a polarization camera, a 3D camera, a near-infrared camera, a thermal imaging camera, a radar, a LiDAR, and an ultrasonic sensor.

3. The neural processing unit of claim 1,
wherein the fusion-artificial neural network is a fusion-artificial neural network trained to perform inference for smart cruise control, an automatic emergency braking system, a parking steering assistance system, a lane departure warning system, a lane keeping assist system, drowsiness detection, alcohol detection, heat and cold detection, inattention detection, and the like.

4. The neural processing unit of claim 1,
wherein the input unit is configured to receive heterogeneous data.

5. The neural processing unit of claim 1,
wherein the special function unit is further configured to perform at least one of a skip-connection function and a concatenation function for artificial neural network fusion.

6. The neural processing unit of claim 1,
wherein the scheduler is configured to control the on-chip memory, based on the artificial neural network data locality information, so that specific data stored in the on-chip memory is protected until a specific operation step of the fusion-artificial neural network.

7. The neural processing unit of claim 1,
further comprising an output unit configured to output at least one inference result of the fusion-artificial neural network, which is trained to process, by means of the processing element array, at least one inference operation among classification, semantic segmentation, object detection, and prediction.

8. The neural processing unit of claim 1,
wherein the processing element array further comprises a plurality of threads, and
wherein the control unit is configured to control the plurality of threads to process parallel sections of the fusion-artificial neural network based on the artificial neural network data locality.

9. A neural processing unit comprising:
a control unit configured to receive machine code of a fusion-artificial neural network;
a processing element array configured to perform operations of the fusion-artificial neural network based on the machine code; and
a special function unit configured to receive convolution results processed by the processing element array and compute a corresponding special function,
wherein the special function unit comprises a plurality of functional units, and
wherein the special function unit is configured to selectively control at least some of the plurality of functional units according to artificial neural network data locality information included in the machine code.

10. The neural processing unit of claim 9,
wherein the plurality of functional units are connected in a pipeline structure.

11. The neural processing unit of claim 9,
wherein the plurality of functional units are configured to be selectively activated by the control unit.

12. The neural processing unit of claim 9,
wherein the plurality of functional units are configured to be selectively deactivated by the control unit.

13. The neural processing unit of claim 9,
wherein each of the plurality of functional units is configured to be selectively clock-gated and/or power-gated by the control unit at each specific operation step.

14. The neural processing unit of claim 9, further comprising:
an input unit configured to receive a plurality of input signals corresponding to the fusion-artificial neural network; and
an on-chip memory configured to store operation data of the fusion-artificial neural network.

15. The neural processing unit of claim 9, further comprising:
a batch input unit configured to receive, in batch mode, a plurality of input signals corresponding to the fusion-artificial neural network;
an on-chip memory configured to store operation data of the fusion-artificial neural network in batch mode; and
an output unit configured to output an inference result obtained by processing, in batch mode by the processing element array, at least one inference operation among classification, semantic segmentation, object detection, and prediction.

16. A system comprising:
at least one neural processing unit comprising a control unit configured to receive machine code of a fusion-artificial neural network, an input unit configured to receive at least one input signal, a processing element array configured to perform convolution operations, and an on-chip memory configured to store results of the convolution operations; and
a memory controller comprising a memory, the memory controller being configured to receive artificial neural network data locality information of the fusion-artificial neural network, from which successive memory operation requests of the at least one neural processing unit can be predicted, and the memory being configured to cache in advance, based on the artificial neural network data locality information, the next memory operation request that the corresponding neural processing unit will issue.

17. The system of claim 16,
wherein there are a plurality of neural processing units, and the machine code of the fusion-artificial neural network input to the control unit of each neural processing unit is configured to be processed in parallel by the plurality of neural processing units.

18. The system of claim 16,
wherein there are a plurality of neural processing units, and
wherein the memory controller is configured to directly control parallel processing by the plurality of neural processing units.

19. The system of claim 16,
wherein there are a plurality of neural processing units, and
wherein the machine code is compiled for parallel processing by the plurality of neural processing units.

20. The system of claim 16,
further comprising an infrared light source and a visible light source,
wherein the input unit is configured to receive a visible light image and an infrared image, and
wherein the machine code is machine code of a fusion-artificial neural network configured to fuse the visible light image and the infrared image.

21. The system of claim 20,
wherein the infrared light source is configured to be PWM driven, and
wherein the infrared image is synchronized with the infrared light source.

22. The system of claim 20,
wherein the irradiation angle of the infrared light source and the irradiation angle of the visible light source are configured to partially overlap each other.