KR20200063958A

KR20200063958A - Convolutional operation device with dimension conversion

Info

Publication number: KR20200063958A
Application number: KR1020190054411A
Authority: KR
Inventors: 석정희; 여준기
Original assignee: 한국전자통신연구원
Priority date: 2018-11-28
Filing date: 2019-05-09
Publication date: 2020-06-05

Abstract

Provided is a convolutional operation device enabling dimensional conversion, capable of maximizing parallel operation efficiency by increasing a utilization rate of processing elements (PEs). According to the present invention, the convolutional operation device that performs convolutional neural network (CNN) processing includes: an input sharing network including first input feature map (IFM) registers and second IFM registers, which are arranged in rows and columns and configured to sequentially shift each IFM inputted in row units in a row direction or a column direction to output the shifted IFM; a first multiply-accumulator (MAC) array connected to the first IFM registers; an IFM switching network configured to select one of the first IFM registers and the second IFM registers; a second MAC array connected to one of the first IFM registers and the second IFM registers, which is selected by the IFM switching network; and an output shift network configured to shift an output feature map output from the first MAC array and the second MAC array to transmit the shifted output feature map to an output memory.

Description

CONVOLUTIONAL OPERATION DEVICE WITH DIMENSION CONVERSION

The present invention relates to a neural network system and, more particularly, to an apparatus for efficient, high-speed parallel processing of the convolution operations of a convolutional neural network (CNN) used in image-based deep neural networks.

Artificial intelligence (AI) semiconductor design technology, which mimics the human brain, has been developing for decades. Its progress, however, stalled because of the limited computational capacity of silicon-based semiconductors. Neural networks, which model the signal transmission of neurons by learning weights for input values, long received little attention because of these limits of semiconductor technology. Recently, with the continued scaling and advancement of semiconductor processes, AI semiconductor design technology and neural network models have come back into the spotlight.

AI semiconductors can use large amounts of input information to implement thinking, reasoning, behavior, and operation optimized for a specific service. As the concepts of the multi-layer perceptron (MLP) and neural network circuits have been introduced into AI semiconductor technology, the application fields of AI technology have been diversifying.

Recently, a variety of deep learning techniques have emerged. Among them, image-based applications use convolutional neural networks (CNNs). A CNN is a type of deep neural network composed of a plurality of convolutional layers and pooling layers that reduce the size of the feature maps. Each convolutional layer receives M input feature maps (IFMs) and generates N output feature maps (OFMs). The number of layers in such networks ranges from tens to hundreds and continues to grow. Accordingly, various parallel processing hardware techniques are being developed to process CNNs at high speed.

An object of the present invention is to provide an apparatus and method that maximize parallel operation efficiency by raising the utilization rate of processing elements (PEs), which is achieved by changing the dimensional structure of the convolution operation in consideration of the dimensional characteristics of each convolutional layer of a deep convolutional neural network.

A convolution operation device for performing convolutional neural network processing according to an embodiment of the present invention includes: an input sharing network including first input feature map (IFM) registers and second IFM registers, which are arranged in rows and columns and which sequentially shift an IFM input in row units in a row direction or a column direction and output the shifted IFM; a first MAC array connected to the first IFM registers; an IFM switching network that selects either the first IFM registers or the second IFM registers; a second MAC array connected to whichever of the first IFM registers and the second IFM registers is selected by the IFM switching network; and an output shift network that shifts the output feature map output from the first MAC array and the second MAC array and transfers it to an output memory.

The convolutional layers of recently developed deep convolutional neural networks take a variety of forms. A parallel convolution operation device with a fixed structure and data path cannot adapt its computation structure to these varied layers, so some PEs sit idle, which reduces computational efficiency. According to an embodiment of the present invention, a convolution device and method capable of converting the dimensions of the two-dimensional MAC arrays to suit the characteristics of each convolutional layer can increase parallel processing efficiency.

FIG. 1 is a diagram briefly showing an example of a convolution operation performed in a deep neural network according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware structure of a neural network system for high-speed parallel processing of the convolution operation of FIG. 1.
FIG. 3 is a block diagram illustrating a dimension-convertible convolution operation device according to an embodiment of the present invention.
FIG. 4 is a block diagram showing the structure of a convolution operation device according to another embodiment of the present invention.
FIG. 5 is a block diagram showing another structure of a convolution operation device according to an embodiment of the present invention.
FIG. 6 is a block diagram showing still another structure of a convolution operation device according to an embodiment of the present invention.
FIG. 7 is a block diagram showing still another structure of a convolution operation device according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described clearly and in detail so that a person of ordinary skill in the art can easily practice the invention.

FIG. 1 is a diagram briefly showing an example of a convolution operation performed in a deep neural network according to an embodiment of the present invention. Referring to FIG. 1, output feature maps 131, 132, 133, and 134 are generated from input feature maps 111, 112, and 113 by a convolution operation.

The input feature map 111 is converted into the array-form feature map 131 by convolution operations with the weight kernels 121, 122, and 123. That is, adding up the element values produced by convolving the weight kernels 121, 122, and 123 with the input feature map 111 at an overlapping position yields the feature value corresponding to one point 141 of the output feature map 131. Performing the convolution while shifting the weight kernels 121, 122, and 123 over every position of the input feature map 111 generates one output feature map 131. Applying this convolution operation to the M input feature maps 111, 112, and 113 of size X×Y generates N output feature maps 154 of size C×R.
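The layer computation described above can be sketched in plain Python. This is a minimal illustration, not the claimed hardware: the function name `conv_layer`, the nested-list data layout, and the choice of stride 1 with no padding (so that R = Y - K + 1 and C = X - K + 1) are assumptions made for the example and are not stated in the specification.

```python
def conv_layer(ifms, kernels):
    """Compute N output feature maps from M input feature maps.

    ifms:    list of M maps, each a list of Y rows with X values.
    kernels: kernels[n][m] is the K-by-K weight kernel that links
             input map m to output map n.
    Assumes stride 1 and no padding (illustrative choice).
    """
    M = len(ifms)
    Y, X = len(ifms[0]), len(ifms[0][0])
    N = len(kernels)
    K = len(kernels[0][0])
    R, C = Y - K + 1, X - K + 1  # output rows and columns

    ofms = [[[0] * C for _ in range(R)] for _ in range(N)]
    for n in range(N):
        for r in range(R):
            for c in range(C):
                acc = 0
                # Sum the products over all input maps and kernel positions.
                for m in range(M):
                    for i in range(K):
                        for j in range(K):
                            acc += ifms[m][r + i][c + j] * kernels[n][m][i][j]
                ofms[n][r][c] = acc
    return ofms
```

Each output point is the sum of K×K×M products, which is the quantity that one MAC in the devices described later accumulates as its partial sum.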

The convolution operation described above corresponds to one convolutional layer. A deep convolutional neural network consists of anywhere from several to several hundred convolutional layers. The first convolutional layer is computed from the input image to generate an output feature map (OFM), and that OFM in turn becomes the input feature map (IFM) of the next convolutional layer. This process is repeated until all layers have been processed.

FIG. 2 is a block diagram showing the hardware structure of a neural network system for high-speed parallel processing of the convolution operation of FIG. 1. Referring to FIG. 2, the neural network system 200 includes, by way of example, a processing element (PE) array 210 that processes input data provided from an input memory 230. The processing element array 210 convolves the input data with the weight kernel provided from a weight memory 220 and transfers the result to an output memory 240.

The processing element array 210 may be configured as a two-dimensional systolic array; in the drawing it is assumed, by way of example, to contain a 4×4 arrangement of processing elements (PEs). Each of the processing elements PE_11 to PE_44 may perform a convolution operation using an input feature from one of the input memories 230 and a kernel weight provided from one of the weight memories 220.

The weight memory 220 provides the parameters needed for the convolution operation, bias addition, activation (ReLU), pooling, and the like performed in the processing element array 210. In particular, the weight memory 220 may provide the kernel weights (weight1 to weight4) required for the convolution operation to the processing element array 210 in units of columns. The weight memory 220 may include a plurality of weight memories 222, 224, 226, and 228, each corresponding to a column of the processing element array 210.

The input data provided from the input memory 230 to the processing element array 210 may be, for example, image data. The data values of the input data are loaded into the input memory 230. The size of the input memory 230 may vary with the size of the kernel used for the convolution operation. For example, when the kernel size is K×K, the input memory 230 must be loaded with enough input data for the processing element array 210 to perform the convolution with the kernel sequentially. The input memory 230 may include a plurality of input memories 232, 234, 236, and 238 that provide input features to the respective rows of the processing element array 210.

The output memory 240 is loaded with the results of the convolution or pooling operations executed by the processing element array 210. The result values loaded into the output memory 240 are updated according to the result of each convolution loop executed with the plurality of kernels. The values loaded into the output memory 240 are then returned to the input memory 230, where they may be used as the input values of the convolution operation of another layer. The output memory 240 may include a plurality of output memories 242, 244, 246, and 248 that receive output features from the respective rows of the processing element array 210.
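The data flow of FIG. 2, in which input features arrive per row and kernel weights arrive per column, can be pictured with the following simplified sketch. It is an assumption-laden illustration only: the function name `pe_array_step` is hypothetical, and the cycle-by-cycle skew and neighbor-to-neighbor forwarding of an actual systolic array are deliberately ignored.

```python
def pe_array_step(acc, row_inputs, col_weights):
    """One multiply-accumulate step of a rows-by-columns PE array.

    acc[r][c] is the running partial sum held by the PE at row r, column c.
    row_inputs[r] models the feature supplied by the input memory of row r;
    col_weights[c] models the weight supplied by the weight memory of column c.
    Timing skew between PEs is not modeled (simplifying assumption).
    """
    for r, x in enumerate(row_inputs):
        for c, w in enumerate(col_weights):
            acc[r][c] += x * w
    return acc


# Example: a 4x4 array fed with one feature per row and one weight per column.
acc = [[0] * 4 for _ in range(4)]
pe_array_step(acc, [1, 2, 3, 4], [10, 20, 30, 40])
```

Every PE needs a fresh feature, weight, and partial sum on each cycle, which is the memory-traffic burden that the following paragraph identifies.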

The convolution operations in the neural network system 200 must be performed continuously, cycle by cycle. In this structure, the input feature map (IFM), the kernel weights, and the partial sums output by the convolution must be transferred to the memories in every cycle. The convolution operation in the neural network system 200 is therefore difficult to control. In addition, the processing element array 210, configured as a two-dimensional array, needs to be changed into an efficient processing structure that matches the dimensional characteristics (M, Y, X, N, R, and C of FIG. 1) of each convolutional layer.

FIG. 3 is a block diagram illustrating a dimension-convertible convolution operation device according to an embodiment of the present invention. Referring to FIG. 3, the convolution operation device 300 may include two-dimensional multiply-accumulator (MAC) arrays 310 and 312, an IFM switching network 315, weight memories 322 and 324, a weight switching network 326, input feature map buffers 331, 333, 335, and 337, an input sharing network 332, an output shift network 342, an output feature map buffer 344, and a controller 350.

Each of the MAC arrays 310 and 312 includes a plurality of MAC cores arranged in two dimensions, like the processing element array 210 described above. Although two MAC arrays 310 and 312 are shown by way of example, the number of MAC arrays can be extended up to the number of dimensions of the output feature map (N in FIG. 1). Each of the MAC arrays 310 and 312 is described here using, as an example, a structure containing a 3×3 arrangement of MAC cores.

The first MAC array 310 includes a plurality of MAC cores {MAC(i, j), i and j = 1, 2, 3}. The first MAC array 310 performs a convolution operation using the first weight W1 provided from the weight switching network 326 and the input values provided from the IFM registers {IFM(i, j), i and j = 1, 2, 3} of the input sharing network 332. The second MAC array 312 may include a plurality of MAC cores {MAC(i, j), i and j = 1, 2, 3} and the IFM switching network 315. The second MAC array 312 performs a convolution operation using the second weight W2 provided from the weight switching network 326 and the input values selected by the IFM switching network 315.

Depending on the mapping condition, the IFM switching network 315 may provide the second MAC array 312 with input features from either the IFM registers {IFM(i, j), i and j = 1, 2, 3} or the IFM registers {IFM(i, j), i = (1, 2, 3), j = (4, 5, 6)}.

The weight memories 322 and 324 and the weight switching network 326 receive the weight kernels for the convolution operation and deliver them to the MAC arrays 310 and 312. The first weight W1 may be provided through the first weight memory 322, and the second weight W2 may be provided through the second weight memory 324. The weight switching network 326 delivers the first weight W1 and the second weight W2 to the first MAC array 310 and the second MAC array 312, respectively.

The input sharing network 332 is configured as a network of two-dimensional IFM registers {IFM(i, j), i and j = 1, 2, 3} that can transfer data in the left, right, and up directions. The numbers of rows and columns of the input sharing network 332 may be determined by the number of dimensions and the size of the two-dimensional MAC arrays 310 and 312 and by the size K of the weight kernel. When the weight kernel size is K, the boundary register 334 requires an additional (K-1) rows and columns.

The number of rows of the input sharing network 332 is the sum of the size of the MAC arrays 310 and 312 and the number of rows of the boundary register 334. That is, when the weight kernel size K is 3, the input sharing network 332 has 3+2 = 5 rows. The number of columns of the input sharing network 332 equals {the number of dimensions of the MAC arrays 310 and 312} × {the size of the MAC arrays 310 and 312} + {the number of columns of the boundary register 334}. If the weight kernel size K is 3 and the number of dimensions of the two-dimensionally arranged MAC arrays 310 and 312 is 2, as illustrated, the input sharing network 332 has 3×2+2 = 8 columns.
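The sizing rule above reduces to two short formulas, sketched here for convenience. The function and parameter names are hypothetical; `num_arrays` stands for what the text calls the number of dimensions of the MAC arrays, and `mac_size` for the side length of each square MAC array.

```python
def input_sharing_network_shape(mac_size, num_arrays, kernel_size):
    """Return (rows, columns) of the input sharing network.

    rows    = mac_size + (K - 1)
    columns = num_arrays * mac_size + (K - 1)
    where (K - 1) is the extra rows/columns of the boundary register.
    """
    boundary = kernel_size - 1
    rows = mac_size + boundary
    cols = num_arrays * mac_size + boundary
    return rows, cols


# The illustrated case: two 3x3 MAC arrays and a 3x3 weight kernel.
print(input_sharing_network_shape(mac_size=3, num_arrays=2, kernel_size=3))  # (5, 8)
```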

Each MAC of the two-dimensional MAC arrays 310 and 312 described above has its own output register. Each MAC stores its partial sum in the output register and accumulates the partial sum there while the convolution operation iterates. The output register of each MAC is mapped one-to-one to a register of the output shift network 342. After the final output of the output feature map (OFM) has been stored in the output register of each MAC of the MAC arrays 310 and 312, the subsequent convolution operation begins. The separate output shift network 342 is used so that the finished OFM can be read out while the convolution operation is being performed.

The registers {OUT(i, j), i = (1, 2, 3), j = (1, 2, 3)} of the output shift network 342 are connected in a zigzag chain, as illustrated, and transfer the stored output feature map (OFM) to the output feature map buffer 344 either sequentially or by hopping. The arrangement of the registers of the output shift network 342 may depend on the dimensions and size of the MAC arrays 310 and 312.

The outputs of the IFM registers {IFM(i, j), i = (1, 2, 3), j = (1, 2, 3, 4, 5, 6)} of the input sharing network 332 described above are mapped to the inputs of the respective MAC arrays 310 and 312. The mapping method depends on how the dimensions of the two-dimensional MAC arrays 310 and 312 are configured. When two MAC arrays 310 and 312 are provided, as illustrated, two mapping methods can be used.

The first mapping method maps the two MAC arrays 310 and 312 so that they independently compute the (N-2) dimension and the (N-1) dimension, respectively, among the N dimensions of the output feature map (OFM). The output of each MAC of the first MAC array 310 corresponds to a pixel of the (N-2) dimension of the OFM. The output of each MAC of the second MAC array 312 corresponds to a pixel of the (N-1) dimension of the OFM. In this case, the weight kernel value of the (N-2) dimension input to the MACs of the first MAC array 310 is shared by all of them as the same value W1. Likewise, the weight kernel value of the (N-1) dimension input to the MACs of the second MAC array 312 is shared by all of them as the same value W2.

In addition, the IFM input port of each MAC of the first MAC array 310 is mapped one-to-one to the output of the register at the same coordinates in the input sharing network 332. Likewise, the IFM input port of each MAC of the second MAC array 312 is mapped one-to-one to the output of the register at the same coordinates in the input sharing network 332. That is, the IFM input port of each MAC of the first MAC array 310 and the IFM input port of the corresponding MAC of the second MAC array 312 are connected to the output of the register at the same coordinates of the input sharing network 332.

The second mapping method merges the two MAC arrays 310 and 312 into the same dimension and maps both to the (N-2) dimension, one of the N dimensions of the output feature map (OFM). The outputs of the MACs of the first MAC array 310 and of the second MAC array 312 all correspond to pixels of the (N-2) dimension of the OFM. In this case, the weight kernel value of the (N-2) dimension of the OFM input to the MACs of the first MAC array 310 and the second MAC array 312 is shared identically by all of them. The IFM input port of each MAC of the first MAC array 310 is mapped one-to-one to the output of the register at the same coordinates in the input sharing network 332. The IFM input port of each MAC of the second MAC array 312, however, is mapped one-to-one to the output of the register at coordinates shifted by the maximum array size (3) in the input sharing network 332. The IFM switching network 315 may be used to map the IFM input ports of the second MAC array 312 to these shifted coordinates of the input sharing network 332.
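The two mapping methods differ only in which IFM register feeds each MAC of the second array, which the following sketch makes explicit. The function name, the zero-based `array_index`, and the boolean `merged` flag are hypothetical conveniences; the column-direction shift follows the register ranges j = (4, 5, 6) given for the IFM switching network, and only the two-array case described in the text is modeled.

```python
def ifm_register_for_mac(array_index, i, j, merged, mac_size=3):
    """Return the (row, column) of the IFM register feeding MAC(i, j).

    array_index: 0 for the first MAC array, 1 for the second.
    merged:      False for the first mapping method (each array computes
                 its own OFM dimension from the same IFM registers),
                 True for the second (both arrays compute the same OFM
                 dimension, the second on a column-shifted region).
    Coordinates are 1-based, as in the text.
    """
    if merged and array_index == 1:
        # Second method: shift by the array size along the column index.
        return (i, j + mac_size)
    return (i, j)


# MAC(2, 3) of the second array under each method.
print(ifm_register_for_mac(1, 2, 3, merged=False))  # (2, 3)
print(ifm_register_for_mac(1, 2, 3, merged=True))   # (2, 6)
```

In hardware terms, the `merged` flag plays the role of the selection made by the IFM switching network 315.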

As described above, a method is provided for mapping the IFM input port of each MAC to a register output port of the input sharing network 332 according to the dimensional configuration of the two-dimensionally arranged MAC arrays 310 and 312. A method is likewise provided for mapping each MAC to a weight kernel input port according to that dimensional configuration.

Once the IFM input path and the weight kernel input path have been determined, the input feature map (IFM) must be split by row of the input sharing network 332 and input separately for the convolution operation. In the illustrated embodiment the number of rows is 5, so the IFM is split into five parts and input. The five IFM parts are passed through the input sharing network 332 with the same shift sequence applied to every row (left→left→up→right→right→up→left→left). As each register of the input sharing network 332 passes the split IFM to its adjacent register following this shift sequence, the arrangement of IFM values needed for a convolution with a 3×3 weight kernel is formed.

The values of the weight kernel are also input sequentially to the MAC arrays 310 and 312, in step with the cycles in which the input feature map (IFM) is shifted. In this way, each MAC in the two MAC arrays 310 and 312 generates the partial sum corresponding to one pixel of the output feature map (OFM). As the input feature maps of size X×Y×M of FIG. 1 are strided through the IFM switching network 315 in this manner, the MAC arrays 310 and 312 generate the output feature map (OFM) that is the final result of the convolution operation.
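The effect of the shift sequence can be checked with a short behavioral sketch. It tracks only the window offset that every MAC sees after each shift rather than simulating the registers themselves; the names `SHIFTS`, `DELTA`, `kernel_offsets`, and `mac_array_outputs` are hypothetical, and the pairing of each offset with the weight at the same kernel position (no kernel flip) is an assumption of the example.

```python
# Shift sequence from the text; a left shift exposes the value one column
# to the right, an up shift exposes the value one row below.
SHIFTS = ["left", "left", "up", "right", "right", "up", "left", "left"]
DELTA = {"left": (0, 1), "right": (0, -1), "up": (1, 0)}


def kernel_offsets(shifts=SHIFTS):
    """Return the (row, column) window offset seen at each step."""
    r, c = 0, 0
    offsets = [(r, c)]
    for s in shifts:
        dr, dc = DELTA[s]
        r, c = r + dr, c + dc
        offsets.append((r, c))
    return offsets


def mac_array_outputs(tile, weights, mac_size=3):
    """Accumulate partial sums for a mac_size x mac_size MAC array.

    tile:    (mac_size + K - 1) square block of IFM values.
    weights: K x K kernel; one weight is broadcast to all MACs per step.
    """
    acc = [[0] * mac_size for _ in range(mac_size)]
    for dr, dc in kernel_offsets():
        w = weights[dr][dc]  # shared by every MAC in this step
        for i in range(mac_size):
            for j in range(mac_size):
                acc[i][j] += tile[i + dr][j + dc] * w
    return acc
```

The eight shifts plus the starting position visit all nine positions of a 3×3 kernel exactly once in a serpentine order, so each MAC finishes one output pixel in nine multiply-accumulate steps while loading each IFM value from the buffers only once.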

Each MAC of the two-dimensionally arranged MAC arrays 310 and 312 has its own output register. Each MAC therefore stores its partial sum in that output register and accumulates the partial sum while the operations of each convolutional layer are repeated. The output register of each MAC is mapped one-to-one to a register of the output shift network 342. The final output feature map (OFM) value output from each MAC of the MAC arrays 310 and 312 is stored in the output register and can then be used for the operation of the next convolutional layer. Each MAC has a separate output register so that the finished OFM can be read out while the convolution operation is being performed. The registers {OUT(x, y), x = (1, 2, 3), y = (1, 2, 3)} of the output shift network 342 are connected in a zigzag chain, as in the illustrated structure, and transfer the stored OFM to the OFM buffer 344 either sequentially or by hopping.

The controller 350 may control the operation of the components included in the convolutional operation device 300. For example, the controller 350 may control the paths of the input feature map (IFM) and the output feature map (OFM) described above according to a set sequence.

The configuration and functions of the convolutional operation device 300 of the present invention have been described above by way of example. With the convolutional operation device 300, the utilization of the MACs can be increased by setting the dimensions to suit the dimensional characteristics of each convolutional layer.

FIG. 4 is a block diagram showing the structure of a convolutional operation device according to another embodiment of the present invention. Referring to FIG. 4, the convolutional operation device 400 may include N MAC arrays 412, 414, 416, and 418, IFM switching networks 413, 415, and 417, weight memories 422, 424, 426, and 428, a weight switching network 423, IFM buffers 431, 433, 435, 437, and 439, an input sharing network 432, an output shift network 442, an output feature map buffer 444, and a controller 450.

Each of the N MAC arrays 412, 414, 416, and 418 includes a plurality of MAC cores arranged in two dimensions. Here, the N MAC arrays 412, 414, 416, and 418 correspond to the number of dimensions (N) of the output feature map (OFM). That is, each of the MAC arrays 412, 414, 416, and 418 may correspond to one output feature map (OFM). Each of the MAC arrays 412, 414, 416, and 418 is illustrated as a 3×3 arrangement of MAC cores, but the present invention is not limited to the 3×3 form.

The first MAC array 412 may include a plurality of MAC cores provided in a 3×3 arrangement. The first MAC array 412 performs a convolution operation using the first weight W1 provided from the weight switching network 423 and the input values provided from the IFM registers 432a of the input sharing network 432.

The second MAC array 414 includes a plurality of MAC cores provided in a 3×3 arrangement. Through the IFM switching network 413, the second MAC array 414 may be mapped one-to-one to either the IFM registers 432a or the IFM registers 432b. The second MAC array 414 may perform a convolution operation using the second weight W2 provided from the weight switching network 423 and the input values selected by the IFM switching network 413.

The third MAC array 416 includes a plurality of MAC cores provided in a 3×3 arrangement. Through the IFM switching network 415, the third MAC array 416 may be mapped one-to-one to either the IFM registers 432a or the IFM registers 432c. The third MAC array 416 may perform a convolution operation using the (N-1)-th weight W_N-1 provided from the weight switching network 423 and the input values selected by the IFM switching network 415.

Through the IFM switching network 417, the fourth MAC array 418 may be mapped one-to-one to either the IFM registers 432a or the IFM registers 432d. The fourth MAC array 418 performs a convolution operation using the N-th weight W_N provided from the weight switching network 423 and the input values selected by the IFM switching network 417.

The weight memories 422, 424, 426, and 428 and the weight switching network 423 receive the weight kernels for the convolution operation and deliver them to the MAC arrays 412, 414, 416, and 418. For example, the first weight W1 may be provided through the first weight memory 422, and the second weight W2 through the second weight memory 424. Likewise, the (N-1)-th weight W_N-1 may be provided through the (N-1)-th weight memory 426, and the N-th weight W_N through the N-th weight memory 428. The weight switching network 423 delivers the weights W1 to W_N to the MAC arrays 412, 414, 416, and 418, respectively.

The input sharing network 432 is a network of two-dimensional IFM registers that can transfer data in the left, right, and up directions. The numbers of rows and columns of the input sharing network 432 may be determined by the number of dimensions and the size of the two-dimensional MAC arrays 412, 414, 416, and 418 and by the size (K) of the weight kernel. Assuming a weight kernel size of 3, the illustrated input sharing network 432 has two additional rows and two additional columns, corresponding to boundary registers, on top of the IFM registers 432a to 432d.

The number of rows of the input sharing network 432 is the sum of the size of the MAC arrays 412, 414, 416, and 418 and the number of rows of the boundary registers (not shown), that is, 3+2 = 5. The number of columns of the input sharing network 432 is {number of dimensions of the MAC arrays 412, 414, 416, and 418 (4)} × {size of the MAC arrays 412, 414, 416, and 418 (3)} + {number of columns of the boundary registers 334 (2)} = 14. If the weight kernel size (K) is 3 and the number of dimensions of the two-dimensionally arranged MAC arrays 310 and 312 is 2, as illustrated, the number of columns of the input sharing network 332 is 3×2+2 = 8.
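The row and column counts above follow one pattern, which the short Python sketch below reproduces. It assumes the boundary registers contribute K−1 rows and K−1 columns, which matches the value 2 used in the text for K = 3. The function name `input_sharing_network_shape` and its parameter names are hypothetical.

```python
def input_sharing_network_shape(num_arrays, array_size, kernel_size):
    """Return (rows, cols) of the input sharing network.

    num_arrays:  number of MAC arrays placed side by side (the "dimension" count)
    array_size:  side length of each square MAC array (3 for a 3x3 array)
    kernel_size: weight kernel size K; boundary registers are assumed to
                 add K - 1 rows and K - 1 columns
    """
    boundary = kernel_size - 1
    rows = array_size + boundary
    cols = num_arrays * array_size + boundary
    return rows, cols


# Four 3x3 MAC arrays with K = 3 (the FIG. 4 example).
print(input_sharing_network_shape(4, 3, 3))
# Two 3x3 MAC arrays with K = 3 (the FIG. 3 example).
print(input_sharing_network_shape(2, 3, 3))
```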

The input sharing network 432 passes the input feature map (IFM) provided from the IFM buffers 431, 433, 435, 437, and 439 along the shift directions (left→left→up→right→right→up→left→left), identically for every row. The input feature map (IFM) is thereby delivered to the MAC arrays 412, 414, 416, and 418 through the registers of the input sharing network 432.
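One way to read the eight-step shift sequence is as a serpentine walk over the nine positions of a 3×3 kernel window. The Python sketch below traces that walk under an assumed convention: shifting the data left by one register presents each MAC with the pixel originally one column to its right, and shifting up presents the pixel originally one row below. The names `SHIFT_SEQUENCE`, `DELTA`, and `kernel_offsets` are hypothetical.

```python
SHIFT_SEQUENCE = ["left", "left", "up", "right", "right", "up", "left", "left"]

# Assumed convention: (row offset, column offset) of the pixel a fixed MAC
# sees after the data in the register network moves one step in that direction.
DELTA = {"left": (0, 1), "right": (0, -1), "up": (1, 0)}


def kernel_offsets(sequence=SHIFT_SEQUENCE):
    """Return the (ky, kx) kernel offsets seen by a MAC, cycle by cycle."""
    ky, kx = 0, 0
    visited = [(ky, kx)]  # offset seen before any shift
    for step in sequence:
        dy, dx = DELTA[step]
        ky, kx = ky + dy, kx + dx
        visited.append((ky, kx))
    return visited


offsets = kernel_offsets()
# Eight shifts plus the starting position give nine cycles, one per kernel tap.
print(len(offsets), len(set(offsets)))
```

Under this reading, every tap of a 3×3 kernel is visited exactly once without reloading the registers from the IFM buffers.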

The output shift network 442 is connected as a zigzag register chain and transfers the stored output feature map (OFM) to the output feature map buffer 444 in a sequential or hopping manner. The arrangement of the registers of the output shift network 442 depends on the number of dimensions and the size of the MAC arrays 412, 414, 416, and 418.

FIG. 5 is a block diagram showing another structure of the convolutional operation device according to an embodiment of the present invention. Referring to FIG. 5, the convolutional operation device 500 may include N MAC arrays 512, 514, 516, and 518, weight memories 522, 524, 526, and 528, a weight switching network 523, IFM buffers 531, 533, 535, 537, and 539, an input sharing network 532, an output shift network 542, an output feature map buffer 544, and a controller 550. In the structure shown in FIG. 5, the IFM switching networks 413, 415, and 417 are not present.

Each of the N MAC arrays 512, 514, 516, and 518 includes a plurality of MAC cores arranged in two dimensions. Here, the N MAC arrays 512, 514, 516, and 518 correspond to the number of dimensions (N) of the output feature map (OFM). That is, each of the MAC arrays 512, 514, 516, and 518 may correspond to one output feature map (OFM). Each of the MAC arrays 512, 514, 516, and 518 is illustrated as a 3×3 arrangement of MAC cores, but the present invention is not limited to the 3×3 form.

The first MAC array 512 may include a plurality of MAC cores provided in a 3×3 arrangement. The first MAC array 512 performs a convolution operation using the first weight W1 provided from the weight switching network 523 and the input values provided from the IFM registers 532a of the input sharing network 532.

The second to fourth MAC arrays 514, 516, and 518 are mapped to the IFM registers 532a in substantially the same correspondence as the first MAC array 512. The second to fourth MAC arrays 514, 516, and 518 receive different kernel weights (W1 to W_N-1) from the weight switching network 523, but are mapped in common to the IFM registers 532a. Accordingly, as the input feature map (IFM) input through the IFM buffers 531, 533, 535, 537, and 539 is shifted on the weight switching network 523, the input feature map (IFM) is provided in common to the MAC arrays 512, 514, 516, and 518.

The numbers of rows and columns of the input sharing network 532 are the same as in the structure of FIG. 4. The input sharing network 532 passes the input feature map (IFM) provided from the IFM buffers 531, 533, 535, 537, and 539 along the shift directions (left→left→up→right→right→up→left→left), identically for every row. The input feature map (IFM) held in the IFM registers 532a of the input sharing network 532 is thereby delivered in common to the MAC arrays 512, 514, 516, and 518.

The output shift network 542 is connected as a zigzag register chain and transfers the stored output feature map (OFM) to the output feature map buffer 544 in a sequential or hopping manner. The arrangement of the registers of the output shift network 542 depends on the number of dimensions and the size of the MAC arrays 512, 514, 516, and 518. In this configuration, parallel processing efficiency can be high when the number of dimensions (N) of the output feature map (OFM) is large and the size (R, C) of the output feature map (OFM) is small.

FIG. 6 is a block diagram showing yet another structure of the convolutional operation device according to an embodiment of the present invention. Referring to FIG. 6, the convolutional operation device 600 may include N MAC arrays 612, 614, 616, and 618, weight memories 622, 624, 626, and 628, a weight switching network 623, IFM buffers 631, 633, 635, 637, and 639, an input sharing network 632, an output shift network 642, an output feature map buffer 644, and a controller 650. Unlike FIG. 4, the structure of the convolutional operation device 600 shown in FIG. 6 also has no separate IFM switching network.

Each of the N MAC arrays 612, 614, 616, and 618 includes a plurality of MAC cores arranged in two dimensions. Here, the N MAC arrays 612, 614, 616, and 618 are mapped to half the number of dimensions (N/2) of the output feature map (OFM). That is, the two MAC arrays 612 and 614 may be mapped to one output feature map (OFM), and the two MAC arrays 616 and 618 may be mapped to another output feature map (OFM).

With this mapping, the weight kernel needs values for only N/2 dimensions, so the number of weight memories can also be halved compared to the arrangement of FIG. 5. When the N/2-dimension weight kernels are stored in the weight memories 622, 624, 626, and 628, storing them in every second memory (jumping in multiples of 2) minimizes the number of MUX switches in the weight switching network 623.

The weight pair (W_N, W_N-1) shares the weight kernel of the same dimension. The MAC arrays 612 and 616 are each mapped to the IFM registers 632a, and the MAC arrays 614 and 618 are each mapped to the IFM registers 632b. Compared to the embodiment of FIG. 5, this configuration gives higher parallel processing efficiency when the number of dimensions (N) of the output feature map (OFM) of the convolutional layer is smaller and the size (R, C) of the output feature map (OFM) is larger.

FIG. 7 is a block diagram showing yet another structure of the convolutional operation device according to an embodiment of the present invention. Referring to FIG. 7, the convolutional operation device 700 may include N MAC arrays 712, 714, 716, and 718, weight memories 722, 724, 726, and 728, a weight switching network 723, IFM buffers 731, 733, 735, 737, and 739, an input sharing network 732, an output shift network 742, an output feature map buffer 744, and a controller 750.

Each of the N MAC arrays 712, 714, 716, and 718 includes a plurality of MAC cores arranged in two dimensions. Here, all of the N MAC arrays 712, 714, 716, and 718 are mapped to a single dimension of the output feature map (OFM). That is, the N MAC arrays 712, 714, 716, and 718 are merged into one dimension of the output feature map (OFM). In this case, only the weight kernel of one dimension is needed, so a single weight memory suffices. For example, the first weight memory 722 alone is enough to provide the weight kernel.

In addition, the MAC arrays 712, 714, 716, and 718 may be mapped to the IFM registers 732a, 732b, 732c, and 732d, respectively. In this configuration, parallel processing efficiency is high when the number of dimensions (N) of the output feature map (OFM) of the convolutional layer is small and the size (R, C) of the output feature map (OFM) is large.
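The three mappings of FIGS. 5 to 7 can be summarized as different ways of splitting the same N MAC arrays between OFM dimensions processed in parallel and spatial tiles of a single OFM. The Python sketch below enumerates those splits. Generalizing to every divisor of N is an assumption made for illustration, and the function name `mapping_modes` and the dictionary keys are hypothetical.

```python
def mapping_modes(num_arrays):
    """List the ways num_arrays MAC arrays can be split between OFM
    dimensions computed in parallel and spatial tiles per OFM dimension.

    Assumption: every divisor of num_arrays yields a valid mode. The text
    itself only describes the N, N/2, and 1 cases for N = 4.
    """
    modes = []
    for dims in range(1, num_arrays + 1):
        if num_arrays % dims == 0:
            modes.append({
                "ofm_dims_in_parallel": dims,
                "tiles_per_ofm_dim": num_arrays // dims,
                # One weight kernel is shared by all tiles of the same OFM dimension.
                "weight_kernels_needed": dims,
            })
    return modes


for mode in mapping_modes(4):
    print(mode)
```

For N = 4, the entry with four OFM dimensions in parallel corresponds to FIG. 5, the entry with two corresponds to FIG. 6, and the entry with one corresponds to FIG. 7.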

The foregoing describes specific examples for carrying out the present invention. The present invention includes not only the embodiments described above but also embodiments that can be obtained through simple or easily made design changes. The present invention also includes techniques that can be readily derived from the above embodiments and practiced in the future.

Claims

A convolutional operation device that performs convolutional neural network processing, comprising:
an input sharing network including first input feature map (IFM) registers and second input feature map (IFM) registers, which are arranged in rows and columns and which sequentially shift an input feature map (IFM), input in units of rows, in a row direction or a column direction and output the shifted input feature map (IFM);
a first MAC array connected to the first input feature map (IFM) registers;
an input feature map (IFM) switching network that selects one of the first input feature map (IFM) registers and the second input feature map (IFM) registers;
a second MAC array connected to whichever of the first input feature map (IFM) registers and the second input feature map (IFM) registers is selected by the input feature map (IFM) switching network; and
an output shift network that shifts the output feature map (OFM) output from the first MAC array and the second MAC array and transfers it to an output memory.

The convolutional operation device of claim 1,
wherein the number of rows of the input sharing network corresponds to the sum of the number of rows of the first or second MAC array and the kernel size minus one.

The convolutional operation device of claim 2,
wherein the number of columns of the input sharing network corresponds to the product of the number of dimensions and the size of the first or second MAC array, plus the kernel size minus one.

The convolutional operation device of claim 1,
further comprising a weight memory that provides a first kernel weight or a second kernel weight to the first MAC array and the second MAC array according to the selection of the input feature map (IFM) switching network.

The convolutional operation device of claim 4,
wherein, when the input feature map (IFM) switching network selects the first input feature map (IFM) registers, the weight memory provides the first kernel weight to the first MAC array and the second kernel weight to the second MAC array.

The convolutional operation device of claim 5,
wherein, when the input feature map (IFM) switching network selects the second input feature map (IFM) registers, the weight memory provides the first kernel weight to the first MAC array and the second MAC array.

The convolutional operation device of claim 4,
further comprising a weight switching network that selectively provides the first kernel weight and the second kernel weight output from the weight memory to the first MAC array and the second MAC array.

The convolutional operation device of claim 1,
wherein the output shift network includes a register chain that shifts the output feature map (OFM) in a zigzag pattern and transfers it to the output memory.

The convolutional operation device of claim 8,
wherein the arrangement of the register chain of the output shift network changes according to the selection of the input feature map (IFM) switching network.

The convolutional operation device of claim 1,
wherein the input feature map (IFM) switching network includes a plurality of multiplexers that select one of the first input feature map (IFM) registers and the second input feature map (IFM) registers.

The convolutional operation device of claim 1,
wherein each of the first input feature map (IFM) registers and the second input feature map (IFM) registers sequentially shifts the input feature map (IFM) along the shift directions (left→left→up→right→right→up→left→left).

The convolutional operation device of claim 1,
further comprising a controller that controls the input sharing network, the input feature map (IFM) switching network, and the output shift network.