KR20190030564A

KR20190030564A - Neural network accelerator including bidirectional processing element array

Info

Publication number: KR20190030564A
Application number: KR1020180042395A
Authority: KR
Inventors: 여준기; 권영수; 김찬; 김현미; 양정민; 정재훈; 조용철
Original assignee: 한국전자통신연구원
Priority date: 2017-09-14
Filing date: 2018-04-11
Publication date: 2019-03-22
Also published as: KR102490862B1

Abstract

According to an embodiment of the present invention, a neural network accelerator for performing an operation of a neural network including layers may comprise: a kernel memory storing kernel data related to a filter; a feature map memory storing feature map data, which is output of the layers; and a PE array including processing elements (PEs) arranged along first and second directions. Each of the PEs may perform an operation using the feature map data transmitted in the first direction from the feature map memory and kernel data transmitted in the second direction from the kernel memory, and may transmit an operation result to the feature map memory in a third direction opposite to the first direction.

Description

Neural network accelerator including bidirectional processing element array

The present invention relates to a neural network accelerator, and more particularly to a neural network accelerator including a bidirectional processing element array.

An artificial neural network (ANN) can process data or information in a manner similar to a biological neural network. The convolutional neural network (CNN), one of the deep neural network (DNN) techniques, is being studied as a technology for image recognition. In particular, a CNN can provide effective performance in recognizing various targets such as objects, characters, handwriting, and images.

A CNN composed of a plurality of layers can be implemented in hardware using a semiconductor device. When a CNN is implemented in hardware, the demand for hardware resources may increase because of the amount of computation that must be processed in each of the layers constituting the CNN. The frequency of memory accesses required to perform the operations of the CNN may also increase. Therefore, there is a need for a technology that reduces the constraints on implementing a CNN in hardware.

The present invention is intended to solve the technical problem described above, and may provide a neural network accelerator including a bidirectional processing element array.

A neural network accelerator for performing an operation of a neural network including layers, according to an embodiment of the present invention, may include a kernel memory storing kernel data related to a filter, a feature map memory storing feature map data that are outputs of the layers, and a PE array including processing elements (PEs) arranged along a first direction and a second direction. Each of the PEs may perform an operation using feature map data transmitted in the first direction from the feature map memory and kernel data transmitted in the second direction from the kernel memory, and may transmit an operation result to the feature map memory in a third direction opposite to the first direction.

A neural network accelerator for performing an operation of a neural network including layers, according to another embodiment of the present invention, may include a kernel memory storing kernel data related to a filter, a feature map memory storing feature map data that are outputs of the layers, and a PE array including processing elements (PEs) arranged along a first direction and a second direction and activation units arranged, in the second direction, between the PEs and the feature map memory. Each of the PEs may perform a first operation using feature map data transmitted in the first direction from the feature map memory and kernel data transmitted in the second direction from the kernel memory, and may transmit a first operation result to the activation units in a third direction opposite to the first direction. The activation units may perform a second operation on the first operation result and transmit a second operation result to the feature map memory.

The neural network accelerator according to an embodiment of the present invention can perform the convolution operation, the activation operation, the normalization operation, and the pooling operation inside the bidirectional PE array and output the operation result to the memory at once. In addition, the neural network accelerator can receive a partial sum, which is an intermediate operation result, from the bidirectional PE array, or transmit a partial sum to the bidirectional PE array. Therefore, the neural network accelerator can minimize memory accesses and perform neural network operations efficiently.

FIG. 1 illustrates an exemplary CNN according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating an exemplary neural network accelerator for performing the operations of the CNN of FIG. 1.
FIG. 3 is a block diagram showing the PE array of FIG. 2 in more detail.
FIG. 4 is a block diagram showing the PE of FIG. 3 in more detail.
FIG. 5 is a flowchart illustrating how the PE of FIG. 4 receives an operation result in the X-axis direction in response to an output command.
FIG. 6 illustrates a first PE and a second PE receiving an output command in the X-axis direction.
FIG. 7 is a flowchart illustrating how the PE of FIG. 4 transmits a partial sum in the direction opposite to the X axis in response to a load partial sum command or a pass partial sum command.
FIG. 8 illustrates first to third PEs receiving a load partial sum command and a pass partial sum command in the direction opposite to the X axis.
FIG. 9 is a block diagram illustrating the PE array of FIG. 2 according to another embodiment of the present invention.

In the following, embodiments of the present invention are described clearly and in detail so that those of ordinary skill in the art can readily practice the present invention.

The present invention relates to a neural network accelerator for performing operations of a neural network. The neural network of the present invention may be an artificial neural network (ANN) capable of processing data or information in a manner similar to a biological neural network. The neural network may include a plurality of layers including artificial neurons similar to biological neurons. Neural networks can be used in the field of artificial intelligence to recognize and classify objects. Hereinafter, a convolutional neural network (CNN) for image recognition is described as a representative example; however, the neural network accelerator of the present invention is not limited to CNNs and can also be used to implement neural networks other than CNNs.

In general, a convolution operation may refer to an operation for detecting a correlation between two functions. In a CNN, a convolution operation between input data, or a kernel indicating a specific feature, and specific variables (e.g., weights, biases) is performed repeatedly, whereby a pattern of an image can be determined or features of the image can be extracted.

FIG. 1 illustrates an exemplary CNN according to an embodiment of the present invention. Referring to FIG. 1, the CNN 100 may include first to fifth layers L1 to L5.

The CNN may receive first data DATA1 from the outside (e.g., a host or a memory). The first data DATA1 may represent an input image or input data provided to the CNN 100. For example, the input image may be generated by an imaging device including a plurality of pixels. The input image may have a size of W1 X H1 X D1. As another example, the first data DATA1 may be an output of another layer.

The first layer L1 may perform a convolution operation on the first data DATA1 using a first kernel K1 and generate second data DATA2. The first kernel K1 may represent a filter, mask, window, or the like used to extract correlations among spatially adjacent values in the first data DATA1. The first layer L1 may extract a feature of the first data DATA1. The first data DATA1 and the second data DATA2 output from the first layer L1 may each be referred to as a feature map.

The first layer L1 may multiply the input values of the first data DATA1 by the weight values of the first kernel K1, add all of the multiplication results (that is, perform a convolution operation), and perform an activation operation on the addition result. The first layer L1 may perform the activation operation using a nonlinear function such as the ReLU (Rectified Linear Unit) function, the Leaky ReLU function, the sigmoid function, or the tanh (hyperbolic tangent) function. The first layer L1 may also perform a normalization operation on the results of the activation operation. Through the normalization operation, the ranges of the output values of the first layer L1 may be matched, or the distributions of the output values of the first layer L1 may become similar.
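
The multiply, add, and activate sequence described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the accelerator's implementation: the function names `relu` and `conv2d_relu`, the `bias` parameter, and the choice of no padding with a stride of 1 are assumptions made for the example, and the normalization step is omitted.

```python
def relu(x):
    """Rectified Linear Unit: keep positive values, clamp negatives to 0."""
    return x if x > 0 else 0.0


def conv2d_relu(feature_map, kernel, bias=0.0):
    """Slide `kernel` over `feature_map` (assumed: no padding, stride 1).

    At each position the overlapping values are multiplied, the products
    are summed together with the bias, and the activation is applied.
    """
    fh, fw = len(feature_map), len(feature_map[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(fh - kh + 1):
        row = []
        for c in range(fw - kw + 1):
            acc = bias
            for i in range(kh):
                for j in range(kw):
                    acc += feature_map[r + i][c + j] * kernel[i][j]
            row.append(relu(acc))
        out.append(row)
    return out
```

A 3 x 3 input with a 2 x 2 kernel yields a 2 x 2 output under these assumptions; any position whose sum is negative is clamped to zero by the ReLU.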

The second data DATA2 may have a size of W2 X H2 X D2, and the size of the second data DATA2 may be the same as or different from the size of the first data DATA1. For example, if data padding is present in the edge region of the first data DATA1, the size of the second data DATA2 may be the same as the size of the first data DATA1, as shown in FIG. 1.
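
The effect of padding on the output size can be checked with the standard convolution size relation, which the source does not state explicitly. The sketch below assumes a square kernel and introduces a hypothetical `stride` parameter that the text does not mention.

```python
def conv_output_size(in_size, kernel_size, padding=0, stride=1):
    """Output width (or height) of a convolution along one axis.

    Standard relation: floor((in + 2 * padding - kernel) / stride) + 1.
    `padding` is the number of padded values added on each edge.
    """
    return (in_size + 2 * padding - kernel_size) // stride + 1
```

With a 3-wide kernel, one padded value on each edge keeps the output the same width as the input, while no padding shrinks it by two.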

The second layer L2 may perform a pooling operation on the second data DATA2 using a second kernel K2. The pooling operation may be referred to as a sub-sampling operation. The second layer L2 may select the value of a pixel at a fixed position among the values of the second data DATA2 corresponding to the second kernel K2, compute the average of the values of the second data DATA2 corresponding to the second kernel K2 (i.e., average pooling), or select the largest of the values of the second data DATA2 corresponding to the second kernel K2 (i.e., max pooling). Third data DATA3 output from the second layer L2 may also be referred to as a feature map and may have a size of W3 X H3 X D3. For example, the size of the third data DATA3 may be smaller than the size of the second data DATA2.
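
The three pooling variants described above can be illustrated as follows. This is a sketch under assumptions: the function name `pool2d`, the non-overlapping square window, and the use of the top-left value as the "fixed position" are choices made for the example only.

```python
def pool2d(feature_map, size, mode="max"):
    """Sub-sample `feature_map` with non-overlapping size x size windows."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for r in range(0, h - size + 1, size):
        row = []
        for c in range(0, w - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":        # max pooling: largest value
                row.append(max(window))
            elif mode == "average":  # average pooling: mean value
                row.append(sum(window) / len(window))
            elif mode == "fixed":    # fixed position: top-left (assumed)
                row.append(window[0])
            else:
                raise ValueError("unknown pooling mode: " + mode)
        out.append(row)
    return out
```

In each mode a 4 x 4 input pooled with a 2 x 2 window produces a 2 x 2 output, which matches the statement that the pooled feature map may be smaller than its input.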

The third layer L3 may perform a convolution operation, an activation operation, a normalization operation, and the like on the third data DATA3 using a third kernel K3. Next, the fourth layer L4 may perform a pooling operation on fourth data DATA4 using a fourth kernel K4. The third layer L3 may operate similarly to the first layer L1. For example, the first and third layers L1 and L3 may be convolution layers. The fourth layer L4 may operate similarly to the second layer L2. For example, the second and fourth layers L2 and L4 may be pooling layers.

In an embodiment, the sizes (W2 X H2 X D2, W3 X H3 X D3, W4 X H4 X D4, W5 X H5 X D5) of the second to fifth data DATA2 to DATA5 output from the first to fourth layers L1 to L4, respectively, may be the same as or different from one another. Although the CNN 100 is illustrated in FIG. 1 as including four layers L1 to L4, an actual CNN 100 may include more layers.

The size of the first data DATA1 may decrease as it passes through the first to fourth layers L1 to L4. The fifth data DATA5 may include features extracted as the first data DATA1 passes through the first to fourth layers L1 to L4. A feature included in the fifth data DATA5 may be a feature that can represent the first data DATA1. In addition, when the first to fourth kernels K1 to K4 of the first to fourth layers L1 to L4 are set in various ways, various features of the first data DATA1 can be extracted. Although the first to fourth kernels K1 to K4 are shown in FIG. 1 as all having the same size, their sizes may be the same as or different from one another.

The fifth layer L5 may perform a full connection operation on the fifth data DATA5 and generate output data OUTPUT DATA. For example, the fifth layer L5 may be a fully connected layer. The output data OUTPUT DATA may represent the result of recognizing or classifying the first data DATA1. Although the CNN 100 is illustrated in FIG. 1 as including one fifth layer L5, an actual CNN 100 may include more layers. The feature maps DATA1 to DATA5 and the kernels K1 to K4 of the first to fourth layers L1 to L4 of the CNN 100 have been briefly described above. An apparatus for implementing the CNN 100 in hardware is described below.

FIG. 2 is a block diagram illustrating an exemplary neural network accelerator for performing the operations of the CNN of FIG. 1. FIG. 2 is described with reference to FIG. 1. The neural network accelerator 1000 may include a processing element array (hereinafter, PE array) 1100, a feature map memory 1200, a kernel memory 1300, and a controller 1400. The neural network accelerator 1000 may be a computing device implemented in hardware to perform neural network operations, such as a neural network device, a neural network circuit, a hardware accelerator, or a processing unit. For example, the neural network accelerator 1000 may be implemented using various semiconductor devices such as a system on chip (SoC), an application specific integrated circuit (ASIC), a central processing unit (CPU), a graphics processing unit (GPU), a vision processing unit (VPU), and a neural processing unit (NPU).

The PE array 1100 may include processing elements (hereinafter, PEs) arranged along the X-axis and Y-axis directions. The PEs are described below with reference to FIG. 3. The PE array 1100 may be a systolic array that performs operations in step with a synchronization signal (e.g., a clock signal). For example, the operations may include the convolution operation, the activation operation, the normalization operation, and the pooling operation described above with reference to FIG. 1.

The PE array 1100 may receive feature map data transmitted in the X-axis direction from the feature map memory 1200. The X-axis direction may point from the feature map memory 1200 toward the PE array 1100. The PE array 1100 may receive kernel data transmitted from the kernel memory 1300 in the Y-axis direction, which is perpendicular to the X-axis direction. The Y-axis direction may point from the kernel memory 1300 toward the PE array 1100. Here, the feature map data may represent the first to fifth data DATA1 to DATA5 (the feature maps) of FIG. 1. The kernel data may be related to the first to fourth kernels K1 to K4 (or filters, windows, masks) of FIG. 1.

The PE array 1100 may perform operations using the feature map data and the kernel data. The PE array 1100 may transmit the operation result to the feature map memory 1200 in the direction opposite to the X axis. Because the various operations of the CNN 100 can be processed inside the PE array 1100, the neural network accelerator 1000 need not include a separate block, circuit, or unit between the PE array 1100 and the feature map memory 1200 for processing those operations. The operation result of the PE array 1100 may be transmitted directly to the feature map memory 1200. Because the PE array 1100 receives feature map data in the X-axis direction and outputs operation results in the direction opposite to the X axis, the PE array 1100 may be a bidirectional PE array with respect to the X axis.

In an embodiment, the operation result of the PE array 1100 may be feature map data corresponding to the second to fifth data DATA2 to DATA5 of FIG. 1. Alternatively, the operation result of the PE array 1100 may be a partial sum, which is an intermediate operation result for generating feature map data. Because of constraints such as the number and speed of the PEs, the PE array 1100 may transmit a partial sum for generating new feature map data to the feature map memory 1200 before transmitting the new feature map data itself to the feature map memory 1200.
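
The role of a partial sum can be illustrated with a small sketch: when an accumulation is too long to finish in one pass, the intermediate result is stored and later reloaded as the starting value of the next pass. The function names and the `chunk` parameter below are hypothetical, and the local variable `stored` merely stands in for the feature map memory.

```python
def multiply_accumulate(inputs, weights, partial_sum=0):
    """Accumulate input * weight products on top of an earlier partial sum."""
    acc = partial_sum
    for x, w in zip(inputs, weights):
        acc += x * w
    return acc


def chunked_dot(inputs, weights, chunk):
    """Compute a dot product in several passes of at most `chunk` terms.

    After each pass the partial sum is written back to `stored`
    (standing in for the feature map memory) and reloaded for the next pass.
    """
    stored = 0
    for start in range(0, len(inputs), chunk):
        stored = multiply_accumulate(inputs[start:start + chunk],
                                     weights[start:start + chunk],
                                     stored)
    return stored
```

Splitting the accumulation does not change the final value; it only trades extra memory traffic for a smaller amount of work per pass.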

The feature map memory 1200 may store input data or feature map data corresponding to the first to fifth data DATA1 to DATA5 of FIG. 1. The feature map memory 1200 may also receive and store operation results transmitted from the PE array 1100 in the direction opposite to the X axis. The feature map memory 1200 may transmit feature map data to the PE array 1100 in the X-axis direction, or may transmit a previously stored earlier operation result of the PE array 1100 back to the PE array 1100 in the X-axis direction.

The kernel memory 1300 may store kernel data corresponding to the first to fourth kernels K1 to K4 of FIG. 1. The kernel memory 1300 may transmit kernel data to the PE array 1100 in the Y-axis direction. The kernel data may include the weight values of a kernel used in the convolution operation. A weight value may represent the connection strength between an arbitrary artificial neuron in an arbitrary layer and another artificial neuron. In an embodiment, the feature map memory 1200 and the kernel memory 1300 may be implemented in separate memory devices or in different regions of a single memory device.

In an embodiment, the positions of the feature map memory 1200 and the kernel memory 1300 relative to the PE array 1100 are not limited to those shown in FIG. 2. For example, the feature map memory 1200 may be located to the right of, above, or below the PE array 1100, and the kernel memory 1300 may be located below, to the left of, or to the right of the PE array 1100. In any case, regardless of their positions, each of the feature map memory 1200 and the kernel memory 1300 may transmit data from the relatively near PEs toward the relatively far PEs in the PE array 1100.

The controller 1400 may generate commands for controlling the PE array 1100, the feature map memory 1200, and the kernel memory 1300. The controller 1400 may generate a (global) clock signal to which the PE array 1100, the feature map memory 1200, and the kernel memory 1300 are synchronized. The PE array 1100 and the feature map memory 1200 may exchange feature map data and operation results based on the clock signal. The kernel memory 1300 may transmit kernel data to the PE array 1100 based on the clock signal.

FIG. 3 is a block diagram showing the PE array of FIG. 2 in more detail. FIG. 3 is described with reference to FIGS. 1 and 2. The PE array 1100 may include PEs 1110, feature map input/output (I/O) units 1120, and kernel load units 1130.

The PEs 1110 may be arranged along the X-axis and Y-axis directions and may form a two-dimensional array. Each of the PEs 1110 may perform an operation using feature map data transmitted in the X-axis direction from the feature map memory 1200 and kernel data transmitted in the Y-axis direction from the kernel memory 1300. Each of the PEs 1110 may transmit the operation result toward the feature map memory 1200 in the direction opposite to the X axis.

Each of the PEs 1110 may be synchronized to the clock signal generated by the controller 1400 of FIG. 2. Based on the clock signal, each of the PEs 1110 may transmit data to adjacent PEs (or next PEs) in the X-axis and Y-axis directions. More specifically, each of the PEs 1110 may transmit feature map data to the adjacent PE in the X-axis direction based on the clock signal, may transmit an operation result to the adjacent PE in the direction opposite to the X axis based on the clock signal, and may transmit kernel data to the adjacent PE along the Y axis based on the clock signal. The PEs 1110 may form a systolic array and may operate simultaneously based on the clock signal. The number of PEs 1110 may be determined based on the area, speed, and power of the neural network accelerator 1000, the amount of computation of the CNN 100, and the like.
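
The clocked hand-off between neighboring PEs in one row can be modeled as a shift register. The sketch below is a simplified assumption rather than the patented circuit: index 0 is taken to be the PE nearest the feature map memory, `shift_x` models feature map data moving in the X-axis direction, and `shift_back` models results moving in the opposite direction. Both function names are hypothetical.

```python
def shift_x(registers, incoming):
    """One clock cycle in the X direction: each PE passes its value to the
    neighbor farther from the memory, and PE 0 latches the incoming value."""
    return [incoming] + registers[:-1]


def shift_back(registers):
    """One clock cycle opposite to X: each PE passes its result to the
    neighbor nearer the memory. Returns (new registers, value leaving PE 0)."""
    return registers[1:] + [None], registers[0]


def run_row(stream, num_pes):
    """Feed `stream` into a row of `num_pes` PEs, one value per cycle,
    and record the register contents after every cycle."""
    registers = [None] * num_pes
    history = []
    for value in stream:
        registers = shift_x(registers, value)
        history.append(list(registers))
    return history
```

Feeding three values into a three-PE row shows the first value reaching the farthest PE on the third cycle, which is the pipelined behavior characteristic of a systolic array.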

The feature map I/O units 1120 may be arranged along the Y-axis direction. Each of the feature map I/O units 1120 may exchange data with the PEs 1110 that are arranged along the X-axis direction and correspond to one row. For example, each of the feature map I/O units 1120 may receive feature map data stored in the feature map memory 1200 and transmit the received feature map data to the PEs 1110 in the X-axis direction.

The PEs 1110 arranged along the X-axis direction may perform an operation on the received feature map data and generate an operation result (new feature map data, or a partial sum for generating new feature map data). Each of the feature map I/O units 1120 may then receive the operation result from the PEs 1110 in the direction opposite to the X axis and transmit the received operation result to the feature map memory 1200. Each of the feature map I/O units 1120 may also receive a partial sum from the feature map memory 1200 and transmit the received partial sum back to the PEs 1110 in the X-axis direction.

The kernel load units 1130 may be arranged along the X-axis direction. Each of the kernel load units 1130 may receive kernel data stored in the kernel memory 1300. Each of the kernel load units 1130 may transmit the received kernel data to the PEs 1110 that are arranged along the Y-axis direction and correspond to one column. The direction in which feature map data is transmitted (the X-axis direction or the direction opposite to the X axis) and the direction in which kernel data is transmitted (the Y-axis direction) may be perpendicular to each other.

FIG. 4 is a block diagram showing the PE of FIG. 3 in more detail. FIG. 4 will be described with reference to FIGS. 1 to 3. The PE 1110 of FIG. 4 may be any one of the PEs in the PE array 1100 of FIG. 3. That is, the PEs in the PE array 1100 may be implemented identically to one another. The PE 1110 may include a control register 1111, a feature map register 1112, a kernel register 1113, a multiplier 1114, an adder 1115, an accumulation register 1116, and an output register 1117.

The control register 1111 may store a command transmitted in the X-axis direction. The command may be generated by the controller 1400 of FIG. 2. Based on the stored command, the control register 1111 may control the other components in the PE 1110 (the feature map register 1112, the kernel register 1113, the multiplier 1114, the adder 1115, the accumulation register 1116, and the output register 1117). The control register 1111 may receive a command in a given cycle and transmit the stored command in the X-axis direction to another PE (for example, the PE on the right) in the next cycle. Here, a cycle may represent one period of the clock signal described above. As another example, the control register 1111 may receive a command at a rising edge or a falling edge of the clock signal.

The feature map register 1112 may receive and store, in a given cycle, feature map data transmitted in the X-axis direction. The feature map register 1112 may provide the stored feature map data to the multiplier 1114. In the next cycle, the feature map register 1112 may transmit the stored feature map data in the X-axis direction to another PE (for example, the PE on the right).

The kernel register 1113 may receive and store, in a given cycle, kernel data transmitted in the Y-axis direction. The kernel register 1113 may provide the stored kernel data to the multiplier 1114. In the next cycle, the kernel register 1113 may transmit the stored kernel data in the Y-axis direction to another PE (for example, the PE below). The control register 1111, the feature map register 1112, and the kernel register 1113 may be implemented in hardware using at least one flip-flop, at least one latch, at least one logic gate, and the like.

The multiplier 1114 may perform a multiplication operation on an input value of the received feature map data and a weight value of the received kernel data. For example, the multiplication result of the multiplier 1114 may be accumulated in the accumulation register 1116 rather than being transmitted directly to another PE. For example, once the PE 1110 has performed the activation operation, the normalization operation, and the pooling operation of the CNN 100, the operation result may be transmitted to an adjacent PE (for example, the PE on the left) in the direction opposite to the X axis.

The adder 1115 may perform an addition operation on the multiplication result of the multiplier 1114 and the previous operation result accumulated in the accumulation register 1116 in the previous cycle. The accumulation register 1116 may accumulate or store the addition result of the adder 1115. When the accumulation register 1116 receives a new addition result from the adder 1115, the addition result it previously held becomes the previous operation result. That is, the multiplication results of the multiplier 1114 may be accumulated in the accumulation register 1116.

In an embodiment, the PE 1110 may perform a multiply-accumulate (MAC) operation using the multiplier 1114, the adder 1115, and the accumulation register 1116. The multiplier 1114, the adder 1115, and the accumulation register 1116 may be implemented in hardware using at least one flip-flop, at least one latch, at least one logic gate, and the like. By repeatedly performing the multiply-accumulate operation, the PE 1110 may perform the convolution operation, the activation operation, and the normalization operation described above with reference to FIG. 1.
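
The multiply-accumulate behavior described above can be illustrated in software. The following is only a minimal sketch, assuming one feature map value and one kernel weight arrive per cycle; the function name `mac` and its parameters are hypothetical and do not appear in the original disclosure.

```python
def mac(feature_values, kernel_weights, acc=0):
    """Software sketch of a MAC loop; one (input, weight) pair per cycle is assumed."""
    for x, w in zip(feature_values, kernel_weights):
        product = x * w      # role of the multiplier 1114
        acc = acc + product  # role of the adder 1115: add to the previously accumulated value
        # 'acc' plays the role of the accumulation register 1116
    return acc


# Three cycles of accumulation for a three-element dot product.
result = mac([1, 2, 3], [4, 5, 6])
```

Passing a nonzero `acc` models a PE that starts from a previously stored partial sum instead of from zero.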

The convolution operation multiplies feature map data by kernel data and sums all of the multiplication results. The activation operation and the normalization operation may include multiplying the result of the convolution operation stored in the accumulation register 1116 by a specific value or adding a specific value to it. That is, the result of the convolution operation stored in the accumulation register 1116 may be provided to the multiplier 1114 (see the solid arrow) and to the adder 1115 (see the dotted arrow). For example, the PE 1110 including the multiplier 1114, the adder 1115, and the accumulation register 1116 may perform the activation operation using a ReLU function or a Leaky ReLU function.
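
As an illustration of how activation and normalization can reduce to one multiplication or addition on the accumulated value, consider the sketch below. It is an assumption-laden example: the function names, the slope value, and the scale-and-shift form of normalization are hypothetical choices, not details given in the disclosure.

```python
def leaky_relu(acc, slope=0.125):
    """Leaky ReLU on an accumulated value; slope=0.0 gives plain ReLU.

    The slope value is an assumed example, not taken from the source.
    """
    # Negative values are scaled by the slope (one multiplication);
    # non-negative values pass through unchanged.
    return acc if acc >= 0 else acc * slope


def scale_and_shift(acc, scale, shift):
    """A normalization-style step assumed here to be one multiply and one add."""
    return acc * scale + shift
```

Both functions use only the kinds of operations (multiply, add, compare) that the multiplier 1114 and the adder 1115 are described as providing.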

The output register 1117 may store the addition result of the adder 1115. The addition result stored in the output register 1117 is the operation result that is transmitted from the PE 1110 in the direction opposite to the X axis. The output register 1117 may also store an operation result transmitted in the direction opposite to the X axis from another PE (for example, the PE on the right).

In an embodiment, the PE 1110 may further perform the pooling operation of the CNN 100 using the multiplier 1114, the adder 1115, the accumulation register 1116, and the output register 1117. The PE 1110 may compare a new operation result, obtained by performing the convolution operation, the activation operation, and the normalization operation, with the previous operation result stored in the output register 1117. Here, the new operation result may be generated by the adder 1115. The adder 1115 may further include a comparator 1115_1 that performs a comparison operation on the new operation result and the previous operation result.

For a max pooling operation, the output register 1117 may store the larger of the new operation result and the previous operation result, based on the comparison result of the comparator 1115_1. For example, the output register 1117 may be updated with a new operation result that is larger than the previous operation result, or a previous operation result that is larger than the new operation result may be kept in the output register 1117 as it is.

For an average pooling operation, the adder 1115 may add the previous operation result and the new operation result and perform a division operation (for example, a shift operation). For example, the output register 1117 may be updated with the result of the division operation.
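
The two pooling modes can be sketched as follows. This is a simplified illustration under stated assumptions: results are integers, the average pooling window holds a power-of-two number of values, and the shift is applied once after all values are summed (the text only states that an addition is followed by a division such as a shift). The function names are hypothetical.

```python
def max_pool_step(previous, new):
    """Keep the larger value, as the comparator 1115_1 and output register 1117 might."""
    return new if new > previous else previous


def max_pool(results):
    """Fold a non-empty window of results through repeated compare-and-keep steps."""
    current = results[0]
    for value in results[1:]:
        current = max_pool_step(current, value)
    return current


def average_pool(results, shift):
    """Sum the window, then divide by 2**shift using a right shift (assumed integer data)."""
    total = 0
    for value in results:
        total += value  # role of the adder 1115
    return total >> shift  # division implemented as a shift


# A 2x2 window holds four values, so shift=2 divides by four.
window = [4, 8, 12, 16]
```

For the sample window, `max_pool(window)` keeps 16 and `average_pool(window, 2)` computes 40 >> 2.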

The PE 1110 according to an embodiment of the present invention may internally perform or process all of the convolution operation, the activation operation, the normalization operation, and the pooling operation of the CNN 100. As a result, the number of data exchanges among the PE array 1100, the feature map memory 1200, and the kernel memory 1300 of FIG. 2 may be minimized. Because the feature map memory 1200 and the kernel memory 1300 are accessed less frequently, the areas of the feature map memory 1200 and the kernel memory 1300 and the power consumption of the neural network accelerator 1000 may be improved.

FIG. 5 is a flowchart illustrating an example of how the PE of FIG. 4 receives an operation result in the X-axis direction in response to an output command. FIG. 5 will be described with reference to FIGS. 1 to 4.

In step S110, the PE 1110 of FIG. 4 may receive a command transmitted in the X-axis direction. In step S120, the PE 1110 may determine whether the received command is an output command. If the PE 1110 receives an output command, step S130 may be performed. If the PE 1110 does not receive an output command, the PE 1110 may process the operation corresponding to the command it did receive.

In step S130, the PE 1110 may activate a valid flag bit. For example, the PE 1110 may set the valid flag bit to 1. The logical value of an activated valid flag bit is, of course, not limited to this example. The output register 1117 may store the valid flag bit in addition to the operation result of the PE 1110. The valid flag bit stored in the output register 1117 may be transmitted, together with the operation result of the PE 1110, to another PE in the direction opposite to the X axis. The valid flag bit of the PE 1110 may be used by the adjacent PE located next to the PE 1110 along the direction opposite to the X axis to determine whether to store the operation result of the PE 1110 in its own output register. The valid flag bit of the PE 1110 may indicate whether the operation result of the PE 1110 is valid.

In step S140, the PE 1110 may determine whether it is located in the last column. Here, the last column may refer to the column, among the columns in which the PEs 1110 of FIG. 3 are arranged, that is farthest in the X-axis direction from the column in which the feature map input/output units 1120 of FIG. 3 are arranged. Unlike what is shown in FIG. 3, the feature map input/output units 1120 may instead be arranged in a row parallel to the X axis. In that case, in step S140, the PE 1110 may determine whether it is located in the last row.

For example, the PE 1110 may determine its own position based on address information provided by the controller 1400 of FIG. 2. Alternatively, the address information may be programmed in advance into the PEs located in the last column. A PE 1110 located in the last column may perform step S150. A PE 1110 not located in the last column may perform step S160.

In step S150, the PE 1110 may activate a last flag bit. For example, the PE 1110 may set the last flag bit to 1. The logical value of an activated last flag bit is, of course, not limited to this example. The output register 1117 may store the last flag bit in addition to the operation result and the valid flag bit of the PE 1110. The last flag bit stored in the output register 1117 may be transmitted, together with the operation result and the valid flag bit of the PE 1110, to another PE in the direction opposite to the X axis. The last flag bit may indicate whether the PE 1110 is located in the column farthest from the feature map memory 1200 in the X-axis direction.
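
Steps S130 to S150 amount to tagging the outgoing operation result with two flag bits. The sketch below shows one possible representation; the `OutputEntry` type, the function name, and the zero-based column index are assumptions made for illustration and are not part of the disclosure.

```python
from dataclasses import dataclass


@dataclass
class OutputEntry:
    """Hypothetical view of what the output register 1117 holds."""
    result: int
    valid: bool = False
    last: bool = False


def on_output_command(own_result, column_index, num_columns):
    """Build the entry a PE stores when it receives an output command."""
    entry = OutputEntry(result=own_result)
    entry.valid = True                    # S130: activate the valid flag bit
    if column_index == num_columns - 1:   # S140: is this PE in the last column?
        entry.last = True                 # S150: activate the last flag bit
    return entry
```

A PE in the last of four columns would produce an entry with both flags set, while a PE in any other column would set only the valid flag.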

In step S160, the PE 1110 may store, in its own output register, the operation result, the valid flag bit, and the last flag bit held in the output register of the adjacent PE located next to the PE 1110 along the X-axis direction. The PE 1110 may receive the operation result, the valid flag bit, and the last flag bit transmitted from the adjacent PE in the direction opposite to the X axis.

In step S170, the PE 1110 may determine whether the received last flag bit is activated. For example, the PE 1110 may determine whether the logical value of the received last flag bit is 1. If the last flag bit is not activated, the PE 1110 may perform step S160 again. If the last flag bit is activated, the PE 1110 may perform step S180. That is, until it receives an activated last flag bit from the adjacent PE, the PE 1110 may repeatedly receive and store the operation result, the valid flag bit, and the last flag bit that are held in the output register of the adjacent PE and transmitted in the direction opposite to the X axis.

In step S180, the PE 1110 may deactivate both its own valid flag bit and its own last flag bit based on the activated last flag bit. For example, the PE 1110 may set both its valid flag bit and its last flag bit to 0. A PE 1110 that performs step S180 may no longer receive operation results from the adjacent PE. The PE 1110 may deactivate its valid flag bit and last flag bit so that it can receive another output command. That is, the PE 1110 may reset its valid flag bit and last flag bit.

FIG. 6 illustrates an example of a first PE and a second PE that receive an output command in the X-axis direction. FIG. 6 will be described with reference to FIGS. 1 to 5. In FIG. 6, the first PE and the second PE may be arranged along the X axis. The first PE may be located next to the second PE along the direction opposite to the X axis, and conversely the second PE may be located next to the first PE along the X-axis direction. The first and second PEs may be adjacent to each other. Each of the first and second PEs may be implemented in the same manner as the PE 1110 of FIG. 4.

The first PE may include a first output register, which is an output register 1117, and the second PE may include a second output register, which is an output register 1117. The first output register may store a first valid flag bit and a first last flag bit. The second output register may store a second valid flag bit and a second last flag bit.

In the first cycle, the first PE may receive an output command. The first PE may activate the first valid flag bit based on the output command (see step S130).

In the second cycle, an adjacent PE located next to the first PE along the direction opposite to the X axis, or the feature map input/output unit 1120, may receive and store, based on the first valid flag bit and the first last flag bit, first PE data including a first operation result, the first valid flag bit, and the first last flag bit (see step S160). The first PE may transmit the output command to the second PE, and the second PE may receive the output command. The second PE may activate the second valid flag bit or the second last flag bit based on the output command (see steps S130 and S150).

In the third cycle, the first PE may receive and store, based on the second valid flag bit and the second last flag bit, second PE data including a second operation result, the second valid flag bit, and the second last flag bit (see step S160). The second PE may transmit the output command to an adjacent PE (for example, a third PE) located next to the second PE along the X-axis direction.

In an embodiment, the first PE need not activate the first valid flag bit only in the first cycle. For example, the first PE may activate the first valid flag bit in the second cycle, or between the first cycle and the second cycle. Similarly, the second PE need not activate the second valid flag bit or the second last flag bit only in the second cycle. For example, the second PE may activate the second valid flag bit or the second last flag bit in the third cycle, or between the second cycle and the third cycle.

In the fourth cycle, the second PE may receive and store third PE data of the adjacent PE located next to the second PE along the X-axis direction (see step S160). Another PE located next to the first PE, or the feature map input/output unit 1120, may receive and store the second PE data that was transmitted to the first PE (see step S160).

Referring to FIG. 6, the first PE does not output PE data in every cycle. The first PE may output the first PE data in the second cycle and the second PE data in the fourth cycle. In summary, the operation results of the PE array 1100 may be transmitted to the feature map memory 1200 over a number of cycles equal to twice the number of columns in which the PEs 1110 are arranged.
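
The factor of two can be checked with a short calculation. The sketch below assumes that the output command advances one PE per cycle in the X-axis direction, that a PE's data begins moving in the cycle after the PE receives the command, and that data moves one hop per cycle in the opposite direction. The function name and the 1-based column numbering are hypothetical.

```python
def output_arrival_cycles(num_columns):
    """Cycle in which each column's PE data reaches the feature map input/output unit."""
    arrivals = {}
    for k in range(1, num_columns + 1):
        command_cycle = k  # the output command reaches column k in cycle k
        hops_back = k      # k hops against the X axis back to the input/output unit
        arrivals[k] = command_cycle + hops_back
    return arrivals


# For three columns the data arrives in cycles 2, 4, and 6.
arrivals = output_arrival_cycles(3)
```

Under these assumptions the first PE data leaves the first PE in cycle 2 and the second PE data in cycle 4, matching the description of FIG. 6, and the last column's data arrives in cycle 2N for N columns.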

FIG. 7 is a flowchart illustrating an example of how the PE of FIG. 4 transmits a partial sum in the direction opposite to the X axis in response to a load partial sum command or a pass partial sum command. FIG. 7 will be described with reference to FIGS. 1 to 4. The direction in which the PE 1110 transmits a partial sum in response to a load partial sum command or a pass partial sum command is opposite to the direction in which the PE 1110 transmits an operation result in response to an output command.

In step S210, the PE 1110 of FIG. 4 may receive a command transmitted in the X-axis direction. In step S220, the PE 1110 may determine whether the received command is a load partial sum command. If the PE 1110 receives a load partial sum command, step S230 may be performed. If the PE 1110 does not receive a load partial sum command, step S250 may be performed.

In step S230, in response to the load partial sum command, the PE 1110 may store the partial sum transmitted together with the load partial sum command in the X-axis direction. Here, the received partial sum may be stored in the accumulation register 1116 of the PE 1110, not in the feature map register 1112. The feature map register 1112 may store feature map data, and the accumulation register 1116 may store a partial sum, which is an intermediate operation result for generating new feature map data.

In step S240, the PE 1110 may receive a pass partial sum command transmitted in the X-axis direction in the cycle following the one in which it received the command in step S210. In response to the pass partial sum command, the PE 1110 may temporarily store the partial sum transmitted together with the pass partial sum command. The partial sum transmitted with the pass partial sum command is intended for one of the other PEs located along the X-axis direction. In the cycle after receiving the pass partial sum command, the control register 1111 of the PE 1110 may transmit, in the X-axis direction to the adjacent PE, a load partial sum command in place of the pass partial sum command, together with the temporarily stored partial sum.

In step S250, the PE 1110 may determine whether the command received in step S210 is a pass partial sum command. If the PE 1110 receives a pass partial sum command, step S260 may be performed. Here, the point in time (that is, the cycle) at which the pass partial sum command is transmitted to the PE 1110 in step S250 differs from the point in time at which the pass partial sum command is transmitted to the PE 1110 in step S240. For example, in step S240 the PE 1110 may receive a pass partial sum command in the cycle following the load partial sum command. After receiving the pass partial sum command in step S240, the PE 1110 may receive a further pass partial sum command in the next cycle (that is, in step S250). If the PE 1110 does not receive a pass partial sum command, the PE 1110 may process the operation corresponding to the command it did receive.

In step S260, the PE 1110 may transmit the pass partial sum command and the partial sum to the adjacent PE in the X-axis direction. As described above, the partial sum transmitted with the pass partial sum command is intended not for the PE 1110 but for one of the other PEs located along the X-axis direction. Whereas in step S240 the control register 1111 may transmit a load partial sum command and the partial sum to the adjacent PE, in step S260 the control register 1111 may transmit the pass partial sum command and the partial sum to the adjacent PE.
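
The decision logic of steps S220 to S260 can be sketched as a small per-PE handler. This is an illustrative model only: the class name, the string command codes, and the single `converted` state bit used to tell the first pass partial sum command (S240) from later ones (S260) are assumptions, since the text distinguishes the two cases only by their order of arrival.

```python
LOAD, PASS = "LOAD", "PASS"  # hypothetical command codes


class PartialSumPE:
    """Hypothetical model of one PE's handling of partial sum commands."""

    def __init__(self):
        self.accumulator = None  # stands in for the accumulation register 1116
        self.converted = False   # True once the first PASS has been turned into LOAD

    def handle(self, command, partial_sum):
        """Return the (command, partial_sum) to forward in the next cycle, or None."""
        if command == LOAD:
            # S220 -> S230: this partial sum is for this PE; store it.
            self.accumulator = partial_sum
            return None
        if command == PASS:
            if not self.converted:
                # S240: the first PASS is forwarded as LOAD for the adjacent PE.
                self.converted = True
                return (LOAD, partial_sum)
            # S250 -> S260: later PASS commands are forwarded unchanged.
            return (PASS, partial_sum)
        return None  # any other command is outside the scope of this sketch
```

Feeding a PE the sequence LOAD, PASS, PASS stores the first partial sum locally, forwards the second as a load partial sum command, and forwards the third as a pass partial sum command.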

FIG. 8 illustrates an example of first to third PEs that receive a load partial sum command and a pass partial sum command in the direction opposite to the X axis. FIG. 8 will be described with reference to FIGS. 1 to 4 and FIG. 7. In FIG. 8, the first to third PEs may be arranged along the X axis. Along the direction opposite to the X axis, the first PE may be located next to the second PE, and the second PE may be located next to the third PE. Along the X-axis direction, the second PE may be located next to the first PE, and the third PE may be located next to the second PE. The first and second PEs may be adjacent to each other, and the second and third PEs may be adjacent to each other. Each of the first to third PEs may be implemented in the same manner as the PE 1110 of FIG. 4.

In the first cycle, the first PE may receive a first load partial sum command LC1 and a first partial sum PS1 transmitted in the X-axis direction. The first PE may store the first partial sum PS1 in its accumulation register 1116 (see step S230).

In the second cycle, the first PE may receive a second pass partial sum command PC2 and a second partial sum PS2 transmitted in the X-axis direction. The second partial sum PS2 is intended for the second PE, not the first PE. The first PE may temporarily store the second partial sum PS2 so that it can be transmitted to the second PE.

In the third cycle, the first PE may transmit to the second PE, in the X-axis direction, a second load partial sum command LC2 in place of the second pass partial sum command PC2 received in the second cycle, together with the second partial sum PS2 (see step S240). The second PE may store the received second partial sum PS2 (see step S230). In addition, the first PE may receive a third pass partial sum command PC3 and a third partial sum PS3 in the X-axis direction. The third partial sum PS3 is intended for the third PE, not the first or second PE. The first PE may temporarily store the third partial sum PS3 so that it can be transmitted to the third PE (see step S260).

In the fourth cycle, the second PE may receive the third pass partial sum command PC3 and the third partial sum PS3 transmitted from the first PE in the X-axis direction. Similarly, the second PE may temporarily store the third partial sum PS3 so that it can be transmitted to the third PE.

In the fifth cycle, the second PE may transmit to the third PE, in the X-axis direction, a third load partial sum command LC3 in place of the third pass partial sum command PC3 received in the fourth cycle, together with the third partial sum PS3 (see step S240). The third PE may store the received third partial sum PS3 (see step S230).

Referring to FIG. 8, load partial sum instructions are not delivered to the first to third PEs in every cycle. The first PE may receive the first load partial sum instruction LC1 in the first cycle, the second PE may receive the second load partial sum instruction LC2 in the third cycle, and the third PE may receive the third load partial sum instruction LC3 in the fifth cycle. In summary, the partial sums stored in the feature map memory 1200 may be transferred to the PE array 1100 over a number of cycles equal to twice the number of columns in which the PEs 1110 are disposed.
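The timing described above can be sketched as a small simulation of one row of PEs. This is only an illustrative model inferred from the five-cycle walk-through, not the patented circuit: the function name `simulate_partial_sum_load`, the one-cycle hop latency, and the rule that each PE converts only the first pass instruction it receives after its own load are assumptions made for the sketch.

```python
def simulate_partial_sum_load(num_columns):
    """Return {pe: (cycle, partial_sum_index)} for one row of PEs (1-based)."""
    # Instructions issued by the feature map I/O unit, one per cycle:
    # LC1 with PS1 first, then PC2 with PS2, PC3 with PS3, ...
    stream = [("load" if k == 1 else "pass", k) for k in range(1, num_columns + 1)]
    in_flight = [None] * num_columns   # what PE i emitted last cycle toward PE i+1
    converted = [False] * num_columns  # has PE i already turned a pass into a load?
    stored = {}
    cycle = 0
    while len(stored) < num_columns:
        cycle += 1
        # Arrivals: PE 0 reads the stream; PE i reads PE i-1's previous output.
        arrivals = [stream.pop(0) if stream else None] + in_flight[:-1]
        in_flight = [None] * num_columns
        for i, msg in enumerate(arrivals):
            if msg is None:
                continue
            kind, ps = msg
            if kind == "load":
                # Load instruction: this PE keeps the partial sum.
                stored[i + 1] = (cycle, ps)
            elif (i + 1) in stored and not converted[i]:
                # First pass after the PE's own load: forward it as a load.
                converted[i] = True
                in_flight[i] = ("load", ps)
            else:
                # Later pass instructions are forwarded unchanged.
                in_flight[i] = ("pass", ps)
    return stored
```

Under these assumptions, `simulate_partial_sum_load(3)` has the first, second, and third PEs storing their partial sums in cycles 1, 3, and 5, matching the walk-through. In general the k-th PE stores its partial sum in cycle 2k - 1, which is consistent with the roughly two cycles per column stated above.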

FIG. 9 is a block diagram illustrating the PE array of FIG. 2 according to another embodiment of the present invention. FIG. 9 will be described with reference to FIGS. 1 to 8. The PE array 2100 may include PEs 2110, feature map input/output units 2120, kernel load units 2130, activation units 2140, and multiplexing units 2150. The PEs 2110, the feature map input/output units 2120, and the kernel load units 2130 may be implemented in substantially the same manner as the PEs 1110, the feature map input/output units 1120, and the kernel load units 1130 of FIG. 3.

The activation units 2140 may be disposed along the Y axis between the feature map input/output units 2120 and the PEs 2110. The activation units 2140 may receive operation results from the PEs 2110 based on the steps of the flowchart of FIG. 5. The activation units 2140 may perform an activation operation using not only the ReLU function or Leaky ReLU function used by each of the PEs 2110 but also a Sigmoid function or a tanh (hyperbolic tangent) function. The activation units 2140 may transmit the result of the activation operation to the multiplexing units 2150.

The multiplexing units 2150 may be disposed along the Y axis between the activation units 2140 and the feature map input/output units 2120. Under the control of the controller 1400 of FIG. 2, which depends on the operation performed by the CNN 100, the multiplexing units 2150 may select either the operation result from the PEs 2110 or the operation result of the activation units 2140. The selected operation result may be transmitted to the feature map input/output units 2120.
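The selection between the two results can be illustrated with a minimal functional sketch. It models the data path for a single value only. The names `activation_unit` and `multiplexing_unit`, the boolean select signal standing in for the controller 1400, and the Leaky ReLU slope of 0.01 are assumptions for illustration and do not come from the specification.

```python
import math


def activation_unit(x, kind, slope=0.01):
    """Apply one of the activation functions named in the description."""
    if kind == "relu":
        return max(0.0, x)
    if kind == "leaky_relu":
        # slope is a hypothetical value; the specification does not give one.
        return x if x > 0 else slope * x
    if kind == "sigmoid":
        return 1.0 / (1.0 + math.exp(-x))
    if kind == "tanh":
        return math.tanh(x)
    raise ValueError(f"unknown activation: {kind}")


def multiplexing_unit(pe_result, use_activation_unit, kind="sigmoid"):
    """Select the PE result or the activation unit's result for write-back."""
    if use_activation_unit:
        return activation_unit(pe_result, kind)
    # The PE result is passed through unchanged, for example when the PE
    # has already applied ReLU or Leaky ReLU itself.
    return pe_result
```

For example, `multiplexing_unit(0.0, True, "sigmoid")` returns 0.5, while `multiplexing_unit(-2.0, False)` passes the PE result through unchanged.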

The foregoing describes specific examples for carrying out the present invention. The present invention encompasses not only the embodiments described above but also embodiments that can be obtained through simple design changes or easy modification. The present invention also encompasses techniques that can be readily modified and practiced in the future using the embodiments described above.

1000: neural network accelerator;
1100: PE array;
1200: feature map memory;
1300: kernel memory;
1400: controller;

Claims

1. A neural network accelerator for performing an operation of a neural network including layers, the neural network accelerator comprising:
a kernel memory for storing kernel data related to a filter;
a feature map memory for storing feature map data that are outputs of the layers; and
a PE array comprising processing elements (PEs) disposed along a first direction and a second direction,
wherein each of the PEs performs an operation using the feature map data transmitted in the first direction from the feature map memory and the kernel data transmitted in the second direction from the kernel memory, and transmits a result of the operation to the feature map memory in a third direction opposite to the first direction.

2. The neural network accelerator of claim 1,
wherein the operation includes a multiplication operation, an addition operation, an activation operation, a normalization operation, and a pooling operation.

3. The neural network accelerator of claim 2,
wherein the PE array further comprises:
kernel load units for transmitting the kernel data to the PEs in the second direction; and
feature map input/output units for transmitting the feature map data to the PEs in the first direction, receiving the result of the operation transmitted in the third direction, and transmitting the result of the operation to the feature map memory.

4. The neural network accelerator of claim 3,
wherein each of the PEs comprises:
a control register for storing an instruction transmitted in the first direction;
a kernel register for storing the kernel data;
a feature map register for storing the feature map data;
a multiplier for performing a multiplication operation on the data stored in the kernel register and the feature map register;
an adder for performing an addition operation on a multiplication result of the multiplier and a previous operation result;
an accumulation register for accumulating the previous operation result or an addition result of the adder; and
an output register for storing the addition result or an operation result transmitted from another PE in the third direction.

5. The neural network accelerator of claim 4,
wherein each of the PEs performs the activation operation using a ReLU (Rectified Linear Unit) function or a Leaky ReLU function.

6. The neural network accelerator of claim 4,
wherein the adder compares the operation result stored in the output register with the accumulation result stored in the accumulation register, and updates the operation result stored in the output register based on a result of the comparison.

7. The neural network accelerator of claim 4,
wherein the feature map input/output units further receive, based on an output instruction, new feature map data or a partial sum for generating the new feature map data from the PEs, and further transmit the new feature map data or the partial sum to the feature map memory.

8. The neural network accelerator of claim 7,
wherein the output register further stores a valid flag bit indicating whether the operation result is valid, and a last flag bit indicating whether the PE is disposed in the column farthest from the feature map memory.

9. The neural network accelerator of claim 8,
wherein the PEs include a first PE and a second PE positioned next to the first PE in the first direction, and
wherein the first PE receives, in the third direction, the operation result, the valid flag bit, and the last flag bit of the second PE based on the valid flag bit and the last flag bit of the second PE.

10. The neural network accelerator of claim 9,
wherein the first PE repeatedly receives, in the third direction, the operation result, the valid flag bit, and the last flag bit of the second PE until the last flag bit of the second PE is activated.

11. The neural network accelerator of claim 7,
wherein the feature map input/output units further receive the partial sum from the feature map memory based on a load partial sum instruction and a pass partial sum instruction, and further transmit the partial sum to the PEs in the first direction.

12. The neural network accelerator of claim 11,
wherein the accumulation register of each of the PEs stores the partial sum transmitted in the first direction in response to the load partial sum instruction being transmitted to the control register.

13. The neural network accelerator of claim 12,
wherein the PEs include a first PE and a second PE positioned next to the first PE in the first direction, and
wherein the first PE, upon receiving the load partial sum instruction and then the pass partial sum instruction, transmits the load partial sum instruction to the second PE in place of the received pass partial sum instruction.

14. The neural network accelerator of claim 13,
wherein the first PE transmits, to the second PE in the first direction, the partial sum received together with the pass partial sum instruction.

15. A neural network accelerator for performing an operation of a neural network including layers, the neural network accelerator comprising:
a kernel memory for storing kernel data related to a filter;
a feature map memory for storing feature map data that are outputs of the layers; and
a PE array comprising processing elements (PEs) disposed along a first direction and a second direction, and activation units disposed between the PEs and the feature map memory along the second direction,
wherein each of the PEs performs a first operation using the feature map data transmitted in the first direction from the feature map memory and the kernel data transmitted in the second direction from the kernel memory, and transmits a first operation result to the activation units in a third direction opposite to the first direction, and
wherein the activation units perform a second operation on the first operation result and transmit a second operation result to the feature map memory.

16. The neural network accelerator of claim 15,
wherein the first operation includes a multiplication operation and an addition operation.

17. The neural network accelerator of claim 16,
wherein the second operation includes an activation operation using a ReLU (Rectified Linear Unit) function, a Leaky ReLU function, a Sigmoid function, or a tanh (hyperbolic tangent) function, a normalization operation, and a pooling operation.

18. The neural network accelerator of claim 17,
wherein the PE array further comprises:
kernel load units for transmitting the kernel data to the PEs in the second direction;
feature map input/output units for transmitting the feature map data to the PEs in the first direction, receiving the first operation result or the second operation result transmitted in the third direction, and transmitting the received operation result to the feature map memory; and
multiplexing units for selecting one of the first operation result and the second operation result.

19. The neural network accelerator of claim 18,
wherein, when each of the PEs performs the activation operation using the ReLU function or the Leaky ReLU function, the multiplexing units select the first operation result, and the received operation result is the first operation result.

20. The neural network accelerator of claim 18,
wherein, when the PEs do not perform the activation operation, the multiplexing units select the second operation result, and the received operation result is the second operation result.