KR102667134B1

KR102667134B1 - Hardware accelerator for neural network including a single port memory and operation method thereof

Info

Publication number: KR102667134B1
Application number: KR1020180156562A
Authority: KR
Inventors: 이동현; 이혁재
Original assignee: 서울대학교산학협력단
Priority date: 2018-12-07
Filing date: 2018-12-07
Publication date: 2024-05-20
Also published as: KR20200069477A

Abstract

The present invention configures the on-chip memory of a convolution layer in a CNN hardware accelerator as single-port memory without double buffering, which greatly reduces the on-chip memory area compared to the prior art without reducing accelerator performance.

Description

Neural network hardware accelerator including single port memory and method of operation thereof {HARDWARE ACCELERATOR FOR NEURAL NETWORK INCLUDING A SINGLE PORT MEMORY AND OPERATION METHOD THEREOF}

The present invention separates the read and write access timings of a memory by changing the raster scan order, which is the operation order used by conventional neural network hardware accelerators.

The present invention builds a CNN (Convolutional Neural Network) accelerator with single-port memory instead of the dual-port memory used by conventional CNN accelerators, thereby reducing the share of the accelerator's total chip area occupied by on-chip memory.

A CNN hardware accelerator is used to speed up the large amount of computation in a CNN. Compared with a GPU (graphics processing unit), which is commonly used to run CNNs, a CNN hardware accelerator operates with fewer hardware resources and lower power and can therefore accelerate a CNN efficiently.

A CNN hardware accelerator requires on-chip memory to temporarily store the input data for convolution operations and the results of those operations.

The on-chip memory is located close to the convolution operator so that convolution operations can run at high speed, but its capacity is very small compared with external memory.

In general, the more data is stored in the on-chip memory of a CNN hardware accelerator, the fewer external memory accesses are needed and the faster the accelerator operates.

Because a conventional CNN hardware accelerator performs convolution in raster scan order, the order in which an image is input, a read from the on-chip memory and a write that stores data received from the previous convolution layer occur simultaneously in every clock cycle, so dual-port memory is used.

Even when single-port memory is used, double buffering is required, so it occupies the same area as dual-port memory.

Figure 1 is a block diagram showing a conventional hardware accelerator.

In Figure 1, Li denotes the width of the input feature map supplied to the i-th convolution layer, and ifmap and ofmap denote the input feature map and the output feature map, respectively.

Ni denotes the number of ifmap channels at Li, Mi denotes the number of filters, and Hi and Wi denote the height and width of the convolution filter.

The ifmap SRAM temporarily stores the incoming ifmap and passes it to the ifmap register (REG).

The ifmap SRAM includes Hi-1 banks, and each bank stores Li x Ni words of the feature map.

The ifmap register (REG) is a register-based memory that holds the ifmap used in the convolution operation; it stores Hi x Wi x Ni features.

The filter register (REG) stores Hi x Wi x Ni x Mi filter coefficients for the convolution operation.

The features and coefficients stored in the ifmap register and the filter register are supplied to the MAC (Multiplication and Accumulation) unit, which performs the convolution operation and produces an Mi-channel feature map.

The ofmap register stores the output feature map (ofmap), which is provided as input to the next convolution layer.
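The storage figures above can be tallied with a short sketch. The function and key names are illustrative and do not come from the patent; only the formulas (Hi-1 banks of Li x Ni words, Hi x Wi x Ni features, Hi x Wi x Ni x Mi coefficients) are taken from the description.

```python
def conventional_storage(H, W, N, M, L):
    """Word counts for the conventional accelerator described for Figure 1.

    H, W: filter height and width; N: ifmap channels;
    M: number of filters; L: ifmap width.
    All parameter and key names are illustrative.
    """
    return {
        "sram_banks": H - 1,                  # Hi-1 banks
        "words_per_bank": L * N,              # Li x Ni words per bank
        "ifmap_sram_words": (H - 1) * L * N,  # total ifmap SRAM capacity
        "ifmap_reg_words": H * W * N,         # features held for one window
        "filter_reg_words": H * W * N * M,    # filter coefficients
    }


sizes = conventional_storage(H=3, W=3, N=1, M=2, L=5)
print(sizes)
```

For a 3x3 filter the ifmap SRAM holds two lines of the feature map, which is the line-buffer capacity the later sections keep while changing the port type.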

Figure 2 is an algorithm describing the convolution operation performed in the convolution layer of Figure 1.

In the algorithm of Figure 2, Ki denotes the height of the ifmap of Li.

In the conventional ifmap SRAM structure, the ifmap stream is stored in the ifmap SRAM in raster scan order and output in the same order.

In this ifmap SRAM, Hi-2 SRAM banks perform only read operations, and one SRAM bank performs both write and read operations.

Each bank uses dual-port SRAM to satisfy the throughput requirements of the write and read operations.

However, only one of the Hi-1 write ports can be accessed at a time and the other Hi-2 write ports must remain idle, which wastes a significant amount of hardware resources.
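The port waste described above is simple to quantify. The following is a minimal sketch with a hypothetical helper name; it only restates the counts given in the text (Hi-1 dual-port banks, one active write port, Hi-2 idle write ports).

```python
def write_port_usage(H):
    """Return (banks, active_write_ports, idle_write_ports) for filter height H.

    Illustrative helper: the conventional ifmap SRAM has H-1 dual-port banks,
    but only one bank is written in any given cycle.
    """
    if H < 2:
        raise ValueError("filter height must be at least 2")
    banks = H - 1
    active = 1
    idle = banks - active  # H - 2 write ports sit idle
    return banks, active, idle


for H in (3, 5, 7):
    print(H, write_port_usage(H))
```

The idle count grows with the filter height, so taller filters waste a larger fraction of the dual-port write capability.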

The present invention provides a CNN hardware accelerator that includes single-port memory so that the on-chip memory area is reduced, and a method of operating the same.

The present invention provides a memory structure and an access scheduling technique that avoid wasting memory ports, and thus hardware resources, while maintaining real-time processing speed.

A hardware accelerator according to an embodiment of the present invention includes single-port memory.

Figure 1 is a block diagram showing a hardware accelerator according to the prior art.
Figure 2 is an algorithm showing a convolution operation according to the prior art.
Figure 3 is an explanatory diagram illustrating a convolution operation according to an embodiment of the present invention.
Figure 4 is an algorithm showing a convolution operation according to an embodiment of the present invention.
Figure 5 is a block diagram showing an ifmap register according to an embodiment of the present invention.
Figure 6 is a block diagram showing a hardware accelerator according to an embodiment of the present invention.
Figures 7 and 8 are explanatory diagrams illustrating a convolution operation according to an embodiment of the present invention.
Figure 9 is a graph illustrating the effect of the present invention.

Hereinafter, embodiments of the present invention are disclosed with reference to the attached drawings. The following disclosure describes embodiments for the purpose of explaining the present invention, and the scope of the present invention is not limited to these embodiments.

A. Partially-vertical order convolution

To reduce the size of the buffer that stores temporary calculation results, the present invention changes the layer processing order of the CNN hardware accelerator.

Figure 3 shows the operation of a convolution layer to which the partial vertical order is applied.

Figures 3(a) and 3(b) illustrate the input order and the output order, respectively.

For convenience of explanation, it is assumed that Ki = 12, Li = 5, and the filter size is 3x3, that is, Hi = Wi = 3, and that zero padding is applied where there is no input data at the image boundary.

In Figures 3(a) and 3(b), the number in each box indicates the processing cycle and the arrows indicate the processing order.

Figure 3(c) shows the read and write operations on the buffer (for example, SRAM) during clock cycles 30 to 45.

The buffer stores the input data (ifmap), which is filtered by the 3x3 kernel of the convolution layer.

In cycles 30 and 31, input features (6,3) and (7,3), which are output from the previous layer, are stored in the ifmap SRAM.

From cycle 32 to cycle 35, input features (0,4) to (3,4) are stored in the ifmap register, and the ifmap SRAM is idle.

A synchronous SRAM performs one read operation per cycle, so input feature (6,0) is read from the ifmap SRAM in cycle 36, stored in ifmap REG in cycle 37, and processed by the 3x3 filter.

In the same cycle, input feature (4,4) is output from the previous layer and stored in ifmap REG.

In cycle 37, input feature (7,0) is read from the ifmap SRAM, and input feature (5,4) is received from the previous layer and stored in ifmap REG.

Input features (6,4) and (7,4) are read from the ifmap SRAM in cycles 38 and 39 and stored in ifmap REG.

In cycles 40 to 43, input features (8,0) to (11,0) are stored in ifmap REG, and the ifmap SRAM is idle.

In cycles 44 and 45, input features (6,1) and (7,1) are read from the ifmap SRAM and stored in ifmap REG.

In cycle 48, a 3x3 filtering operation is performed using (6,0), (7,0), (8,0), (6,1), (7,1), and (8,1), producing output feature (7,0).

Zero padding is applied for (6,-1), (7,-1), and (8,-1), as shown in Figure 3(b).

The filtering result is passed to the next convolution layer in cycle 49.

In cycle 49, the filtering operation is performed for (8,0); the result is passed to the next convolution layer in cycle 50 and stored in ofmap REG.

Input feature (6,0) is used to generate output features (5,0), (6,0), and (7,0).

In the processing order of Figure 3(b), (5,0) and (6,0) are processed consecutively in cycles 14 and 15, but (7,0) does not follow immediately. Instead, (7,0) is processed in cycle 33, after (6,0) has been processed.

Generating (7,0) requires (6,0), (7,0), and (8,0).

Because (6,0) and (7,0) are needed to generate (5,0) and (6,0), they are written back from ifmap REG to the ifmap SRAM. Similarly, (7,0) is also written back to the ifmap SRAM.

In Figure 3(a), the input features in gray boxes are stored in the ifmap SRAM, and those in white boxes are not.

At most one read or one write is performed in each cycle.

In Figure 3, the read and write pattern repeats every 8 cycles, and the convolution in vertical order runs for 8 cycles.

However, the ifmap SRAM has 4 idle cycles, which can be eliminated by reducing the partial vertical order convolution period to 4.

The partial vertical order convolution period is denoted pd_conv and is the number of clock cycles spent on the convolution operation for each column.
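The scan order described in this section can be sketched as a generator. This is an illustration of the ordering only, assuming one feature per cycle and no stalls; the function name is hypothetical and the patent's own algorithm is the one in Figure 4. The parameters match the Figure 3 example (Ki = 12, Li = 5) with a period of 8 rows per column.

```python
def partial_vertical_order(K, L, pd_conv):
    """Yield (h, w) coordinates in partially-vertical order.

    The K x L map is cut into horizontal portions of pd_conv rows.
    Inside a portion, each column is scanned top to bottom before
    moving to the next column. Names are illustrative.
    """
    for top in range(0, K, pd_conv):
        rows = range(top, min(top + pd_conv, K))
        for w in range(L):
            for h in rows:
                yield (h, w)


order = list(partial_vertical_order(K=12, L=5, pd_conv=8))
cycle_of = {hw: c for c, hw in enumerate(order)}

# Positions of a few features in the sequence.
for hw in [(6, 3), (7, 3), (4, 4), (8, 0)]:
    print(hw, cycle_of[hw])
```

Within the first portion, the sequence index of a feature is w x pd_conv + h, which agrees with the text placing (6,3) and (7,3) in cycles 30 and 31.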

B. Proposed algorithm

Figure 4 shows the partial vertical order convolution algorithm for the i-th convolution layer.

For the i-th convolution layer, the filter size is written Hi x Wi, where Hi is the height of the filter (number of rows) and Wi is its width (number of columns).

Here the batch size is denoted by Mi and Ni, which are assumed to be 1 for convenience.

h_temp expresses the change of processing order in the partial vertical order convolution.

h_temp increases from 0 to pd_conv, which is chosen as 2 x (max(Hi)-1) for i >= 1. Here max(Hi) is the largest filter kernel height among all convolution layers except the 0th.

The value of pd_conv must be chosen so that the ifmap SRAM performs at most one write or one read per cycle.

For all inner convolution layers and the last convolution layer, Hi-1 of the pd_conv lines of the feature map are stored in the ifmap SRAM.
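A small sketch of the period selection follows. The formula 2 x (max(Hi)-1) is from the text; the accounting of Hi-1 reads plus Hi-1 writes per column is an interpretation inferred from the Figure 3 example (two writes, two reads, four idle cycles in a period of 8), not something the text states as a formula. Function names are hypothetical.

```python
def choose_pd_conv(filter_heights):
    """pd_conv = 2 x (max(Hi) - 1) over the layers i >= 1."""
    return 2 * (max(filter_heights) - 1)


def sram_idle_cycles(pd_conv, H):
    """Idle ifmap SRAM cycles per column of one portion.

    Assumption: each column needs H-1 reads (rows kept from the previous
    portion) and H-1 writes (rows kept for the next portion), one per cycle.
    """
    accesses = 2 * (H - 1)
    if accesses > pd_conv:
        raise ValueError("single port cannot serve all accesses in one period")
    return pd_conv - accesses


pd = choose_pd_conv([3, 3, 3])
print(pd, sram_idle_cycles(8, 3), sram_idle_cycles(pd, 3))
```

Under this accounting a period of 8 leaves four idle cycles for a 3x3 filter and a period of 4 leaves none, consistent with the Figure 3 discussion.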

Equation 1 gives the write and read timing for the two-dimensional coordinate (h,w) of the ifmap SRAM in this embodiment.

In Equation 1, mod(a,b) is the modulo operation, that is, the remainder when a is divided by b.

This embodiment stores Hi-1 lines, the same number as in the conventional case.

However, using single-port SRAM greatly reduces the area compared with the conventional dual-port SRAM.

C. Register that stores the input to the MAC unit

For each convolution layer, ifmap features are stored in a register; ifmap REG holds the input supplied to the MAC unit for the convolution operation.

As described above, some ifmap features are provided by the previous convolution layer and some are read from the ifmap SRAM.

The number of words stored in ifmap REG is given by Equation 2.

Figure 5 illustrates the structure of ifmap REG for output feature (4,0), which is shown as a box with blue diagonal hatching.

The solid arrows indicate the order of the convolution operation for the current column, and the dotted arrows indicate the order for the next column. Here max(Hi), Hi, and Wi are 3.

The blue boxes mark the input features inside the 3x3 convolution window, pd_conv is chosen as 4, and the batch sizes Mi and Ni are 1.

The input features in gray boxes are provided by the ifmap SRAM, and the rest are provided by the previous layer.

Because the convolution operation proceeds in partial vertical order, the features stored in ifmap REG are updated from the upper left toward the lower right.

For example, input feature (0,0) in ifmap REG is invalidated and replaced by input feature (2,2) provided by the ifmap SRAM.

After the convolution operation for output feature (4,0), the feature at position (1,0) of ifmap REG is invalidated and replaced by input feature (3,2), which is shown as a blue check pattern on a gray background.

Input feature (6,1) is required for the convolution operation of output feature (5,0). It is stored at position (4,2) of ifmap REG, shown with diagonal hatching on a gray background, and is provided by the ifmap stream.

Input feature (7,1) is provided by the ifmap stream and stored at position (5,2); it corresponds to the gray hatched box and is used in the convolution operation of output feature (6,0).

When the convolution operation for one column is complete, the partial vertical order convolution proceeds to the next column.

D. Memory structure for the first and last convolution layers

An input image is generally received in raster scan order, so the processing order must be changed for the partial vertical order convolution.

Figure 6 shows an example of the ifmap SRAM structure for the first layer of the CNN in this embodiment.

In this embodiment the convolution layer is assumed to include a 3x3 filter.

Each SRAM block includes H₀ single-port SRAM banks, and each SRAM block stores H₀ - 1 + pd_conv lines of the input data.

In each SRAM block, the data stored in the first and last (H₀ - 1)/2 banks is used for boundary handling of the convolution operation.

Accordingly, reads from the top (H₀ - 1)/2 banks of an SRAM block are performed in parallel with reads from the same number of banks at the bottom.

Each of these boundary-handling SRAM banks needs storage for L₀ x N₀ words.

SRAM bank (H₀ - 1)/2 stores pd_conv x L₀ x N₀ words, and this SRAM converts the streaming order from raster scan order to partial vertical order by changing addresses.

The last H₀ - 1 lines of the input data stored in one block are also stored in another block, for the convolution operation of the first convolution layer.

The ifmap SRAM of the first convolution layer includes multiple blocks, each with multiple buffers, to prevent conflicts between the input order and the output order of that layer.

The number of ifmap SRAM blocks in the first convolution layer is given by Equation 3.

On the right-hand side of Equation 3, the numerator is the number of clock cycles required for the write and read operations, and the denominator is the number of clock cycles required for the read operation.

Because a single-port SRAM shares one port for writes and reads, (H₀ - 1 + pd_conv) x L₀ clock cycles are needed to write input data to the SRAM, and pd_conv x L₀ clock cycles are needed to read input data from it.

Thus one SRAM block needs (H₀ - 1 + 2 x pd_conv) x L₀ clock cycles before it can write the next pd_conv x L₀ data items.

With a single block, the number of clock cycles needed to write data to the SRAM exceeds the number needed to read it, which causes memory overflow.

Therefore, SRAM with multiple blocks is required to keep the read throughput at or above the write throughput.
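The cycle counts above can be turned into a rough block-count estimate. Equation 3 itself is not reproduced in this text, so the sketch below assumes it is the ceiling of (write + read cycles) / (read cycles), as the description of its numerator and denominator suggests. Treat that rounding, and the function names, as assumptions.

```python
def block_cycles(H0, pd_conv, L0):
    """Return (write_cycles, read_cycles, total_cycles) for one SRAM block."""
    write = (H0 - 1 + pd_conv) * L0   # fill the block through the shared port
    read = pd_conv * L0               # drain it in partial vertical order
    return write, read, write + read


def num_blocks(H0, pd_conv, L0):
    """Assumed form of Equation 3: ceil(total_cycles / read_cycles)."""
    _, read, total = block_cycles(H0, pd_conv, L0)
    return -(-total // read)          # integer ceiling division


print(block_cycles(3, 4, 5), num_blocks(3, 4, 5))
```

With H₀ = 3, pd_conv = 4 and L₀ = 5, a block is busy for 50 cycles while only 20 cycles of new data can be drained, so under this assumption three blocks are needed to keep up.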

The last layer also requires SRAM, to convert the partial vertical processing order into raster scan order for the final output of the CNN.

The rasterizer includes two SRAM blocks, which follows from replacing H₀ and max(Hi) with 1 in Equation 3.

Each block includes one single-port SRAM bank.

Each rasterizer SRAM bank stores pd_conv lines of the ofmap and outputs the data in raster scan order by changing addresses.

CNN structures for SISR (Single Image Super Resolution) and noise removal apply a rasterizer so that both the input order and the output order are raster scan order.

However, a CNN for image classification does not need a rasterizer, because the ofmap of the final convolution layer is fed to a fully connected layer regardless of the output order of the feature map.

E. On-chip SRAM storing a multi-channel ifmap

If one ifmap SRAM is shared by a multi-channel ifmap, the number of ifmap SRAM ports decreases and the total SRAM area in the CNN decreases as well.

There are idle cycles between the lines of the input data, and these idle cycles are used to share the ifmap SRAM among the channels of a multi-channel ifmap.

Figure 7 shows an example of separating the channels of the output feature map in the first convolution layer.

Figures 7(a) and 7(b) illustrate the input order and the output order, respectively, with Ki = 12 and Li = 5.

The batch size is 1, Ni is 1, Mi is 2, Hi and Wi are 3, and zero padding is assumed.

In Figures 7(a) and 7(b), the numbers in the boxes indicate processing cycles and the arrows indicate the processing order.

Figure 7(c) shows the write and read operations on the buffer during clock cycles 0 to 59.

In Figure 7(a), the ifmap is input in raster scan order, and between consecutive rows of the ifmap there are five idle cycles, the same number as Li.

To account for the boundary cases of the 3x3 convolution, each block stores 6 lines of input features.

Thus block 1 stores input features from (3,0) to (8,0), while blocks 0 and 2 store 5 lines of input features because there is no data beyond the boundary.

In Figure 7(a), input features from (3,0) to (4,4) are stored in both blocks 0 and 1, and input features from (7,0) to (8,4) are stored in both blocks 1 and 2.

The input features stored in two SRAM blocks are marked with gray boxes.
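The block contents in the Figure 7 example can be reproduced with a short sketch. It assumes each block covers one portion of pd_conv rows plus (H₀ - 1)/2 boundary rows above and below, clipped at the image edges; that reading is inferred from the description, and the helper name is hypothetical.

```python
def block_rows(K, H0, pd_conv):
    """Rows of the input map held by each first-layer SRAM block."""
    halo = (H0 - 1) // 2                      # boundary rows on each side
    blocks = []
    for top in range(0, K, pd_conv):
        first = max(top - halo, 0)            # clip at the top edge
        last = min(top + pd_conv - 1 + halo, K - 1)  # clip at the bottom edge
        blocks.append(range(first, last + 1))
    return blocks


blocks = block_rows(K=12, H0=3, pd_conv=4)
# Rows stored twice, once in each of two neighbouring blocks.
shared = [sorted(set(a) & set(b)) for a, b in zip(blocks, blocks[1:])]

for i, rows in enumerate(blocks):
    print("block", i, list(rows))
print("shared rows:", shared)
```

For Ki = 12 this gives 5, 6 and 5 lines for blocks 0, 1 and 2, with rows 3 to 4 and rows 7 to 8 duplicated, matching the gray boxes described for Figure 7(a).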

In Figure 7(c), input data from (0,0) to (4,4) is stored in the ifmap SRAM during cycles 0 to 49.

Of these 50 cycles, 25 are idle cycles for the ifmap SRAM, which means that 2 cycles are available for the convolution processing of each output feature.
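The 50-cycle window can be checked with a few lines of arithmetic. The sketch assumes that every input row occupies Li write cycles followed by Li idle cycles, as the five idle cycles per row in Figure 7(a) suggest; the function name is hypothetical.

```python
def first_layer_window(rows, L):
    """Return (total_cycles, idle_cycles, cycles_per_feature).

    Assumption: each of the `rows` input rows takes L write cycles
    plus L idle cycles on the ifmap SRAM.
    """
    writes = rows * L
    idle = rows * L
    total = writes + idle
    return total, idle, total // writes


# Rows 0 to 4 of a map that is 5 features wide.
print(first_layer_window(rows=5, L=5))
```

Half of the window is idle, so each written feature leaves one spare SRAM cycle that can later serve a read or a second channel.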

During cycles 46 to 59, input features from (0,0) to (2,1) are read from the ifmap SRAM of the first convolution layer.

Read and idle states of the ifmap SRAM alternate, which matches the input and output throughput of the SRAM.

The output feature at (0,0) is produced as follows.

In cycles 46 to 55, input data from (0,0) to (0,1) is read from the ifmap SRAM and stored in ifmap REG.

In cycle 56, input feature (1,1) is read from the ifmap SRAM, and in cycle 57 it is stored in ifmap REG.

In cycle 57, input features (0,0), (1,0), (0,1), and (1,1) are available and the output feature at (0,0) is generated.

In this cycle there are output features (0,0) for two channels, channel 0 and channel 1, namely (0,0)₀ and (0,0)₁.

Channels 0 and 1 of the output feature can be separated by adding a one-cycle delay to channel 1, as shown in Figure 7(b).

Figure 8 illustrates an example of sharing the ifmap SRAM between the two channels of an ifmap.

Figures 8(a) and 8(b) show examples of the input order and the output order, respectively.

Here Ki, Li, Hi, Wi, and the batch size are the same as in the example of Figure 7, and Mi and Ni are assumed to be 2.

In Figures 8(a) and 8(b), the numbers inside the boxes indicate processing cycles and the arrows indicate the processing order.

Figure 8(c) shows the write and read operations on the buffer during clock cycles 32 to 59.

The buffer stores the ifmap, which is filtered by the 3x3 kernel of the convolution layer.

The feature at position (3,0) is output in the following sequence.

In Figure 8(a), channels 0 and 1 of the ifmap are input in different clock cycles and share the buffer.

In cycles 32 and 33, features (2,0)₀ and (2,0)₁ are read from the ifmap SRAM, and they are stored in ifmap REG in cycles 33 and 34.

In cycles 34 and 35, input features (3,0)₀ and (3,0)₁ are stored in ifmap REG.

During cycles 36 to 39, input features (2,4)₀, (2,4)₁, (3,4)₀, and (3,4)₁, shown in gray boxes, are stored in the ifmap SRAM.

During cycles 40 to 43, input features (2,1)₀, (2,1)₁, (3,1)₀, and (3,1)₁ are read from the ifmap SRAM, and during cycles 41 to 44 they are stored in ifmap REG.

In cycles 48 and 49, input features (4,1)₀ and (4,1)₁ provided by the previous layer are stored in ifmap REG.

In cycle 49, all features of both channels around input feature (3,0) are available and the convolution operation is performed.

To match the order of the output feature map to that of the ifmap, output feature (3,0)₁ is delayed by one cycle.

Accordingly, output features (3,0)₀ and (3,0)₁ are passed to the next convolution layer in cycles 50 and 51.

Output features (4,0), (5,0), and (6,0) are computed in cycles 51, 53, and 55.

During cycles 52 to 57, output features (4,0)₀, (5,0)₀, (6,0)₀, (4,0)₁, (5,0)₁, and (6,0)₁ are passed to the next layer.

Equation 4 gives the write and read timing at coordinate (h,w)_Ch of an ifmap SRAM shared by a multi-channel ifmap. Here Ch denotes the ifmap channel and is 0 or 1.

In Equation 4, N^s is the number of ifmap channels stored in the same ifmap SRAM. pd^s_conv and L^s_i are pd_conv and Li multiplied by N^s, respectively, and h_Ch equals N^s x h + Ch.
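The derived quantities used by Equation 4 are easy to compute directly. The sketch below implements only the three definitions stated above (pd^s_conv, L^s_i and h_Ch), not Equation 4 itself, which is not reproduced in this text. Function names are illustrative.

```python
def shared_params(pd_conv, L, n_s):
    """Return (pd_s_conv, L_s): the period and width scaled by N^s."""
    return pd_conv * n_s, L * n_s


def h_ch(h, ch, n_s):
    """Channel-interleaved row index: h_Ch = N^s x h + Ch."""
    if not 0 <= ch < n_s:
        raise ValueError("channel index out of range")
    return n_s * h + ch


print(shared_params(pd_conv=4, L=5, n_s=2))
print([h_ch(3, ch, 2) for ch in range(2)])
```

With two channels sharing one SRAM, row 3 maps to interleaved indices 6 and 7, so the two channels of a feature occupy adjacent slots in the schedule.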

The last convolution layer passes the output feature map to the rasterizer without separating the channels by delay.

Therefore, when the ifmap SRAM is shared by a two-channel ifmap, the rasterizer has idle cycles, and the number of idle cycles equals the number of its write cycles. This allows the number of SRAM banks in the rasterizer to be reduced to one.

Figure 9 is a graph showing the effect of the present invention.

Figures 9(a) and 9(b) show the areas of the ifmap SRAM and ifmap REG required by CNNs used for image enhancement and image classification.

As shown in Figure 9(a), the SRAM area in this embodiment is smaller than in the conventional technique by a factor of 1.84 on average, while the ifmap REG area is larger by a factor of 2.3.

However, ifmap REG is very small, about 117.95 times smaller than the ifmap SRAM, so the total on-chip memory area is still greatly reduced.
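The net effect of the three ratios quoted for Figure 9(a) can be estimated as follows. The sketch assumes that the 117.95 ratio compares the conventional ifmap SRAM with the conventional ifmap REG, which the text does not state explicitly, so the result is indicative only. The function name and parameter names are hypothetical.

```python
def total_area_reduction(sram_gain=1.84, reg_growth=2.3, sram_to_reg=117.95):
    """Ratio of conventional to proposed SRAM + REG area.

    Areas are expressed in units of the conventional ifmap REG area.
    Assumption: sram_to_reg relates the two conventional areas.
    """
    reg_old = 1.0
    sram_old = sram_to_reg * reg_old
    sram_new = sram_old / sram_gain   # SRAM shrinks by sram_gain
    reg_new = reg_old * reg_growth    # REG grows by reg_growth
    return (sram_old + reg_old) / (sram_new + reg_new)


print(round(total_area_reduction(), 2))
```

Under this assumption the combined area still shrinks by a factor of roughly 1.79, because the register growth is small relative to the SRAM savings.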

Figure 9(b) likewise shows that the memory area in this embodiment is smaller than in the prior art.

1, 100: hardware accelerator

Claims

An accelerator that performs a convolution filtering operation on an image map having a plurality of rows and a plurality of columns, the accelerator comprising:
a register that stores data while a portion of the image map, the portion including a predetermined number of consecutive rows among the plurality of rows, is scanned in column order; and
a memory that stores data corresponding to some rows of the portion when the portion is scanned in column order,
wherein the convolution filtering operation is performed using data stored in the register, and data that is required for the convolution filtering operation but is not stored in the register is read from the memory and stored in the register before the convolution filtering operation is performed.

The accelerator according to claim 1, wherein the minimum value of the predetermined number is twice the value obtained by subtracting 1 from the largest number of rows among one or more filters used for the convolution filtering operation.

The accelerator according to claim 1, wherein scanning within one column of the portion proceeds in the direction of increasing row number, and when all data in that column has been scanned, the convolution filtering operation is performed while the next column is scanned.

The accelerator according to claim 1, wherein when all data included in the portion of the image map has been scanned, the convolution filtering operation is performed while the remaining portion of the image map is scanned in column order.

The accelerator according to claim 1, wherein the memory is a single-port SRAM.

The accelerator according to claim 1, wherein the some rows include the last row of the portion.