KR20210074681A

KR20210074681A - Low Complexity Deep Learning Acceleration Hardware Data Processing Device

Info

Publication number: KR20210074681A
Application number: KR1020190165642A
Authority: KR
Inventors: 이상설; 장성준; 박종희
Original assignee: 한국전자기술연구원
Priority date: 2019-12-12
Filing date: 2019-12-12
Publication date: 2021-06-22
Also published as: WO2021117942A1

Abstract

Provided is deep learning accelerator hardware that reduces the number of external memory accesses while structuring data requests predictably, and that maximizes data reuse to reduce peak bandwidth. According to an embodiment of the present invention, the deep learning acceleration device comprises: a deep learning accelerator that computes input data; an encoder that compresses output data of the deep learning accelerator; and a WDMA that writes the output data compressed by the encoder to an external memory, wherein the encoder compresses the output data by selectively applying different compression methods based on a context of the output data. The present invention thereby reduces the number of accesses to a large-capacity external memory that the deep learning accelerator would otherwise need each time it processes data for the same channel/weight.

Description

Low Complexity Deep Learning Acceleration Hardware Data Processing Device

The present invention relates to image processing hardware technology using artificial intelligence, and more particularly to the structure of a hardware accelerator that performs deep learning processing on an input image, and a design method thereof.

Many techniques are being researched and developed for reusing input image data (feature maps) and input convolution parameters (weights) in hardware accelerators for deep learning processing. The aim is to reduce external memory accesses by reusing data read from external memory as much as possible.

Meanwhile, the data on which a kernel will operate is generated from the input image for each channel and stored in internal or external memory, and is then loaded again to perform the operation. When the input image data is large, writing it to and reading it from memory consumes a lot of power.

In addition, when this data is stored in a large-capacity, low-speed external memory in a hardware implementation, data must be fetched from the external memory every time. This not only prevents high-speed processing but also makes an increase in bandwidth unavoidable during data input/output.

The present invention has been devised to solve the above problems. An object of the present invention is to provide deep learning accelerator hardware that reduces the number of external memory accesses while structuring data requests predictably, maximizes data reusability, and reduces peak bandwidth.

According to an embodiment of the present invention for achieving the above object, a deep learning accelerator includes: a deep learning accelerator that performs operations on input data; an encoder that compresses the output data of the deep learning accelerator; and a WDMA that writes the output data compressed by the encoder to an external memory, wherein the encoder compresses the output data by selectively applying different compression schemes based on the context of the output data.

The encoder may perform lossless compression on the output data.

When the output data are identical, the encoder may encode the output data using the number of identical data.

When the output data differ, the encoder may encode the output data using the differences between the data.

The compressed data may be recorded in a compressed stream separately for each channel, and the compressed stream may include a header that records the position and length of the compressed data within the compressed stream.

The deep learning accelerator according to the present invention may further include: an RDMA that reads compressed input data from the external memory; and a decoder that decompresses the compressed input data read by the RDMA, wherein the deep learning accelerator may perform operations on the input data decompressed by the decoder.

The RDMA may adaptively determine the number of channels to receive as input.

According to another aspect of the present invention, there is provided a deep learning accelerator data processing method comprising: performing, by a deep learning accelerator, operations on input data; determining, by an encoder, a context of the data output from the deep learning accelerator; compressing, by the encoder, the output data by selectively applying different compression schemes based on the determined context; and writing, by a WDMA, the compressed output data to an external memory.

As described above, according to embodiments of the present invention, it becomes possible to reduce the number of accesses to a large-capacity external memory that a deep learning accelerator needs each time it processes data for the same channel/weight.

In addition, according to embodiments of the present invention, data reusability in the deep learning accelerator is increased while peak bandwidth is reduced, so that processing speed can be improved by minimizing data buffering time.

FIG. 1 is a block diagram of a low-complexity deep learning hardware accelerator according to an embodiment of the present invention;
FIG. 2 is a diagram schematically illustrating a DMA structure and data flow to which 16-channel tiling is applied;
FIG. 3 is a diagram showing the configuration and data input/output flow of a lossless encoder; and
FIG. 4 is a diagram illustrating the structure of a multi-channel-based deep learning data compression stream.

Hereinafter, the present invention will be described in more detail with reference to the drawings.

FIG. 1 is a block diagram of a low-complexity deep learning hardware accelerator according to an embodiment of the present invention.

As shown in FIG. 1, the deep learning hardware accelerator according to an embodiment of the present invention includes an RDMA (Read Direct Memory Access) 110, a lossless decoder 120, a CNN (Convolutional Neural Network) accelerator 130, a lossless encoder 140, and a WDMA (Write Direct Memory Access) 150.

The RDMA 110 reads input data from an external memory and stores it in an internal cache. The input data includes an IFmap (input feature map) and convolution parameters (weights).

The input data is losslessly compressed. Accordingly, the lossless decoder 120 performs lossless decoding on the input data read by the RDMA 110 to decompress it.

The CNN accelerator 130 performs operations on the input data decompressed by the lossless decoder 120 and outputs the result. The output data of the CNN accelerator 130 is an OFmap (output feature map).

The lossless encoder 140 losslessly compresses the output data of the CNN accelerator 130 and stores it in the cache of the WDMA 150. The WDMA 150 then writes the output data compressed by the lossless encoder 140 to the external memory.

The detailed operations of the RDMA 110 and the WDMA 150 shown in FIG. 1 are described below with reference to FIG. 2, which schematically illustrates a DMA structure and data flow to which 16-channel tiling is applied.

A CNN performs its operations with a computational load on the order of Width × Height × Input Channel × Output Channel. The size of each IFmap for the CNN is Width × Height × Input Channel; the size of the weights is n × m × Input Channel × Output Channel when an n × m kernel is used; and the size of the OFmap is Width × Height × Output Channel.
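The size relations above can be written out as a small helper. This is only an illustrative sketch: the class name `ConvLayerShape` and its field names are hypothetical, and the counts are in elements, not bytes, since the text does not fix a storage width here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayerShape:
    """Hypothetical container for the layer dimensions named in the text."""
    width: int
    height: int
    in_channels: int
    out_channels: int
    kernel_n: int
    kernel_m: int

    def ifmap_elems(self) -> int:
        # IFmap size: Width x Height x Input Channel
        return self.width * self.height * self.in_channels

    def weight_elems(self) -> int:
        # Weight size: n x m x Input Channel x Output Channel
        return (self.kernel_n * self.kernel_m
                * self.in_channels * self.out_channels)

    def ofmap_elems(self) -> int:
        # OFmap size: Width x Height x Output Channel
        return self.width * self.height * self.out_channels


if __name__ == "__main__":
    layer = ConvLayerShape(width=8, height=4, in_channels=16,
                           out_channels=32, kernel_n=3, kernel_m=3)
    print(layer.ifmap_elems(), layer.weight_elems(), layer.ofmap_elems())
```

For a hypothetical 8 × 4 layer with 16 input and 32 output channels and a 3 × 3 kernel, this gives 512 IFmap elements, 4,608 weight elements, and 1,024 OFmap elements.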

The IFmap and weights are read from the external memory, and the OFmap is written to the external memory. Bandwidth is very important for this data input/output. Because requesting input data for a deep learning operation demands peak bandwidth, that peak bandwidth needs to be spread out.

Accordingly, in an embodiment of the present invention, the number of channels for which data can be received over the AXI interface without stalling the arithmetic units is determined. For example, one of 2, 4, 8, 16, or 32 channels can be selectively set. The set of selectable channel counts can be expanded or reduced according to the bit width used for the operations.

For example, if the AXI interface has a bit width of 512 bits, a burst length of 16, and 8 multiple outstanding requests, the kernel size is 3×3, the Fmap is 17 to 32 bits, and the weights are 16 bits, then 16 channels are selected. Because the design is line-memory based, data is requested in advance through the RDMA 110 and stored in the internal cache, and the core loads it from there to build the line memory.

The data that can be obtained with one multiple-outstanding request is, within at most 32 clocks, {2048 data (at 32 bits each) = 16 pixels × 16 channel data}, and the number of operations that can be processed at once is 2,304 (3 × 3 × 16 ch (in) × 16 ch (out)), which can be processed simultaneously and stored to the WDMA 150. Because processing and outputting 16 pixels with 16-channel operation leaves headroom in the bandwidth of the RDMA 110 and WDMA 150, the peak bandwidth is not exceeded.
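The two figures in this example can be checked with a few lines of arithmetic. The text does not derive the 2048 figure explicitly; the sketch below assumes it is bus width × burst length × outstanding requests divided by a 32-bit word, which does reproduce the stated number. The function names are hypothetical.

```python
def words_per_outstanding_batch(bus_bits: int = 512,
                                burst_len: int = 16,
                                outstanding: int = 8,
                                word_bits: int = 32) -> int:
    """Number of word_bits-wide data items delivered by one batch of
    outstanding AXI bursts (assumed reading of the 2048 figure)."""
    total_bits = bus_bits * burst_len * outstanding
    return total_bits // word_bits


def ops_per_step(kernel_n: int = 3, kernel_m: int = 3,
                 in_channels: int = 16, out_channels: int = 16) -> int:
    """Operations processed at once: n x m x ch(in) x ch(out)."""
    return kernel_n * kernel_m * in_channels * out_channels


if __name__ == "__main__":
    print(words_per_outstanding_batch())  # 2048
    print(ops_per_step())                 # 2304
```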

The detailed operations of the lossless decoder 120 and the lossless encoder 140 shown in FIG. 1 are described below.

If all data were buffered for the operation in the above example, a peak bandwidth of Pcal × n × m × Input Channel × Output Channel × 2 would be required, where Pcal is the number of parallel operations, and an addition buffer twice the input/output bit width would also be needed for the accumulation of the OFmap.

Processing data in that way therefore consumes a lot of power for data input/output, and the control for data input/output inevitably becomes very complicated.

Accordingly, to reduce the amount of data, the deep learning data is compressed. Specifically, compressing it per layer, and in particular per channel within a layer, increases the compression ratio. Because the IFmap and OFmap have the same form as an image, the same methods used for image compression can be applied.

FIG. 3 shows the configuration of the lossless encoder 140 and its data input/output flow. As shown, the lossless encoder 140 selectively applies two compression methods based on the context of the OFmap, which is the output data of the CNN accelerator 130.

One is lossless compression in Regular Mode. When the OFmap data to be compressed differ from one another, the output data are encoded using the difference values between the data.

The other is lossless compression in Run Mode. When the OFmap data to be compressed are all the same, the output data are encoded using the number of identical data.
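The two modes can be illustrated with a minimal sketch. The text does not specify how the context is determined, how tokens are laid out, or whether entropy coding follows, so everything below is an assumption made for illustration: the context is simply "are consecutive samples equal", the token format `(mode, value, count)` is hypothetical, and no bit-level coding is performed.

```python
from typing import List, Tuple

# Hypothetical token: ("run", value, count) or ("diff", delta, 1)
Token = Tuple[str, int, int]


def encode(samples: List[int], min_run: int = 2) -> List[Token]:
    """Encode samples with a run mode for repeated values and a
    regular (difference) mode otherwise."""
    tokens: List[Token] = []
    prev = 0  # predictor: previous decoded sample, starting at 0
    i = 0
    while i < len(samples):
        # Count how many consecutive samples equal samples[i].
        j = i
        while j < len(samples) and samples[j] == samples[i]:
            j += 1
        run = j - i
        if run >= min_run:
            # Run mode: store the value once with its repeat count.
            tokens.append(("run", samples[i], run))
            i = j
        else:
            # Regular mode: store the difference from the previous sample.
            tokens.append(("diff", samples[i] - prev, 1))
            i += 1
        prev = samples[i - 1]
    return tokens


def decode(tokens: List[Token]) -> List[int]:
    """Invert encode(); the round trip is lossless."""
    out: List[int] = []
    prev = 0
    for mode, value, count in tokens:
        if mode == "run":
            out.extend([value] * count)
            prev = value
        else:
            prev += value
            out.append(prev)
    return out


if __name__ == "__main__":
    data = [5, 5, 5, 7, 9, 9]
    toks = encode(data)
    print(toks)
    assert decode(toks) == data
```

Feature maps after an activation such as ReLU often contain long runs of zeros, which is the kind of input where a run mode pays off; that observation is general background rather than something the text states.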

FIG. 4 is a diagram illustrating the structure of a multi-channel-based deep learning data compression stream. In the compressed stream, the compressed data are recorded separately for each channel. In addition, to support random access, the header of the compressed stream records the position and length information of the compressed data.
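A header of this kind can be sketched as a table of (offset, length) entries, one per channel, placed in front of the concatenated per-channel payloads. The actual field widths and ordering in FIG. 4 are not given in the text, so the layout below (a 32-bit little-endian channel count followed by 32-bit offset/length pairs) and the function names are assumptions for illustration only.

```python
import struct
from typing import List


def pack_stream(channel_payloads: List[bytes]) -> bytes:
    """Concatenate per-channel compressed data behind a header that
    records each channel's position and length (assumed layout)."""
    count = len(channel_payloads)
    header_size = 4 + 8 * count  # count field + (offset, length) per channel
    entries = []
    offset = header_size
    for payload in channel_payloads:
        entries.append(struct.pack("<II", offset, len(payload)))
        offset += len(payload)
    header = struct.pack("<I", count) + b"".join(entries)
    return header + b"".join(channel_payloads)


def read_channel(stream: bytes, channel: int) -> bytes:
    """Random access: fetch one channel's data using only the header."""
    (count,) = struct.unpack_from("<I", stream, 0)
    if not 0 <= channel < count:
        raise IndexError("channel out of range")
    offset, length = struct.unpack_from("<II", stream, 4 + 8 * channel)
    return stream[offset:offset + length]


if __name__ == "__main__":
    stream = pack_stream([b"abc", b"", b"xyz12"])
    print(read_channel(stream, 2))
```

Because each entry has a fixed size, a reader can locate any channel's data with a single header lookup instead of scanning the preceding channels.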

The illustrated compressed stream structure can be applied not only to the OFmap but also to the IFmap and the weights.

So far, a low-complexity deep learning hardware accelerator has been described in detail with reference to preferred embodiments.

In the embodiments of the present invention, the number of input channels is set adaptively, and the data is handled in losslessly compressed form in order to reduce the power consumed by input/output and to simplify control.

In addition, the compressed data is recorded in the compressed stream separately for each channel, and information about the delimiters is recorded in the header of the compressed stream to enable random access.

As a result, the number of accesses to a large-capacity external memory that the deep learning hardware accelerator needs each time it processes data for the same channel/weight can be reduced, data reusability is increased, and processing speed can be improved by reducing peak bandwidth and minimizing data buffering time.

Although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described. Various modifications can of course be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or prospect of the present invention.

110: RDMA
120: Lossless decoder
130: CNN accelerator
140: Lossless encoder
150: WDMA

Claims

1. A deep learning accelerator comprising:
a deep learning accelerator that performs operations on input data;
an encoder that compresses the output data of the deep learning accelerator; and
a WDMA that writes the output data compressed by the encoder to an external memory,
wherein the encoder compresses the output data by selectively applying different compression methods based on the context of the output data.

2. The deep learning accelerator of claim 1,
wherein the encoder performs lossless compression on the output data.

3. The deep learning accelerator of claim 2,
wherein, when the output data are identical, the encoder encodes the output data using the number of identical data.

4. The deep learning accelerator of claim 2,
wherein, when the output data differ, the encoder encodes the output data using the differences between the data.

5. The deep learning accelerator of claim 1,
wherein the compressed data is recorded in a compressed stream separately for each channel, and
the compressed stream includes a header that records the position and length of the compressed data within the compressed stream.

6. The deep learning accelerator of claim 1, further comprising:
an RDMA that reads compressed input data from the external memory; and
a decoder that decompresses the compressed input data read by the RDMA,
wherein the deep learning accelerator performs operations on the input data decompressed by the decoder.

7. The deep learning accelerator of claim 6,
wherein the RDMA adaptively determines the number of channels to receive as input.

8. A deep learning accelerator data processing method comprising:
performing, by a deep learning accelerator, operations on input data;
determining, by an encoder, a context of the data output from the deep learning accelerator;
compressing, by the encoder, the output data by selectively applying different compression schemes based on the determined context; and
writing, by a WDMA, the compressed output data to an external memory.