KR102383962B1

KR102383962B1 - Deep learning accelerator with variable data encoder/decoder

Info

Publication number: KR102383962B1
Application number: KR1020200155060A
Authority: KR
Inventors: 이상설; 최병호; 장성준; 박종희
Original assignee: 한국전자기술연구원
Priority date: 2020-11-19
Filing date: 2020-11-19
Publication date: 2022-04-07
Also published as: WO2022107929A1

Abstract

Provided is a deep learning accelerator comprising a variable data compressor/decompressor. A deep learning accelerator, according to an embodiment of the present invention, comprises: an RDMA which directly accesses an external memory and reads data for a deep learning operation from the external memory; a decompressor which decompresses the data read by the RDMA; an input buffer in which the data decompressed by the decompressor is stored; an operator which performs the deep learning operation by means of the data stored in the input buffer; and a controller which identifies the statuses of the RDMA and the input buffer and controls decompression of the decompressor on the basis of the identified statuses. The compressor/decompressor can be controlled according to the statuses of a DMA and a buffer, thereby maintaining the output of the deep learning accelerator at a maximum level.

Description

Deep learning accelerator with variable data encoder/decoder

The present invention relates to image processing and System on Chip (SoC) technology, and more particularly to a deep learning accelerator that monitors the hardware state and controls a compressor/decompressor accordingly in order to keep deep learning hardware supplied with data.

Most conventional deep learning hardware accelerators take input data (feature maps) and convolution parameters (weights) as input and aim to perform the computation as fast as possible.

Because access to external memory cannot exceed the bandwidth the external memory allows, which is a physical constraint, compressing the input and output data makes it possible to supply more data within that bandwidth.

However, even when a simple lossless scheme with a high compression ratio is used for compression/decompression, an actual hardware implementation has the problem that the compression/decompression hardware can become larger than the accelerator itself.

Lossy compression causes an even greater problem, because it degrades performance.

The present invention has been devised to solve the above problems. Its object is to provide a deep learning accelerator that controls the compressor/decompressor according to the DMA and buffer states, as a way of continuously monitoring the state of the accelerator so that it can deliver maximum output.

To achieve the above object, a deep learning accelerator according to an embodiment of the present invention includes: an RDMA (Read Direct Memory Access) that directly accesses an external memory and reads data for a deep learning operation from the external memory; a decompressor that decompresses the data read by the RDMA; an input buffer in which the data decompressed by the decompressor is stored; an operator that performs the deep learning operation on the data stored in the input buffer; and a controller that identifies the statuses of the RDMA and the input buffer and controls the decompression operation of the decompressor on the basis of the identified statuses.

The deep learning accelerator according to the embodiment may further include: an output buffer in which the data produced by the deep learning operation of the operator is stored; a compressor that compresses the data stored in the output buffer; and a WDMA (Write Direct Memory Access) that directly accesses the external memory and writes the data compressed by the compressor to the external memory. The checker may identify the statuses of the WDMA and the output buffer and control the compression operation of the compressor on the basis of the identified statuses.

The checker may identify the bandwidth conditions of the RDMA and the WDMA. When the identified RDMA bandwidth condition is poor, the checker may control the decompressor to decompress the data supplied from the RDMA in advance and store it in an internal cache.

When the identified WDMA bandwidth condition is poor, the checker may control the compressor to compress the data stored in the output buffer in advance and store it in the internal cache.

The decompressor and the compressor may minimize parallel processing when the number of zeros in the IFmap and OFmap data is at or above a threshold.

The decompressor and the compressor may maximize parallel processing when the number of zeros in the Weight data is at or below a threshold.

A deep learning acceleration method according to another embodiment of the present invention includes: reading, by an RDMA (Read Direct Memory Access) that directly accesses an external memory, data for a deep learning operation from the external memory; decompressing, by a decompressor, the data read by the RDMA; storing, by an input buffer, the decompressed data; performing, by an operator, the deep learning operation on the stored data; and identifying, by a checker, the statuses of the RDMA and the input buffer and controlling the decompression operation of the decompressor on the basis of the identified statuses.

As described above, according to embodiments of the present invention, the compressor/decompressor can be controlled according to the DMA and buffer states, so the output of the deep learning accelerator can be kept at its maximum. In particular, parallel data processing and speed control as needed are possible, and compression/decompression can be performed with internal/external memory access patterns changed for per-channel and per-weight data processing.

FIG. 1 is a diagram showing a deep learning accelerator to which the present invention is applicable;
FIG. 2 is a block diagram showing the structure of a deep learning accelerator according to an embodiment of the present invention;
FIG. 3 is a block diagram showing the detailed structure of the operator shown in FIG. 2;
FIG. 4 is an example of a decompressor that supports run mode;
FIGS. 5 and 6 show the parallel processing in the Index Union Module and the Data Collect Module of the decompressor illustrated in FIG. 4; and
FIG. 7 is a flowchart provided to explain a method for controlling the operation of a deep learning accelerator according to another embodiment of the present invention.

Hereinafter, the present invention is described in more detail with reference to the drawings.

FIG. 1 is a diagram showing a deep learning accelerator to which the present invention is applicable. The illustrated deep learning accelerator includes an RDMA (Read Direct Memory Access) 11, an input buffer 12, an operator 13, an output buffer 14, and a WDMA (Write Direct Memory Access) 15.

The deep learning accelerator receives data from the external memory 10, performs a deep learning operation, and outputs the result to the external memory 10 for storage.

The data received from the external memory 10 are the IFmap (Input Feature map: feature data of the input image) and the Weight (convolution parameters of the deep learning model), and the deep learning operation result output to the external memory 10 is the OFmap (Output Feature map).

The operator 13 performs the deep learning operation on the data stored in the input buffer 12. In doing so, the operator 13 can compute and sum sequentially channel by channel for filtering, apply filtering to several channels simultaneously, and process multi-channel images simultaneously with a given filter.

Thus the IFmap and Weight must be input from the external memory 10 and the OFmap must be stored to the external memory 10, but for low-power operation the data traffic to and from the external memory 10 must be reduced.

Input/output bandwidth sufficient for the required data transfer is very important. Compressing and decompressing the input/output data normally reduces the amount of data during operation, but in certain layers the compression ratio actually gets worse.
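The effect on a dense layer can be illustrated with a small sketch. It assumes the zero-run symbol layout that the run-mode description later in this document suggests (an RI-bit count of zeros followed by one NZ-bit value). The function name `encoded_bits` and the handling of trailing zeros are illustrative assumptions, not taken from the patent.

```python
def encoded_bits(values, ri_bits=3, nz_bits=8):
    """Return the bit count of a zero-run encoding of `values`.

    Assumed format (not specified in the source): each symbol is an
    RI-bit count of zeros followed by one NZ-bit value, so one symbol
    covers between 1 and 2**ri_bits values.
    """
    max_run = (1 << ri_bits) - 1
    symbols = 0
    run = 0
    for v in values:
        if v == 0 and run < max_run:
            run += 1      # extend the current run of zeros
        else:
            symbols += 1  # emit (run, v); v is a literal 0 when the run is full
            run = 0
    if run:
        symbols += 1      # flush a trailing run of zeros as one symbol
    return symbols * (ri_bits + nz_bits)


raw_bits = 8 * 8                      # eight uncompressed 8-bit values
sparse = [0, 0, 0, 0, 0, 0, 0, 5]     # mostly zeros
dense = [1, 2, 3, 4, 5, 6, 7, 8]      # no zeros
print(encoded_bits(sparse), raw_bits)
print(encoded_bits(dense), raw_bits)
```

Under these assumptions, eight 8-bit values (64 raw bits) encode to 11 bits when seven of them are zero, but to 88 bits when none are, so the dense case ends up larger than the uncompressed data.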

In addition, when the external memory 10 is shared and cannot be accessed, a data shortage occurs, which can cause the compressor/decompressor to malfunction. To supply enough data, the compressor/decompressor therefore needs to be controlled according to the states of the DMA and the buffers.

Accordingly, an embodiment of the present invention applies a lightweight compressor/decompressor to the deep learning accelerator and presents a technique for controlling it by checking the states of the DMA and the buffers.

FIG. 2 is a block diagram showing the structure of a deep learning accelerator according to an embodiment of the present invention. As shown in FIG. 2, the deep learning accelerator according to the embodiment includes an RDMA 110, a decompressor (Decoder) 120, an input buffer 130, an operator 140, an output buffer 150, a compressor (Encoder) 160, a WDMA 170, and a checker 180.

The RDMA 110 directly accesses the external memory 10 and reads data for the deep learning operation from the external memory 10. The data read by the RDMA 110 are the IFmap and the Weight.

The decompressor 120 decompresses the compressed data (IFmap and Weight) read by the RDMA 110. The data decompressed by the decompressor 120 is stored in the input buffer 130.

The operator 140 is a module that performs the deep learning operation on the data stored in the input buffer 130. FIG. 3 is a block diagram showing the detailed structure of the operator 140 shown in FIG. 2.

As shown, the operator 140 includes the modules needed for the deep learning operation: a convolution operation module 141, an address tree module 142, a batch normalization module 143, an Add Bias module 144, an Activation module 145, and a Maxpool module 146.

The output buffer 150 is a buffer in which the OFmap, the data produced by the deep learning operation of the operator 140, is stored. The compressor 160 compresses the data (OFmap) stored in the output buffer 150. The WDMA 170 directly accesses the external memory 10 and stores the data compressed by the compressor 160 in the external memory 10.

The checker 180 checks the RDMA 110 and the input buffer 130, and checks the output buffer 150 and the WDMA 170, to identify the bandwidth condition of the RDMA 110 and the bandwidth condition of the WDMA 170.

The checker 180 then controls the decompression/compression operation of the decompressor 120 and the compressor 160 on the basis of the DMA bandwidth condition (channel condition).

Specifically, when the bandwidth condition of the RDMA 110 is poor, the checker 180 has the decompressor 120 decompress the data supplied from the RDMA 110 in advance and store it in an internal cache.

Likewise, when the bandwidth condition of the WDMA 170 is poor, the checker 180 has the compressor 160 compress the data stored in the output buffer 150 in advance and store it in the internal cache.
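The checker's decision can be sketched as a small function. This is only an illustration: the source does not say how the bandwidth condition is measured or what counts as "poor", so the normalized bandwidth metric, the `low_bw` threshold, and the names `checker_decide`, `decompress_ahead_to_cache`, and `compress_ahead_to_cache` are all assumptions.

```python
def checker_decide(rdma_bw, wdma_bw, low_bw=0.5):
    """Map observed DMA bandwidth conditions to control flags.

    rdma_bw, wdma_bw: fraction of nominal bandwidth currently available
    (an assumed metric). low_bw: assumed threshold below which the
    condition is treated as poor.
    """
    return {
        # Poor read bandwidth: decompress incoming data early and cache it.
        "decompress_ahead_to_cache": rdma_bw < low_bw,
        # Poor write bandwidth: compress output-buffer data early and cache it.
        "compress_ahead_to_cache": wdma_bw < low_bw,
    }


print(checker_decide(rdma_bw=0.2, wdma_bw=0.9))
```

The two flags are independent, which mirrors the text: the read path and the write path are checked and controlled separately.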

A compressor/decompressor that supports run mode uses symbols consisting of the bits allocated for the user-specified run count (RI) and the bits allocated for the data (NZ). That is, when the compressed stream is received as RI bits + NZ bits and processed, each symbol can generate at most 2^RI data items and at least one data item.
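A minimal decoder sketch makes the 1 to 2^RI range concrete. It assumes that the RI field holds the number of zeros that precede one NZ-bit value; the source does not spell out the field semantics, and the function name `decode_run_mode` and the bit-string input are illustrative only.

```python
def decode_run_mode(bits, ri_bits=3, nz_bits=8):
    """Decode a string of '0'/'1' characters made of (run, value) symbols.

    Assumed semantics: `run` zeros are emitted, followed by `value`.
    A trailing partial symbol is left undecoded.
    """
    width = ri_bits + nz_bits
    out = []
    for pos in range(0, len(bits) - width + 1, width):
        run = int(bits[pos:pos + ri_bits], 2)
        value = int(bits[pos + ri_bits:pos + width], 2)
        out.extend([0] * run)  # the run of zeros
        out.append(value)      # the value that ends the run
    return out


print(decode_run_mode("111" + "00000101"))  # longest run for RI = 3
print(decode_run_mode("000" + "00000011"))  # shortest run
```

With RI = 3, the first symbol expands to eight items (seven zeros and the value 5) and the second to a single item (the value 3), matching the stated maximum of 2^RI and minimum of one.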

When N symbols are processed in parallel, up to N*(2^RI) data items can be processed, so additional memory is needed to store information about the valid stream of the initial data in the internal core and to remember the positions of the required data.
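One way to picture the position bookkeeping is an exclusive prefix sum over the lengths produced by each parallel lane. The sketch below is an assumption about how such positions could be tracked, not the patent's circuit; `lane_offsets` is a hypothetical name, and each lane is assumed to decode one symbol that yields its run of zeros plus one value.

```python
def lane_offsets(runs):
    """Return (start offsets, total length) for N symbols decoded in parallel.

    runs[i] is the zero-run count of the symbol handled by lane i, so that
    lane produces runs[i] + 1 output items (assumed symbol semantics).
    """
    lengths = [r + 1 for r in runs]
    offsets = [0] * len(lengths)
    for i in range(1, len(lengths)):
        # Each lane starts where the previous lane's output ends.
        offsets[i] = offsets[i - 1] + lengths[i - 1]
    return offsets, sum(lengths)


offsets, total = lane_offsets([7] * 8)  # N = 8 lanes, RI = 3, all runs maximal
print(offsets, total)
```

For N = 8 and RI = 3 with every run at its maximum, the total is 8 * 2^3 = 64 items, which is the upper bound stated above; with every run at zero, the total is 8.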

Because a few bits of internal memory are enough to secure the data the core needs, when the memory access channel is in poor condition the parallel processing can be performed ahead of time to secure the required data in advance (operating within the size of the internal Global Buffer).

FIG. 4 is a diagram illustrating the internal structure of the decompressor. The data to be stored include: 1) the current input data count, 2) input data not yet decoded, 3) data remaining in the cache, 4) the number of data items remaining in the cache, 5) the target count to be output, and 6) the count of data output so far. The input signals include: 1) 128-bit data and a valid signal, 2) a DMA ready signal, 3) the target count to be output, 4) a signal to save the current state, 5) a signal to load the saved state, 6) a signal to clear the state and buffers, and 7) a data request signal arriving at the cache. The output signals include: 1) 64-bit data and a valid signal, 2) a DMA data request signal, 3) the current input data count, 4) the count of data output so far, 5) a save-in-progress indication, 6) a load-in-progress indication, 7) a clear-in-progress indication, and 8) an idle indication after a process completes.
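The six stored items and the save, load, and clear signals can be modeled in software as a small state record. This is a behavioral sketch only: the class and field names are hypothetical, and it assumes that clearing resets the live state but keeps the previously saved snapshot, which the source does not specify.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DecoderState:
    input_count: int = 0                                     # 1) current input data count
    pending_input: List[int] = field(default_factory=list)   # 2) input not yet decoded
    cache_data: List[int] = field(default_factory=list)      # 3) data remaining in cache
    cache_count: int = 0                                     # 4) items remaining in cache
    target_count: int = 0                                    # 5) target count to output
    output_count: int = 0                                    # 6) count output so far


class Decoder:
    def __init__(self) -> None:
        self.state = DecoderState()
        self._saved: Optional[DecoderState] = None

    def save(self) -> None:
        """Model of the 'save current state' signal."""
        self._saved = copy.deepcopy(self.state)

    def load(self) -> None:
        """Model of the 'load saved state' signal; no-op if nothing was saved."""
        if self._saved is not None:
            self.state = copy.deepcopy(self._saved)

    def clear(self) -> None:
        """Model of the 'clear state and buffers' signal (snapshot kept: assumption)."""
        self.state = DecoderState()


dec = Decoder()
dec.state.output_count = 5
dec.save()
dec.clear()
dec.load()
print(dec.state.output_count)
```

Saving and restoring the full record is what would let a decompressor switch between streams, for example between IFmap and Weight data, without losing partially decoded input.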

FIGS. 5 and 6 illustrate the parallel processing in the Index Union Module and the Data Collect Module of the decompressor illustrated in FIG. 4, specifically the process of supplying the data needed for parallel processing from the input data and of collecting the output data.

This example is the case where the user sets the run to 3 bits and the data to 8 bits and processes eight symbols in parallel. Because the hardware involves no complex operations, multiple parallel cores can be provided and data can be compressed/decompressed in parallel according to the DMA and buffer conditions.

In addition, for Activation data (IFmap/OFmap), parallel processing is minimized for layer data in which zeros occur at or above a threshold, which enables low-power operation. For Weight data, parallel processing is maximized for channel data in which zeros are at or below a threshold, that is, almost absent, so that enough of the required data can be secured.
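The rule above can be sketched as a lane-count selector. The thresholds, the maximum lane count, and the middle setting used when neither condition holds are assumptions (the source states only the two extreme cases), and `choose_parallelism` is a hypothetical name.

```python
def choose_parallelism(values, kind, max_lanes=8, hi_zero=0.7, lo_zero=0.1):
    """Pick how many parallel lanes to use for a block of data.

    kind: "activation" (IFmap/OFmap) or "weight".
    hi_zero, lo_zero: assumed zero-ratio thresholds.
    """
    zero_ratio = values.count(0) / len(values) if values else 0.0
    if kind == "activation" and zero_ratio >= hi_zero:
        return 1              # many zeros: minimize parallelism to save power
    if kind == "weight" and zero_ratio <= lo_zero:
        return max_lanes      # almost no zeros: maximize parallelism
    return max_lanes // 2     # assumed default; not stated in the source


print(choose_parallelism([0] * 9 + [1], "activation"))
print(choose_parallelism([1] * 10, "weight"))
```

The intuition is that a sparse activation block already expands to many values per symbol, so few lanes suffice, whereas dense weights yield about one value per symbol and need more lanes to keep the operator supplied.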

FIG. 7 is a flowchart provided to explain a method for controlling the operation of a deep learning accelerator according to another embodiment of the present invention.

As shown, the RDMA 110 of the deep learning accelerator reads data for the deep learning operation from the external memory 10 (S210). The decompressor 120 then decompresses the compressed data read in step S210 and stores it in the input buffer 130 (S220).

The operator 140 performs the deep learning operation on the data stored in step S220 and stores the result in the output buffer 150 (S230).

The compressor 160 then compresses the data stored in step S230 (S240), and the WDMA 170 stores the data compressed in step S240 in the external memory 10 (S250).

While steps S210 to S250 are being performed, the checker 180 checks the RDMA 110 and the input buffer 130, and checks the output buffer 150 and the WDMA 170, to identify the DMA bandwidth condition (S260).

The checker 180 then controls the decompression/compression operation of the decompressor 120 and the compressor 160 on the basis of the DMA bandwidth condition identified in step S260 (S270).

So far, a deep learning accelerator including a variable data compressor/decompressor has been described in detail with reference to preferred embodiments.

The above embodiments present a model that continuously monitors the state of the accelerator so that it can deliver maximum output. The model is applicable to flexible deep learning devices and to a variety of networks and layers.

It provides a hardware structure capable of parallel data processing and speed control as needed, and a compression/decompression hardware structure for a deep learning accelerator in which internal/external memory access patterns are changed for per-channel and per-weight data processing.

It goes without saying that the technical idea of the present invention can also be applied to a computer-readable recording medium containing a computer program that performs the functions of the apparatus and method according to this embodiment. The technical ideas according to various embodiments of the present invention may also be implemented as computer-readable code recorded on a computer-readable recording medium. The computer-readable recording medium may be any data storage device that can be read by a computer and can store data, for example a ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical disk, or hard disk drive. Computer-readable code or programs stored on such a medium may also be transmitted over a network connecting computers.

Although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described. Those of ordinary skill in the art to which the invention pertains can make various modifications without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical idea or prospect of the present invention.

110: RDMA (Read Direct Memory Access)
120: decompressor
130: input buffer
140: operator
141: convolution operation module
142: address tree module
143: batch normalization module
144: Add Bias module
145: Activation module
146: Maxpool module
150: output buffer
160: compressor
170: WDMA (Write Direct Memory Access)
180: checker

Claims

1. A deep learning accelerator comprising:
an RDMA (Read Direct Memory Access) that directly accesses an external memory and reads data for a deep learning operation from the external memory;
a decompressor that decompresses the data read by the RDMA;
an input buffer in which the data decompressed by the decompressor is stored;
an operator that performs the deep learning operation on the data stored in the input buffer;
an output buffer in which the data produced by the deep learning operation of the operator is stored;
a compressor that compresses the data stored in the output buffer;
a WDMA (Write Direct Memory Access) that directly accesses the external memory and writes the data compressed by the compressor to the external memory; and
a controller that identifies the statuses of the RDMA and the input buffer and controls the decompression operation of the decompressor on the basis of the identified statuses, and that identifies the statuses of the WDMA and the output buffer and controls the compression operation of the compressor on the basis of the identified statuses,
wherein the decompressor and the compressor minimize parallel processing when the number of zeros in the IFmap and OFmap data is at or above a threshold.

2. (Deleted)

3. The deep learning accelerator according to claim 1,
wherein the controller identifies the bandwidth conditions of the RDMA and the WDMA.

4. The deep learning accelerator according to claim 3,
wherein, when the identified RDMA bandwidth condition is poor, the controller controls the decompressor to decompress the data supplied from the RDMA in advance and store it in an internal cache.

5. The deep learning accelerator according to claim 4,
wherein, when the identified WDMA bandwidth condition is poor, the controller controls the compressor to compress the data stored in the output buffer in advance and store it in the internal cache.

6. (Deleted)

7. The deep learning accelerator according to claim 1,
wherein the decompressor and the compressor maximize parallel processing when the number of zeros in the Weight data is at or below a threshold.

8. A deep learning acceleration method comprising:
reading, by an RDMA (Read Direct Memory Access) that directly accesses an external memory, data for a deep learning operation from the external memory;
decompressing, by a decompressor, the data read by the RDMA;
storing, by an input buffer, the decompressed data;
performing, by an operator, the deep learning operation on the data stored in the input buffer;
storing, by an output buffer, the data produced by the deep learning operation;
compressing, by a compressor, the data stored in the output buffer;
writing, by a WDMA (Write Direct Memory Access) that directly accesses the external memory, the compressed data to the external memory;
identifying, by a controller, the statuses of the RDMA and the input buffer and controlling the decompression operation of the decompressor on the basis of the identified statuses; and
identifying, by the controller, the statuses of the WDMA and the output buffer and controlling the compression operation of the compressor on the basis of the identified statuses,
wherein the decompressor and the compressor minimize parallel processing when the number of zeros in the IFmap and OFmap data is at or above a threshold.