KR20190129240A

KR20190129240A - Neural network processor based on row operation and data processing method using thereof

Info

Publication number: KR20190129240A
Application number: KR1020180053570A
Authority: KR
Inventors: 하순회; 강진택
Original assignee: 서울대학교산학협력단
Priority date: 2018-05-10
Filing date: 2018-05-10
Publication date: 2019-11-20
Also published as: KR102126857B1; WO2019216513A1

Abstract

A row-wise operation neural processor includes: an input unit that inputs data; a feature map on-chip memory unit that stores input pixels of input feature map raw data for the data input through the input unit in mutually adjacent rows on a per-channel basis; a filter weight on-chip memory unit that stores filter pixels of filter weight raw data for the input data in mutually adjacent rows on a per-channel basis; a feature map buffer unit that generates input feature map row data by storing, row by row, the data held in the feature map on-chip memory unit; a filter weight buffer unit that generates filter weight row data by storing, row by row, the data held in the filter weight on-chip memory unit; a convolution calculation unit that generates convolution data by multiplying the input feature map row data and the filter weight row data element by element; an adder tree unit that calculates a partial sum from the convolution data; and an output buffer unit that receives and stores the calculated partial sum through a pipeline forming a data path connected from the adder tree unit, and outputs the stored partial sum. The present invention can thereby minimize the connection overhead between layers.

Description

NEURAL NETWORK PROCESSOR BASED ON ROW OPERATION AND DATA PROCESSING METHOD USING THEREOF

The present invention relates to a row-wise operation neural processor and a data processing method using the same. More specifically, it relates to a row-wise operation neural processor that provides efficient data processing using row-wise operations, on-chip memory, and a pipeline, and that lets a user program its various components and data paths through an API for vector operations, and to a data processing method using the same.

In general, a neural processor is a processor used in networks or systems with a heavy computational load, such as neural networks, and goes by various names such as neural engine, neuro processor, and neuro computer. Neural processors are widely used across machine learning, and especially in the convolutional neural network (CNN), a form of convolution-based deep learning. A CNN is a type of deep neural network (DNN) widely used in machine learning applications such as image classification and object recognition. Because such a network demands a very large amount of computation, the computational performance of the neural processor, the hardware accelerator that speeds it up, is a critical factor for CNNs.

Various methods have been proposed to improve the computational performance of such neural processors. However, a typical neural processor has a structure in which the external memory and the computation parts (the convolution part, the activation part, and the pooling part) are separate, so the problem that data transfer between the memory and the computation units consumes considerable time and energy remains unsolved.

Korean Patent Laid-Open Publication No. 10-2016-0142791 (2016.12.13)

The technical problem addressed by the present invention was conceived in view of the above. An object of the present invention is to provide a row-wise operation neural processor in which memory is included in the on-chip data path and in which computational performance and data processing time are improved.

Another object of the present invention is to provide a row-wise operation data processing method using a row-wise operation neural processor in which memory is included in the on-chip data path and in which computational performance and data processing time are improved.

To achieve the above object of the present invention, a row-wise operation neural processor includes: an input unit that inputs data; a feature map on-chip memory unit that stores input pixels of input feature map raw data for the input data input to the input unit in mutually adjacent rows on a per-channel basis; a filter weight on-chip memory unit that stores filter pixels of filter weight raw data for the input data in mutually adjacent rows on a per-channel basis; a feature map buffer unit that stores, row by row, the data held in the feature map on-chip memory unit to generate input feature map row data; a filter weight buffer unit that stores, row by row, the data held in the filter weight on-chip memory unit to generate filter weight row data; a convolution calculation unit that multiplies the input feature map row data and the filter weight row data element by element to generate convolution data; an adder tree unit that calculates a partial sum from the convolution data; and an output buffer unit that receives and stores the calculated partial sum through a pipeline forming a data path connected from the adder tree unit, and outputs the stored partial sum.
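The multiply-then-reduce step performed by the convolution calculation unit and the adder tree unit can be sketched in Python as follows. This is only an illustration: the function names and the zero fill used for odd-sized levels are assumptions, not details taken from the specification.

```python
def elementwise_multiply(feature_row, weight_row):
    """Multiply a feature map row and a filter weight row element by element."""
    if len(feature_row) != len(weight_row):
        raise ValueError("rows must have the same width")
    return [f * w for f, w in zip(feature_row, weight_row)]


def adder_tree(values):
    """Reduce the products pairwise, level by level, like a hardware adder tree."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # assumed: pad an odd level with a zero operand
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def row_partial_sum(feature_row, weight_row):
    """Partial sum contributed by one pair of rows."""
    return adder_tree(elementwise_multiply(feature_row, weight_row))


print(row_partial_sum([1, 2, 3, 4], [1, 0, 2, 1]))  # 11
```

Each call yields one partial sum. In the processor described here, such partial sums are passed to the output buffer unit through the pipeline.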

In one embodiment of the present invention, the row-wise operation neural processor may further include: an activation function calculation unit that receives the output of the output buffer unit and determines, from an activation function, whether the output is in an active or inactive state; an activation output on-chip memory unit that stores the output of the activation function calculation unit; and a pooling unit that pools the data stored in the output on-chip memory unit.

In one embodiment of the present invention, the pipeline may form a single data transfer path in which the outputs and inputs of the adder tree unit, the output buffer unit, the activation function calculation unit, the output on-chip memory unit, and the pooling unit are connected sequentially in a point-to-point (P2P) manner.
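The point-to-point chain described above can be pictured as a sequence of stages in which each stage consumes only the output of the previous one. The Python sketch below is a software analogy under stated assumptions: the specification names only "an activation function" and "pooling", so ReLU and one-dimensional max pooling with a window of 2 are illustrative choices, and all function names are hypothetical.

```python
def relu(x):
    """Assumed activation function: pass positive values, zero out the rest."""
    return x if x > 0 else 0


def max_pool_1d(values, window=2):
    """Assumed pooling: maximum over consecutive, non-overlapping windows."""
    return [max(values[i:i + window]) for i in range(0, len(values), window)]


def run_pipeline(partial_sums, activation=relu, pool=max_pool_1d):
    """Pass data through the stages in the order the text lists them."""
    output_buffer = list(partial_sums)                   # adder tree -> output buffer
    activated = [activation(v) for v in output_buffer]   # output buffer -> activation
    output_memory = list(activated)                      # activation -> output on-chip memory
    return pool(output_memory)                           # output on-chip memory -> pooling


print(run_pipeline([-1, 3, 2, -5]))  # [3, 2]
```

Because every stage has exactly one producer and one consumer, no stage needs to address a shared bus or an external memory between steps.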

In one embodiment of the present invention, the row-wise operation neural processor may further include an API program unit supporting a vector operation API (Application Programming Interface) that defines and controls the processing method and order of the components of the row-wise operation neural processor.

In one embodiment of the present invention, the row-wise operation neural processor may further include an output re-input unit through which the output of the pooling unit is fed back into the feature map on-chip memory unit, the output re-input unit including a double buffer in which storing the output data and processing the re-input data are performed simultaneously.
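A double buffer of this kind is commonly built as two banks that swap roles: one is filled with new output while the other is read as the next input. The Python sketch below models that idea only; the class and method names are hypothetical, and sequential software can merely imitate the simultaneous store and read that the hardware performs.

```python
class DoubleBuffer:
    """Two banks: one is being written while the other is available for reading."""

    def __init__(self):
        self._banks = [[], []]
        self._write = 0  # index of the bank currently being filled

    def store(self, value):
        """Append a new output value to the bank being filled."""
        self._banks[self._write].append(value)

    def swap(self):
        """Expose the filled bank for reading and start filling the other one."""
        self._write ^= 1
        self._banks[self._write] = []

    def read_bank(self):
        """Return a copy of the bank that is not being written."""
        return list(self._banks[self._write ^ 1])


buf = DoubleBuffer()
buf.store(1)
buf.store(2)
buf.swap()              # values 1 and 2 become readable
buf.store(3)            # new output goes to the other bank
print(buf.read_bank())  # [1, 2]
```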

In one embodiment of the present invention, the feature map buffer unit may reuse the input feature map row data, and the filter weight buffer unit may reuse the filter weight row data.

In one embodiment of the present invention, the channel size of the input feature map raw data is a power of 2; the memory width of the feature map on-chip memory unit is a power of 2 and is equal to, or a divisor or multiple of, the channel size of the input feature map raw data; and the filter weight on-chip memory unit may perform a zero-padding function that, in the last row in which filter pixels are stored, fills the remaining portion holding no valid data with zeros up to the end of that row.

In one embodiment of the present invention, the filter weight buffer unit may be a shift buffer.

In one embodiment of the present invention, the filter weight buffer unit is a cyclic shift buffer whose size is a multiple of the row width of the filter weight on-chip memory unit. The cyclic shift buffer shifts the filter weight row data by the offset of the start address of the input feature map row data, performs a zero-padding function that fills with zeros the space vacated by the shifted filter weight row data, and, after shifting the last row of the filter weight row data, may reset all or part of the cyclic shift buffer to zero.

In one embodiment of the present invention, the row-wise operation neural processor may further include an external memory that is connected outside the chip and provides additional space for storing the input feature map raw data and the filter weight raw data.

To achieve the above object of the present invention, a row-wise operation data processing method, in a system that processes data using a row-wise operation neural processor, includes: a step in which an input unit receives data and generates input data; a step in which a feature map on-chip memory unit stores input pixels of the input feature map raw data in mutually adjacent rows on a per-channel basis; a step in which a filter weight on-chip memory unit stores filter pixels of the filter weight raw data in mutually adjacent rows on a per-channel basis; a step in which a feature map buffer unit stores, row by row, the data held in the feature map on-chip memory unit to generate input feature map row data; a step in which a filter weight buffer unit stores, row by row, the data held in the filter weight on-chip memory unit to generate filter weight row data; a step in which a convolution calculation unit multiplies the input feature map row data and the filter weight row data element by element to generate convolution data; a step in which an adder tree unit calculates a partial sum from the convolution data; and a step in which an output buffer unit receives and stores the calculated partial sum through a pipeline forming a data path connected from the adder tree unit, and outputs the stored partial sum. The step of storing the feature map raw data and the step of storing the filter weight raw data, as well as the step of generating the input feature map row data and the step of generating the filter weight row data, may be performed in parallel.

In one embodiment of the present invention, the row-wise operation data processing method may further include: a step in which an activation function calculation unit receives the output of the output buffer unit and determines, from an activation function, whether the output is in an active or inactive state; a step in which an output on-chip memory unit stores the output of the activation function calculation unit; and a step in which a pooling unit pools the data stored in the output on-chip memory unit.

In one embodiment of the present invention, the pipeline may form a single data transfer path in which the outputs and inputs of the adder tree unit, the output buffer unit, the activation function calculation unit, the output on-chip memory unit, and the pooling unit are connected sequentially in a point-to-point (P2P) manner.

In one embodiment of the present invention, the row-wise operation data processing method may further include a step in which an API program unit uses a vector operation API (Application Programming Interface) to define and control the processing method and order of the components of the row-wise operation neural processor.

In one embodiment of the present invention, the method may further include a step in which an output re-input unit feeds the output of the pooling unit back into the feature map on-chip memory unit, and in the re-input step a double buffer of the output re-input unit may simultaneously store the pooling unit output and process the re-input data.

In one embodiment of the present invention, the step of generating the input feature map row data may include a step in which the feature map buffer unit reuses the input feature map row data, and the step of generating the filter weight row data may include a step in which the filter weight buffer unit reuses the filter weight row data.

In one embodiment of the present invention, the channel size of the input feature map raw data is a power of 2; the memory width of the feature map on-chip memory unit is a power of 2 and is equal to, or a divisor or multiple of, the channel size of the input feature map raw data; and the step of storing the filter pixels of the filter weight raw data in mutually adjacent rows on a per-channel basis may further include a step in which the filter weight on-chip memory unit performs a zero-padding function that, in the last row in which filter pixels are stored, fills the remaining portion holding no valid data with zeros up to the end of that row.

In one embodiment of the present invention, a shift by the offset of the start address of the input feature map row data may be applied so that the input feature map row data and the output position are aligned.

In one embodiment of the present invention, the step of generating the filter weight row data may further include: a step in which the cyclic shift buffer of the filter weight buffer unit stores one row of the filter weight on-chip memory in the cyclic shift buffer; a step of shifting the data stored in the filter weight buffer unit by the offset of the start address of the input feature map row data; a step of performing a zero-padding function that fills with zeros the space vacated by the shifted filter weight row data; and a step of resetting all or part of the cyclic shift buffer to zero after the last row of data has been shifted. The cyclic shift buffer of the filter weight buffer unit may have a size that is a multiple of the row width of the filter weight on-chip memory unit.
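The load, shift, zero-fill, and reset steps above can be sketched in Python as follows. The specification does not spell out the wrap-around behaviour at this point, so placing each weight at position `(i + offset) % size` is an assumption, as are the class name `ShiftBuffer` and its methods.

```python
class ShiftBuffer:
    """Simplified model of a cyclic shift buffer for one filter weight row."""

    def __init__(self, row_width, multiple=2):
        # The buffer size is a multiple of the on-chip memory row width.
        self.row_width = row_width
        self.cells = [0] * (row_width * multiple)

    def load_and_shift(self, weight_row, offset):
        """Load one memory row and shift it by `offset`, zero-filling the rest."""
        if len(weight_row) != self.row_width:
            raise ValueError("weight row must match the memory row width")
        size = len(self.cells)
        shifted = [0] * size  # vacated positions are left as zeros
        for i, w in enumerate(weight_row):
            shifted[(i + offset) % size] = w  # assumed cyclic placement
        self.cells = shifted
        return list(self.cells)

    def clear(self):
        """Reset the whole buffer to zero after the last row has been shifted."""
        self.cells = [0] * len(self.cells)


buf = ShiftBuffer(row_width=4, multiple=2)
print(buf.load_and_shift([1, 2, 3, 4], offset=2))  # [0, 0, 1, 2, 3, 4, 0, 0]
```

Shifting the weights rather than the feature map row lines the two rows up position by position, so the element-wise multiplication needs no further addressing logic.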

In one embodiment of the present invention, the row-wise operation data processing method may further include a step in which an external memory stores the input feature map raw data and the filter weight raw data.

According to embodiments of the present invention, the row-wise operation neural processor and the data processing method using the same include an input unit, a feature map on-chip memory unit, a filter weight on-chip memory unit, a feature map buffer unit, a filter weight buffer unit, a convolution calculation unit, an adder tree unit, an output buffer unit, and a pipeline, and may include an activation function calculation unit, an activation output on-chip memory unit, a pooling unit, and an output re-input unit. Because the memory is placed on-chip and integrated into the data processing path, the hardware structure and operation can be simplified. Because convolution, the activation function, and pooling are merged into a single pipeline and the output feature map can be used directly as the input feature map of the next convolution stage, the connection overhead between layers can be minimized.

In addition, the present invention lets a user easily program each component, including the on-chip memories, and the data paths using the vector API. It can improve performance in proportion to the number of filters by reusing the input feature map across a plurality of filters, which makes it easy to scale. It can also accelerate parts other than the convolution operation, so that various layers such as fully connected layers can be accelerated.
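The reuse of one input feature map row across several filters can be illustrated with a short Python sketch: the feature row is fetched once and then combined with each filter row in turn. The function name is hypothetical, and the sketch shows only the data reuse, not the parallel hardware that turns it into a speed-up.

```python
def partial_sums_for_filters(feature_row, filter_rows):
    """Compute one partial sum per filter from a single loaded feature map row."""
    sums = []
    for weight_row in filter_rows:
        if len(weight_row) != len(feature_row):
            raise ValueError("every filter row must match the feature row width")
        # The same feature_row is reused here; it is not fetched again per filter.
        sums.append(sum(f * w for f, w in zip(feature_row, weight_row)))
    return sums


print(partial_sums_for_filters([1, 2, 3], [[1, 0, 0], [0, 1, 1], [1, 1, 1]]))  # [1, 5, 6]
```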

FIG. 1 is a block diagram illustrating a row-wise operation neural processor according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating a row-wise operation neural processor according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a row-wise operation data processing method according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a row-wise operation data processing method according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating the step of generating input data in a row-wise operation data processing method according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating the step of storing filter pixels in mutually adjacent rows on a per-channel basis in a row-wise operation data processing method according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating the step of generating input feature map row data in a row-wise operation data processing method according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating the step of generating filter weight row data in a row-wise operation data processing method according to an embodiment of the present invention.
FIG. 9 is a diagram illustrating the structure of a row-wise operation neural processor according to an embodiment of the present invention.
FIG. 10 is a diagram illustrating the structure of a row-wise operation neural processor according to an embodiment of the present invention.
FIG. 11 is a diagram illustrating the structure of a row-wise operation neural processor according to an embodiment of the present invention.
FIG. 12 is a diagram illustrating the layout of input feature map raw data and filter weight raw data in a row-wise operation neural processor and a row-wise operation data processing method according to an embodiment of the present invention.
FIG. 13 is a diagram illustrating the layout of input feature map raw data and filter weight raw data in a row-wise operation neural processor and a row-wise operation data processing method according to an embodiment of the present invention.
FIG. 14 is a diagram illustrating the layout of input feature map raw data and filter weight raw data in a row-wise operation neural processor and a row-wise operation data processing method according to an embodiment of the present invention.
FIG. 15 is a diagram illustrating the convolution operation process of a row-wise operation neural processor and a row-wise operation data processing method according to an embodiment of the present invention.
FIG. 16 is a diagram illustrating the vector operation API of a row-wise operation neural processor and a row-wise operation data processing method according to an embodiment of the present invention.
FIG. 17 is a diagram illustrating the vector operation API of a row-wise operation neural processor and a row-wise operation data processing method according to an embodiment of the present invention.
FIG. 18 is a diagram illustrating the vector API of a row-wise operation neural processor and a row-wise operation data processing method according to an embodiment of the present invention.
FIG. 19 is a diagram illustrating pipeline execution in a row-wise operation neural processor and a row-wise operation data processing method according to an embodiment of the present invention.

The present invention may be modified in various ways and may take various forms, and embodiments are described in detail in the text. This is not intended to limit the present invention to a specific disclosed form, and it should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention. In describing the drawings, similar reference numerals are used for similar elements. Terms such as first and second may be used to describe various elements, but the elements should not be limited by those terms.

These terms are used only to distinguish one element from another. The terminology used in this application is used only to describe particular embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise.

In this application, terms such as "include" or "consist of" are intended to specify the presence of the features, numbers, steps, operations, elements, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the drawings.

FIG. 1 is a block diagram illustrating a row-wise operation neural processor according to an embodiment of the present invention.

Referring to FIG. 1, a row-wise operation neural processor according to an embodiment of the present invention includes an input unit 100, a feature map on-chip memory unit 200, a filter weight on-chip memory unit 300, a feature map buffer unit 400, a filter weight buffer unit 500, a convolution calculation unit 600, an adder tree unit 700, and an output buffer unit 800.

The input unit 100 may receive various data that require a convolution operation. For example, the input data input to the input unit 100 may be image data. For example, the input data may be a two-dimensional image with a width of W and a height of H.

The feature map on-chip memory unit 200 may store the input pixels of the input feature map raw data for the input data input to the input unit 100 in mutually adjacent rows on a per-channel basis. The input feature map raw data may be data obtained by applying a filter to the input data. The input pixels may be the data obtained by dividing the input feature map raw data into processing units. Referring to FIG. 12 as an example, in the six-dimensional loop operation on the input data of a CNN, let the axis in the channel number direction be the Z axis, the axis in the kernel height direction be the Y axis, and the axis in the kernel width direction be the X axis. In a general CNN, as in FIG. 12(a), the first three loops of the six-dimensional loop operation are iterated in the order of the X, Y, and Z axes. Here, by contrast, they are iterated in the order of the Z, Y, and X axes, as in FIG. 12(b). As a result, each input pixel may be data separated along the ZY plane formed by the Z axis and the Y axis, and these input pixels may be stored in the rows of the memory adjacent to each other on a per-channel basis.
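One plausible reading of the Z, Y, X ordering is that the channel index varies fastest, then the height index, then the width index, and that the resulting stream is cut into memory rows. The Python sketch below follows that reading. The `fmap[z][y][x]` indexing, the function name, and the choice to leave a short last row unpadded are assumptions made for illustration.

```python
def layout_zyx(fmap, mem_width):
    """Flatten fmap[z][y][x] with Z fastest, then Y, then X, and cut it into rows."""
    channels = len(fmap)
    height = len(fmap[0])
    width = len(fmap[0][0])
    stream = [
        fmap[z][y][x]
        for x in range(width)      # outermost: one ZY plane per x position
        for y in range(height)
        for z in range(channels)   # innermost: channels are adjacent in memory
    ]
    # Consecutive chunks of the stream become consecutive (adjacent) memory rows.
    return [stream[i:i + mem_width] for i in range(0, len(stream), mem_width)]


# Tiny 2-channel, 2x2 example whose values encode their own (z, y, x) position.
fmap = [[[100 * z + 10 * y + x for x in range(2)] for y in range(2)] for z in range(2)]
print(layout_zyx(fmap, mem_width=4))  # [[0, 100, 10, 110], [1, 101, 11, 111]]
```

In this example each memory row holds exactly one ZY plane, so a single row read delivers every channel and every height position for one x coordinate.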

The channel size of the input feature map raw data may be a power of 2. That is, the channel size, which is the maximum value of the channel number, may be a power of 2. The memory width of the feature map on-chip memory unit 200 is a power of 2 and may be equal to, or a divisor or multiple of, the channel size of the input feature map raw data. Equivalently, the channel size may be a multiple or a divisor of the memory width of the feature map on-chip memory unit 200. Referring to FIG. 13 as an example, the channel size is a multiple of the memory width of the feature map on-chip memory unit 200, and the input pixels obtained by dividing the input data into ZY-plane units may be stored sequentially in mutually adjacent rows of the feature map on-chip memory unit 200.
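Any two powers of 2 divide one another, so the divisor-or-multiple condition above holds automatically once both sizes are powers of 2. The Python sketch below checks the power-of-2 requirement and reports how channel vectors map onto memory rows. The function names and the returned tuple are illustrative assumptions.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... using the single-set-bit test."""
    return n > 0 and (n & (n - 1)) == 0


def row_mapping(channel_size, mem_width):
    """Return (rows per channel vector, channel vectors per row)."""
    if not is_power_of_two(channel_size) or not is_power_of_two(mem_width):
        raise ValueError("channel size and memory width must be powers of 2")
    if channel_size >= mem_width:
        # The channel size is a multiple of the width: one vector spans several rows.
        return channel_size // mem_width, 1
    # The channel size is a divisor of the width: several vectors share one row.
    return 1, mem_width // channel_size


print(row_mapping(64, 16))  # (4, 1)
print(row_mapping(8, 32))   # (1, 4)
```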

Every element of a row of the feature map on-chip memory unit 200 may take part in a meaningful operation for the convolution. The ZY plane formed by the Z axis and the Y axis may contain more pixels than the XY plane formed by the X axis and the Y axis. The feature map on-chip memory unit 200 may be a scratchpad memory (SPM). A plurality of feature map on-chip memory units 200 may be provided.

The filter weight on-chip memory unit 300 may store the filter pixels of the filter weight raw data for the input data in mutually adjacent rows on a per-channel basis. The filter weight raw data may be a weight filter to be applied to the input data. A filter pixel may be a portion of the weight filter divided into a processing unit. For example, referring to FIG. 12, in the six-dimensional loop operation on the input data of a CNN, when the axis in the channel number direction is the Z axis, the axis in the kernel height direction is the Y axis, and the axis in the kernel width direction is the X axis, each filter pixel may be data separated into a ZY plane formed by the Z axis and the Y axis, and these filter pixels may be stored in the rows of the memory adjacent to one another on a per-channel basis.

The filter weight on-chip memory unit 300 may perform a zero-padding function that fills the portion of the last row storing the filter weight raw data in which no data is stored with zeros up to the end of that row. As a result, the filter weight on-chip memory unit 300 may process the data aligned on row boundaries.
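The zero-padding of the last row can be sketched as follows. The function name `pack_filter_rows` and the use of a flat Python list for the weights of one filter pixel are hypothetical and serve only to illustrate the row alignment.

```python
def pack_filter_rows(weights, width):
    """Split a flat list of filter weights into rows of `width` elements,
    filling the unused tail of the last row with zeros."""
    rows = [list(weights[i:i + width]) for i in range(0, len(weights), width)]
    if rows and len(rows[-1]) < width:
        rows[-1] += [0] * (width - len(rows[-1]))  # zero padding to the row end
    return rows
```

For instance, six weights stored in rows of width four occupy two rows, the second of which ends in two zeros.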

The size of the weight filter may determine the calculation completion unit of the convolution calculation unit 600 or the adder tree unit 700. For example, when the size of the weight filter is K, the width of the feature map on-chip memory is W, the channel size is C, and C is x times W, the calculation completion unit of the convolution calculation unit 600 and the adder tree unit 700 may be x multiplied by K. The calculation completion unit may be expressed in cycles.
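As a small worked illustration of this relationship, the following sketch computes the completion unit x × K. The function name `completion_cycles` is hypothetical, and the check that C is an exact multiple of W simply mirrors the condition stated in the example.

```python
def completion_cycles(K, W, C):
    """Cycles per completed calculation when the channel size C is x times
    the memory width W: the result is x * K."""
    if C % W != 0:
        raise ValueError("this example assumes C is a multiple of W")
    x = C // W
    return x * K
```

With K = 3, W = 16 and C = 64, x is 4 and the completion unit is 12 cycles.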

The filter weight on-chip memory unit 300 and the feature map on-chip memory unit 200 may be scratchpad memories (SPM). A plurality of filter weight on-chip memory units 300 may be provided. The feature map on-chip memory unit 200 and the filter weight on-chip memory unit 300 may be part of the data path through which the input data is processed. Accordingly, the memory may have an integrated structure that shares the same data path as the arithmetic units.

The feature map buffer unit 400 may generate input feature map row data by storing the data held in the feature map on-chip memory unit 200 row by row.

The feature map buffer unit 400 may reuse the input feature map row data. For example, referring to FIG. 10, a plurality of filter weight on-chip memory units 300 and filter weight buffer units 500 are provided, and the input feature map row data may be applied to the plurality of filter weight row data. Accordingly, a single input feature map row data may be applied to multiple filter weight row data and input to the convolution calculation unit 600, which achieves parallelization at the level of weight filters. Whether the input feature map row data is reused may be programmed by the API program unit 1300.

The filter weight buffer unit 500 may generate filter weight row data by storing the data held in the filter weight on-chip memory unit 300 row by row. The filter weight buffer unit 500 may be a shift buffer.

The filter weight buffer unit 500 may be a circular shift buffer. The circular shift buffer may have a size that is a multiple of the row width of the filter weight on-chip memory unit 300. For example, the size of the circular shift buffer may be twice the row width of the filter weight on-chip memory unit 300.

The circular shift buffer may shift the filter weight row data by the offset of the start address of the input feature map row data. For example, when the offset of the start address of the input feature map raw data is d, the circular shift buffer may store the first row of the filter weight raw data as the filter weight row data and then shift it by d in the next cycle. As a result, the positions of the elements of the input feature map row data and the positions of the elements of the filter weight row data may be aligned. The circular shift buffer may perform a zero-padding function that fills the space vacated by the shifted filter weight row data with zeros. The shifted filter weight row data, now aligned in position with the input feature map row data, may be transmitted element by element to the convolution calculation unit 600.

This operation may be repeated up to the last row of one filter pixel. For example, the circular shift buffer stores the first row of the filter pixel as the filter weight row data, shifts it in the next cycle by the offset of the start address of the input feature map raw data, and outputs it to the convolution calculation unit 600. After the output, the circular shift buffer shifts its internal data by the row width of the filter weight on-chip memory unit 300 minus the offset and, at the same time, stores the next row of the filter weight raw data as the filter weight row data. These steps may be repeated up to the last row of the filter pixel.

After outputting the last row of one filter pixel and before receiving the next filter pixel, the circular shift buffer may initialize all or part of itself to zero. For example, when shifting after the last row of one filter pixel has been output to the convolution calculation unit 600, the circular shift buffer may initialize the shifted space to zero instead of storing the first row of the next filter pixel. Alternatively, the entire circular shift buffer may be initialized.
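The shift sequence described above can be modelled in software as follows. This is only one possible reading of the hardware, offered as a sketch: the buffer is assumed to be twice the row width, new rows are assumed to be loaded into the lower half, the output is assumed to be taken from the lower half, and the names `rotate`, `align_filter_rows` and `n_valid` (the number of valid weights in the filter pixel) are hypothetical.

```python
def rotate(buf, k):
    """Circularly shift `buf` by k positions toward higher indices."""
    k %= len(buf)
    return buf[-k:] + buf[:-k] if k else list(buf)


def align_filter_rows(filter_rows, d, n_valid):
    """Return filter rows shifted by offset d so that they line up with
    input feature map rows whose first valid element sits at position d."""
    W = len(filter_rows[0])
    n_out = -(-(n_valid + d) // W)        # rows needed after the shift (rounded up)
    buf = [0] * (2 * W)                   # buffer starts cleared to zero
    aligned = []
    for j in range(n_out):
        # Load the next row, or zeros once the filter pixel is exhausted.
        buf[:W] = filter_rows[j] if j < len(filter_rows) else [0] * W
        buf = rotate(buf, d)              # shift by the offset d
        aligned.append(buf[:W])           # output the aligned row
        buf = rotate(buf, W - d)          # shift the remainder by (W - d)
    return aligned
```

In this model the two shifts per row add up to W, so the tail of each row that did not fit in the output wraps around and becomes the head of the next output row, while the zeros present at start-up supply the leading padding.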

For example, referring to FIG. 14, when a convolution operation is to be performed for the pixel at coordinates (x, y) of the output feature map, the very first element of the input feature map is located d elements away from the start of its row. The address in the feature map on-chip memory unit 200 of the first element of the input feature map used in the operation may be defined by Equation 1 below.

Equation 1

sA = C × fH × x + C × y

Here, sA is the address in the feature map on-chip memory unit 200 of the first element of the input feature map used in the operation, C is the channel number (size), fH is the width of the input feature map, x is the X-axis coordinate of the output pixel, and y is the Y-axis coordinate of the output pixel.

An output pixel may be a portion of the output feature map divided into a processing unit. The output feature map may be the output of the output buffer unit 800 or of the pooling unit 1100.

Accordingly, d, the offset at which the first element of the input feature map starts, is sA % W, and the number of rows constituting a ZY plane may be defined by Equation 2 below.

Equation 2

nR = (C × K + d) / W

Here, nR is the number of rows constituting a ZY plane, C is the channel number (size), K is the size of the weight filter, d is the offset of the start address, and W is the width of the filter weight on-chip memory unit 300.
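Equations 1 and 2 can be evaluated with a few lines of code. The sketch below uses hypothetical function names, and it rounds the quotient of Equation 2 up to a whole number of rows; that rounding is an assumption of this example, since the equation as written does not state how a non-integer quotient is handled.

```python
def start_address(C, fH, x, y):
    """Equation 1: address of the first input element for output pixel (x, y)."""
    return C * fH * x + C * y


def start_offset(sA, W):
    """Offset d of the first element within its memory row: sA % W."""
    return sA % W


def rows_per_zy_plane(C, K, d, W):
    """Equation 2, rounded up to whole rows (rounding is an assumption)."""
    return -(-(C * K + d) // W)
```

For C = 4, fH = 3, K = 2 and W = 8, the output pixel (0, 1) gives sA = 4 and d = 4, so the eight weights of one ZY plane straddle two rows, whereas the output pixel (1, 1) gives sA = 16 and d = 0, so they fit in a single row.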

In the filter weight on-chip memory unit 300, the filter pixels constituting a filter are aligned to row boundaries, so the rows may be aligned by performing the zero-padding function that fills the portion of the last row of a filter pixel that is not valid data with zeros. The filter weight buffer unit 500 may generate the filter weight row data from the aligned rows.

When the channel size of the input feature map raw data is a power of two, the number of possible values of the start address offset is limited, so the implementation and operation of the circular shift buffer may be simplified.

For example, referring to FIG. 15, the offset d of the start address of the input feature map raw data is 2, and the circular shift buffer of the filter weight buffer unit 500 is twice the size of W, the width of the filter weight on-chip memory unit 300. The circular shift buffer shifts its internal data by (W − d) while reading and storing one row of data from the filter weight on-chip memory unit 300, and then shifts the data it has read by d and outputs it to the convolution calculation unit 600.

The filter weight buffer unit 500 may reuse the filter weight row data. For example, when the row width of the filter weight on-chip memory unit 300 is larger than the channel size, a filter weight buffer unit 500 of a given size may be reused by applying it repeatedly to the input feature map row data. When the filter weight row data is reused, moving data from the filter weight on-chip memory unit 300 to the filter weight buffer unit 500, or shifting data in the shift buffer inside the filter weight buffer unit 500, may be omitted. Whether the filter weight row data is reused may be programmed by the API program unit 1300.

For example, the filter weight row data stored in the filter weight buffer unit 500 may be used for convolution operations on multiple input feature map row data without repeated accesses to the filter weight on-chip memory unit 300.

A plurality of feature map on-chip memory units 200 and feature map buffer units 400 may be provided.

The convolution calculation unit 600 may generate convolution data by multiplying the input feature map row data and the filter weight row data element by element. The convolution calculation unit 600 may receive the input feature map row data from the feature map buffer unit 400 and the filter weight row data from the filter weight buffer unit 500, and perform element-wise multiplication. The element-wise multiplication may be processed in parallel by a plurality of multipliers.

The adder tree unit 700 may calculate a partial sum from the convolution data. The adder tree unit 700 may receive the convolution data from the convolution calculation unit 600. The partial sum of the adder tree may be fed back into the adder tree.
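The element-wise multiplication followed by an adder tree with feedback can be sketched as below. The function names are hypothetical, and the pairwise reduction is only a software analogue of a hardware adder tree; the feedback of the partial sum is modelled by a running accumulator.

```python
def elementwise_multiply(feature_row, weight_row):
    """Multiply aligned feature map and filter weight rows element by element."""
    return [f * w for f, w in zip(feature_row, weight_row)]


def adder_tree(values):
    """Sum a row by pairwise reduction, level by level, like an adder tree."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)               # pad odd levels with a zero input
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0


def output_pixel(feature_rows, weight_rows):
    """Accumulate the partial sums of all rows into one output pixel."""
    acc = 0
    for f_row, w_row in zip(feature_rows, weight_rows):
        acc += adder_tree(elementwise_multiply(f_row, w_row))  # fed-back partial sum
    return acc
```

Zero-padded positions in either row contribute nothing to the sum, which is why aligned rows can be multiplied in full without masking.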

The output buffer unit 800 may receive and store the calculated partial sum through a pipeline forming a data path from the adder tree unit 700, and may output the stored partial sum. The output buffer unit 800 may store the result when the adder tree unit 700 completes the calculation of one output pixel. The output of the output buffer unit 800 may be the output feature map of the row-wise operation neural processor. A plurality of output buffer units 800 may be provided.

In the pipeline, the outputs and inputs of the adder tree unit 700, the output buffer unit 800, the activation function calculation unit 900, the output on-chip memory unit, and the pooling unit 1100 may be connected sequentially in a point-to-point (P2P) manner to form a single data transmission path.

FIG. 2 is a block diagram illustrating a row-wise operation neural processor according to an embodiment of the present invention.

The row-wise operation neural processor according to this embodiment is substantially the same as the row-wise operation neural processor of FIG. 1 except for the activation function calculation unit 900, the activation output on-chip memory unit 1000, the pooling unit 1100, and the output re-input unit 1200. Accordingly, components identical to those of the row-wise operation neural processor of FIG. 1 are given the same reference numerals, and repeated description is omitted.

Referring to FIG. 2, the row-wise operation neural processor according to an embodiment of the present invention may include an activation function calculation unit 900, an activation output on-chip memory unit 1000, a pooling unit 1100, an output re-input unit 1200, an API program unit 1300, and an external memory 1400.

As an example of the structure of the row-wise operation neural processor, referring to FIG. 9, the processor may include the feature map on-chip memory unit 200, the filter weight on-chip memory unit 300, the feature map buffer unit 400, the filter weight buffer unit 500, the convolution calculation unit 600, the adder tree unit 700, the output buffer unit 800, the activation function calculation unit 900, the activation output on-chip memory unit 1000, the pooling unit 1100, and the output re-input unit 1200. The structure of FIG. 9 omits the input unit 100, the external input path, and the external output path.

The activation function calculation unit 900 may receive the output of the output buffer unit 800 and determine the active or inactive state of that output from an activation function. The output of the output buffer unit 800 may be connected to the input of the activation function calculation unit 900. The activation function calculation unit 900 may be an ALU. The activation function of the activation function calculation unit 900 may be input or changed by the API program unit 1300.

The activation output on-chip memory unit 1000 may store the output of the activation function calculation unit 900. The activation output on-chip memory unit 1000 may be large enough for the pooling unit 1100 to perform a pooling operation. The activation output on-chip memory unit 1000 may be programmed by the API program unit 1300.

The pooling unit 1100 may pool the data stored in the output on-chip memory unit. The pooling unit 1100 may receive the output of the output on-chip memory unit as its input. The output of the pooling unit 1100 may be the output feature map of the row-wise operation neural processor.

In the pipeline, the outputs and inputs of the adder tree unit 700, the output buffer unit 800, the activation function calculation unit 900, the output on-chip memory unit, and the pooling unit 1100 may be connected sequentially in a point-to-point (P2P) manner to form a single data transmission path. For example, in the pipeline, the output of the adder tree unit 700 may be connected to the input of the output buffer unit 800, the output of the output buffer unit 800 to the input of the activation function calculation unit 900, the output of the activation function calculation unit 900 to the input of the output on-chip memory unit, and the output of the output on-chip memory unit to the input of the pooling unit 1100. In addition, the output of the pooling unit 1100 may be connected to the input of the output re-input unit 1200, and the output of the output re-input unit 1200 may be connected to the feature map on-chip memory unit 200. Accordingly, in the pipeline, the convolution calculation part, the activation function calculation part, and the pooling calculation part may form a single data path. However, the activation function calculation unit 900 and the pooling unit 1100 may be bypassed as needed or under control of the API program unit 1300.
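The chaining of stages and the optional bypass can be illustrated with a simple functional sketch. It is a software analogy only: `run_pipeline` and `relu` are hypothetical names, the activation is assumed to be a ReLU, and the pooling is assumed to be a one-dimensional maximum with stride 1, none of which is specified by the description above.

```python
def relu(value):
    """Example activation function (an assumption of this sketch)."""
    return max(0, value)


def run_pipeline(pixels, activation=None, pool_window=None):
    """Pass output pixels through buffer -> activation -> memory -> pooling.

    Passing None for `activation` or `pool_window` bypasses that stage.
    """
    buffered = list(pixels)                           # output buffer stage
    if activation is not None:
        buffered = [activation(p) for p in buffered]  # activation stage
    stored = buffered                                 # activation output memory
    if pool_window is None:
        return stored                                 # pooling bypassed
    last_start = len(stored) - pool_window
    # 1-D max pooling with stride 1 (assumed pooling type).
    return [max(stored[i:i + pool_window]) for i in range(last_start + 1)]
```

Because each stage consumes only the output of the stage before it, the stages can be connected point to point without any shared bus between them.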

Because the pipeline is connected in a point-to-point (P2P) manner, overhead such as communication contention between components may be prevented in advance. The convolution, the activation function, and the pooling may be pipelined every cycle to produce one output pixel from the input data.

For example, referring to FIG. 19, suppose that, as in (a), the input feature map raw data has a size of 3x3, the filter weight raw data has a size of 2x2, the channel size equals the width of a memory row, and the pooling unit 1100 has a size of 2x2 with a stride of 1. In that case, the components of the row-wise operation neural processor may be pipelined through the vector operation APIs as in the schedule diagram of (b). Each row of the schedule diagram represents a component of the row-wise operation neural processor that executes in association with a vector operation API, and each column represents the passage of time. All operations in the schedule diagram may be fully pipelined. After the latency of the first initialization, the pipeline may produce one output pixel every four clock cycles. The activation function may operate in the cycle after an output pixel is produced. The pooling unit 1100 may wait until the operations for the first row of the output feature map are complete. In the schedule diagram of (b), BLK denotes a block, Cv the convolution calculation unit 600, B&A the activation function unit, and P&W the pooling unit 1100, and the number in square brackets denotes the line number within the corresponding block.

The API program unit 1300 may support a vector operation API that defines and controls the processing method and order of the components of the row-wise operation neural processor. The vector operation API may control the row-wise operation neural processor by dividing it into three categories. The three categories may be a first category related to the convolution operation, a second category related to the activation function, and a third category related to pooling. The three categories may operate in synchronization with the same clock. For example, the first category may operate in synchronization with a first clock, and the components of the second and third categories may operate in synchronization with a second clock.

For example, referring to FIG. 16, the first category may operate in synchronization with the first clock (CLK), as in Nos. 1 to 9 of FIG. 16, and may include the feature map on-chip memory unit 200, the filter weight on-chip memory unit 300, the feature map buffer unit 400, the filter weight buffer unit 500, the convolution calculation unit 600, and the adder tree unit 700.

The second category and the third category may operate in synchronization with the second clock (pxl_CLK), as in No. 10 and No. 11 of FIG. 16. The second category may include the activation function calculation unit 900 and the activation output on-chip memory unit 1000. The third category may include the activation output on-chip memory unit 1000 and the pooling unit 1100.

The vector operation API may use the addresses of the feature map on-chip memory unit 200 and the filter weight on-chip memory unit 300. The vector operation API may also use the address of the activation output on-chip memory unit 1000. Accordingly, the API program unit 1300 may control the data layout of the first, second, and third categories. The API program unit 1300 may also control the data connection configuration of the first, second, and third categories.

The vector operation API may denote the index of a hardware element by "#". For example, referring to FIG. 16, the "#" in fmem# of No. 1 may denote the number of one of the plurality of feature map on-chip memory units 200. Numbers may be assigned to the components of the row-wise operation neural processor in this way. Accordingly, the vector operation API may keep programming straightforward even when the hardware of the row-wise operation neural processor is extended.

The API program unit 1300 may define and control the method and order of processing for the external memory 1400 and of communication with the external memory 1400. For example, referring to FIG. 17, the external memory 1400 is a DRAM (Dynamic Random Access Memory), the connection to the DRAM is made by DMA (Direct Memory Access), and the vector operation API may control the DRAM and the DMA.

The API program unit 1300 may set the activation function type or the bias of the activation function unit. The API program unit 1300 may execute the vector operation API in parallel. For example, the vector operation API is organized into blocks, and the instructions within a block may be executed at the same time.

For example, referring to FIG. 18 for the operation of the API program unit 1300, in the initialization step the row-wise operation neural processor may copy the input feature map and the filter weights from the external DRAM to the feature map on-chip memory unit 200 and the filter weight on-chip memory unit 300. Statements on either side of a "||" mark may be instructions that are executed at the same time. The for loop of the convolution block contains four statements and therefore takes four cycles per iteration, but it may be pipelined across the ZY plane operations. The activation function block operates once every pxl_CLK, the time taken to compute one output pixel. pxl_CLK may be calculated by summing nR, the number of operations needed to compute one ZY plane of the weight filter, over the number of ZY planes. To control the pooling unit 1100, it may first be set to wait until the output pixels it needs have been generated, that is, until delayHeight output rows have been generated. When the width of the output feature map is outW, the pooling unit 1100 may be made to wait for delayHeight, according to the formula for delayHeight, before it pools and writes to the input feature map raw data, and may then be made to operate once every delayWidth periods of pxl_CLK so that the final output pixel is written to the feature map on-chip memory unit 200.
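The pxl_CLK calculation can be sketched by combining Equations 1 and 2. Several points here are assumptions made for illustration: the function name `pixel_clock` is hypothetical, a K x K filter is assumed to consist of K ZY planes taken at consecutive X positions, and the row count of Equation 2 is rounded up.

```python
def pixel_clock(C, fH, K, W, x, y, num_planes=None):
    """Cycles needed for one output pixel: nR summed over all ZY planes."""
    if num_planes is None:
        num_planes = K                     # assumption: K ZY planes per filter
    total = 0
    for i in range(num_planes):
        sA = C * fH * (x + i) + C * y      # Equation 1 for the i-th plane
        d = sA % W                         # start offset within the row
        total += -(-(C * K + d) // W)      # Equation 2, rounded up (assumption)
    return total
```

With a 2x2 filter and a channel size equal to the row width, each ZY plane takes two rows and the total is four cycles, which agrees with the interval of four clock cycles per output pixel stated for the example of FIG. 19.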

The output re-input unit 1200 may re-input the output of the pooling unit 1100 to the feature map on-chip memory unit 200, and may include a double buffer so that storing the output data and processing the re-input data are performed at the same time. In this way, the output feature map that is the output of the convolution layer or the pooling unit 1100 at a first point in time may be used as the input of the convolution layer at a second point in time following the first. The row-wise operation neural processor may include a plurality of feature map on-chip memory units 200 for storing the re-input data supplied by the output re-input unit 1200.
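A double buffer of this kind can be sketched as two banks whose roles are swapped between layers. The class name `DoubleBuffer` and its methods are hypothetical; the sketch only illustrates that writes for the next layer never touch the bank being read by the current layer.

```python
class DoubleBuffer:
    """Two banks: one is read by the current layer while the other
    collects the output that becomes the next layer's input."""

    def __init__(self, initial=()):
        self.banks = [list(initial), []]
        self.read_bank = 0

    def write(self, value):
        """Store an output value in the bank that is not being read."""
        self.banks[1 - self.read_bank].append(value)

    def read_all(self):
        """Return a copy of the data the current layer is processing."""
        return list(self.banks[self.read_bank])

    def swap(self):
        """Make the freshly written bank readable and clear the other one."""
        self.read_bank = 1 - self.read_bank
        self.banks[1 - self.read_bank] = []
```

A layer boundary then amounts to a single `swap()` call, after which the previous layer's output is read as the new input feature map.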

The external memory 1400 may be connected outside the chip to provide additional space for storing the input feature map raw data and the filter weight raw data. The row-wise operation neural processor may include an external memory 1400 connection module that provides a connection to the external memory 1400. For example, the external memory 1400 may be a DRAM (Dynamic Random Access Memory), and the external memory 1400 connection module may provide a DMA (Direct Memory Access) function for connecting to the DRAM. The external memory 1400 connection module may perform accesses to the external memory 1400 in parallel with internal computation. Accordingly, the access overhead of the external memory 1400 may be reduced.

The external memory 1400 may be connected to the output re-input unit 1200. The external memory 1400 may include a double buffer or a triple buffer.
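The benefit of performing external memory accesses in parallel with internal computation can be illustrated with a simple timing model. This is a sketch under stated assumptions: the per-tile load and compute times are made-up numbers, and the model assumes that the load of tile i+1 fully overlaps the computation of tile i, as a double buffer would allow.

```python
def sequential_time(load, compute):
    """Total time when every load finishes before its computation starts."""
    return sum(load) + sum(compute)


def overlapped_time(load, compute):
    """Total time when the next tile is loaded while the current one is computed."""
    if len(load) != len(compute):
        raise ValueError("load and compute must have the same length")
    if not load:
        return 0
    total = load[0]  # the first load cannot be hidden
    for i in range(len(compute)):
        next_load = load[i + 1] if i + 1 < len(load) else 0
        total += max(compute[i], next_load)
    return total


loads = [2, 2, 2]       # hypothetical DMA times per tile
computes = [3, 3, 3]    # hypothetical compute times per tile
print(sequential_time(loads, computes), overlapped_time(loads, computes))
```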

Referring to FIG. 11, for example, the row-wise operation neural processor may have a structure that includes the external memory 1400. It may also include a connection module that supports DMA for connecting to the external memory 1400.

When the external memory 1400 is included, the size of the filter weight on-chip memory 300 may be equal to or larger than the size of the largest weight filter required by any convolution layer of the CNN algorithm. The size of the feature map on-chip memory 200 may be equal to or larger than the size of the largest input feature map raw data among all layers of the CNN algorithm.
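The sizing rule above can be checked with a short calculation. The sketch below is illustrative only: the layer dictionaries and their keys are hypothetical, sizes are counted in elements rather than bytes, and a single weight filter is assumed to hold channels × kernel height × kernel width elements.

```python
def required_on_chip_sizes(layers):
    """Return (filter weight memory, feature map memory) lower bounds in elements."""
    weight_elems = max(l["c_in"] * l["k_h"] * l["k_w"] for l in layers)
    fmap_elems = max(l["c_in"] * l["h"] * l["w"] for l in layers)
    return weight_elems, fmap_elems


# Hypothetical two-layer CNN used only to exercise the function.
layers = [
    {"c_in": 8, "h": 32, "w": 32, "k_h": 3, "k_w": 3},
    {"c_in": 16, "h": 16, "w": 16, "k_h": 3, "k_w": 3},
]
print(required_on_chip_sizes(layers))
```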

FIG. 3 is a flowchart illustrating a row-wise operation data processing method according to an embodiment of the present invention.

The row-wise operation data processing method according to this embodiment uses the row-wise operation neural processor of FIGS. 1 and 2. Accordingly, components identical to those of the row-wise operation neural processor of FIGS. 1 and 2 are given the same reference numerals, and repeated descriptions are omitted.

Referring to FIG. 3, the row-wise operation data processing method according to an embodiment of the present invention includes generating input data (S100), storing input pixels in adjacent rows on a channel basis (S200), storing filter pixels in adjacent rows on a channel basis (S300), generating input feature map row data (S400), generating filter weight row data (S500), generating convolution data (S600), calculating a partial sum (S700), and outputting the partial sum (S800).

The row-wise operation data processing method may be performed in a system that processes data using a row-wise operation neural processor.

In the step of generating input data (S100), the input unit may receive data and generate the input data. The input data may be any of various kinds of data that require a convolution operation. For example, the input data may be image data, such as a two-dimensional image of width W and height H.

In the step of storing the input pixels in adjacent rows on a channel basis (S200), the feature map on-chip memory unit may store the input pixels of the input feature map raw data in adjacent rows on a channel basis. Here, the channel size of the input feature map raw data may be a power of two. The memory width of the feature map on-chip memory unit is a power of two and may be equal to the channel size of the input feature map raw data, or a divisor or multiple of it.
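The effect of the power-of-two constraint can be sketched as follows: when the channel size and the memory width are both powers of two, one always divides the other, so pixels pack into memory rows without straddling a row boundary unevenly. The function names are hypothetical and the sketch counts elements, not bytes.

```python
def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def rows_needed(num_pixels, channels, mem_width):
    """Memory rows needed to store num_pixels pixels of `channels` elements each."""
    if not (is_power_of_two(channels) and is_power_of_two(mem_width)):
        raise ValueError("channel size and memory width must be powers of two")
    total = num_pixels * channels
    return -(-total // mem_width)  # ceiling division


print(rows_needed(9, 16, 16))  # width equals the channel size
print(rows_needed(9, 16, 64))  # width is a multiple of the channel size
print(rows_needed(9, 16, 8))   # width is a divisor of the channel size
```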

In the step of storing the filter pixels in adjacent rows on a channel basis (S300), the filter weight on-chip memory unit may store the filter pixels of the filter weight raw data in adjacent rows on a channel basis.

In the step of generating the input feature map row data (S400), the feature map buffer unit may generate the input feature map row data by storing, row by row, the data stored in the feature map on-chip memory unit.

In the step of generating the filter weight row data (S500), the filter weight buffer unit may generate the filter weight row data by storing, row by row, the data stored in the filter weight on-chip memory unit.

In the step of generating the convolution data (S600), the convolution calculation unit may generate the convolution data by multiplying the input feature map row data and the filter weight row data element by element.

In the step of calculating the partial sum (S700), the addition tree unit may calculate a partial sum from the convolution data.

In the step of outputting the partial sum (S800), the output buffer unit may receive and store the calculated partial sum through a pipeline that forms a data path connected from the addition tree unit, and may output the stored partial sum.
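Steps S600 to S800 can be sketched in software as an element-wise multiplication, a pairwise reduction that mimics an addition tree, and an accumulator standing in for the output buffer. This is a minimal functional model with hypothetical names; it does not model the hardware timing or the pipeline.

```python
def elementwise_multiply(fmap_row, weight_row):
    """S600: multiply feature map row data and filter weight row data element by element."""
    if len(fmap_row) != len(weight_row):
        raise ValueError("rows must have the same length")
    return [a * b for a, b in zip(fmap_row, weight_row)]


def adder_tree(values):
    """S700: reduce the products pairwise, level by level, as an addition tree would."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd levels so every adder has two inputs
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def output_pixel(fmap_rows, weight_rows):
    """S800: accumulate the partial sums of all rows into one output value."""
    acc = 0
    for fmap_row, weight_row in zip(fmap_rows, weight_rows):
        acc += adder_tree(elementwise_multiply(fmap_row, weight_row))
    return acc


print(output_pixel([[1, 2], [3, 4]], [[1, 0], [0, 1]]))
```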

In the pipeline, the outputs and inputs of the addition tree unit, the output buffer unit, the activation function calculation unit, the output on-chip memory unit, and the pooling unit may be connected sequentially in a point-to-point (P2P) manner to form a single data transfer path.

The step of storing the input pixels in adjacent rows on a channel basis (S200), the step of storing the filter pixels in adjacent rows on a channel basis (S300), the step of generating the input feature map row data (S400), and the step of generating the filter weight row data (S500) may be performed in parallel.

The steps of FIG. 3 are substantially the same as the operations of the row-wise operation neural processor of FIGS. 1 and 2, so repeated descriptions of the same names and operations are omitted.

FIG. 4 is a flowchart illustrating a row-wise operation data processing method according to an embodiment of the present invention.

The row-wise operation data processing method according to this embodiment is substantially the same as the method of FIG. 3, except for the steps of determining the active and inactive states of the output (S900), storing the output of the activation function calculation unit (S1000), pooling the data stored in the output on-chip memory unit (S1100), and re-inputting the output of the pooling unit into the feature map on-chip memory unit (S1200). Accordingly, components identical to those of the method of FIG. 3 are given the same reference numerals, and repeated descriptions are omitted.

In the step of determining the active and inactive states of the output (S900), the activation function calculation unit may receive the output of the output buffer unit and determine the active and inactive states of the output using an activation function.

In the step of storing the output of the activation function calculation unit (S1000), the output on-chip memory unit may store the output of the activation function calculation unit.

In the step of pooling the data stored in the output on-chip memory unit (S1100), the pooling unit may pool the data stored in the output on-chip memory unit.
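Steps S900 to S1100 can be sketched as an activation followed by pooling. The text does not name a specific activation function or pooling scheme, so the sketch below assumes ReLU and 2×2 max pooling with stride 2 purely for illustration; the function names are hypothetical.

```python
def relu(x):
    """Assumed activation: negative values are inactive (0), others pass through."""
    return x if x > 0 else 0


def max_pool_2x2(fmap):
    """Assumed pooling: 2x2 windows with stride 2 over a 2-D list."""
    pooled = []
    for r in range(0, len(fmap) - 1, 2):
        row = []
        for c in range(0, len(fmap[r]) - 1, 2):
            window = (fmap[r][c], fmap[r][c + 1],
                      fmap[r + 1][c], fmap[r + 1][c + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled


def activate_then_pool(partial_sums):
    """S900 then S1100: apply the activation, then pool the activated map."""
    activated = [[relu(v) for v in row] for row in partial_sums]
    return max_pool_2x2(activated)


print(activate_then_pool([[1, -2, 3, 0], [4, 5, -6, 7]]))
```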

In the step of re-inputting the output of the pooling unit into the feature map on-chip memory unit (S1200), the output re-input unit may re-input the output of the pooling unit into the feature map on-chip memory unit.

In the re-inputting step, the double buffer of the output re-input unit may simultaneously store the output of the pooling unit and process the re-input data.

The steps of FIG. 4 are substantially the same as the operations of the row-wise operation neural processor of FIGS. 1 and 2, so repeated descriptions of the same names, operations, and effects are omitted.

FIG. 5 is a flowchart illustrating the step of generating input data in a row-wise operation data processing method according to an embodiment of the present invention.

The row-wise operation data processing method according to this embodiment is substantially the same as the methods of FIGS. 3 and 4, except for the step of defining and controlling the processing method and order of the components of the row-wise operation neural processor using a vector operation API (S101) and the step in which an external memory stores the input feature map raw data and the filter weight raw data (S110). Accordingly, components identical to those of the methods of FIGS. 3 and 4 are given the same reference numerals, and repeated descriptions are omitted.

In the step of defining and controlling the processing method and order of the components of the row-wise operation neural processor using the vector operation API (S101), the API program unit may use the vector operation API to define and control the processing method and order of the components of the row-wise operation neural processor.

In the step in which the external memory stores the input feature map raw data and the filter weight raw data (S110), the external memory may store the input feature map raw data and the filter weight raw data.

The step in which the external memory stores the input feature map raw data and the filter weight raw data (S110) may include a step (not shown) in which an external memory connection module of the row-wise operation neural processor connects to the external memory. For example, in the step of connecting to the external memory, an external memory that is a DRAM (Dynamic Random Access Memory) may be connected to the external memory connection module, which provides a DMA (Direct Memory Access) function. The external memory connection module may perform accesses to the external memory in parallel with internal computation. Accordingly, the access overhead of the external memory may be reduced.

The step in which the external memory stores the input feature map raw data and the filter weight raw data (S110) may be performed using a double buffer or a triple buffer of the external memory.

The steps of FIG. 5 are substantially the same as the operations of the row-wise operation neural processor of FIGS. 1 and 2, so repeated descriptions of the same names, operations, and effects are omitted.

FIG. 6 is a flowchart illustrating the step of storing the filter weight raw data in the order of channel number, kernel height, and kernel width in a row-wise operation data processing method according to an embodiment of the present invention.

The row-wise operation data processing method according to this embodiment is substantially the same as the methods of FIGS. 3 to 5, except for the step of performing a zero padding function (S310) that fills, with zeros up to the end of the row, the remaining portion of the last row storing the filter pixels in which no valid data is stored. Accordingly, components identical to those of the methods of FIGS. 3 to 5 are given the same reference numerals, and repeated descriptions are omitted.

In the step of performing the zero padding function (S310), the filter weight on-chip memory unit may fill, with zeros up to the end of the row, the remaining portion of the last row storing the filter pixels in which no valid data is stored.
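The zero padding of the last filter row can be sketched as follows. The function name is hypothetical, and the input is assumed to be a flat list of filter weights already arranged in the storage order described for FIG. 6.

```python
def pack_filter_rows(weights, row_width):
    """Split a flat weight list into memory rows, zero padding the last row."""
    if row_width <= 0:
        raise ValueError("row_width must be positive")
    rows = []
    for start in range(0, len(weights), row_width):
        row = list(weights[start:start + row_width])
        row.extend([0] * (row_width - len(row)))  # pad only where data is missing
        rows.append(row)
    return rows


# Nine weights in rows of width 4: the last row holds one weight and three zeros.
print(pack_filter_rows(list(range(1, 10)), 4))
```

Because the padded positions are zero, they contribute nothing to the element-wise products, so the partial sum is unaffected.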

The steps of FIG. 6 are substantially the same as the operations of the row-wise operation neural processor of FIGS. 1 and 2, so repeated descriptions of the same names, operations, and effects are omitted.

FIG. 7 is a flowchart illustrating the step of generating the input feature map row data (S400) in a row-wise operation data processing method according to an embodiment of the present invention.

The row-wise operation data processing method according to this embodiment is substantially the same as the methods of FIGS. 3 to 6, except for the step of reusing the input feature map row data (S410). Accordingly, components identical to those of the methods of FIGS. 3 to 6 are given the same reference numerals, and repeated descriptions are omitted.

In the step of reusing the input feature map row data (S410), the feature map buffer unit may reuse the input feature map row data. Referring to FIG. 10, for example, a plurality of filter weight on-chip memory units and filter weight buffer units may be provided, and the input feature map row data may be applied to the plurality of filter weight row data. Accordingly, one set of input feature map row data is applied to multiple sets of filter weight row data and input to the convolution calculation unit, which achieves parallelization at the weight filter level. Whether the input feature map row data is reused may be programmed through the API program unit.
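Weight-filter-level parallelism can be sketched as one feature map row being applied to several filter rows, producing one partial sum per filter. The function names are hypothetical, and the loop runs sequentially here, whereas the hardware described above would evaluate the filters in parallel.

```python
def dot(a, b):
    """Element-wise multiply followed by a sum (one partial sum)."""
    return sum(x * y for x, y in zip(a, b))


def broadcast_row(fmap_row, filter_rows):
    """Apply one input feature map row to every filter weight row."""
    return [dot(fmap_row, weight_row) for weight_row in filter_rows]


# One feature map row reused for three hypothetical filters.
print(broadcast_row([1, 2, 3], [[1, 0, 0], [0, 1, 0], [1, 1, 1]]))
```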

The steps of FIG. 7 are substantially the same as the operations of the row-wise operation neural processor of FIGS. 1 and 2, so repeated descriptions of the same names, operations, and effects are omitted.

FIG. 8 is a flowchart illustrating the step of generating the filter weight row data (S500) in a row-wise operation data processing method according to an embodiment of the present invention.

The row-wise operation data processing method according to this embodiment is substantially the same as the methods of FIGS. 3 to 7, except for the steps of storing one row of the filter weight on-chip memory in a cyclic shift buffer (S510), shifting by the offset of the start address (S520), performing a zero padding function that fills the space vacated by the shift with zeros (S530), initializing all or part of the filter weight buffer unit to zero (S540), and reusing the filter weight row data (S550). Accordingly, components identical to those of the methods of FIGS. 3 to 7 are given the same reference numerals, and repeated descriptions are omitted.

In the step of storing one row of the filter weight on-chip memory in the cyclic shift buffer (S510), the cyclic shift buffer of the filter weight buffer unit may store one row of the filter weight on-chip memory. The cyclic shift buffer of the filter weight buffer unit may be twice as wide as a row of the filter weight on-chip memory unit.

In the step of shifting by the offset of the start address (S520), the cyclic shift buffer of the filter weight buffer unit may shift the data it stores by the offset of the start address of the input feature map row data.

In the step of performing the zero padding function that fills the vacated space with zeros (S530), the cyclic shift buffer of the filter weight buffer unit may fill with zeros the space from which the filter weight row data was shifted.

In the step of initializing all or part of the cyclic shift buffer to zero (S540), the cyclic shift buffer of the filter weight buffer unit may initialize all or part of the filter weight buffer unit to zero after shifting the last row of the filter weight row data.
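Steps S510 to S540 can be sketched with a small buffer model. This is one possible reading of the description, not the circuit itself: the class and method names are hypothetical, the buffer is assumed to be twice the row width, and the shift offset is assumed not to exceed the row width so that no weight leaves the buffer.

```python
class CyclicShiftBuffer:
    """Software model of a filter weight buffer twice as wide as a memory row."""

    def __init__(self, row_width):
        self.row_width = row_width
        self.cells = [0] * (2 * row_width)

    def load(self, row):
        # S510: store one row of the filter weight on-chip memory.
        if len(row) != self.row_width:
            raise ValueError("row must match the memory row width")
        self.cells = list(row) + [0] * self.row_width

    def shift(self, offset):
        # S520 + S530: move the weights by the start-address offset and
        # fill the vacated cells with zeros.
        if not 0 <= offset <= self.row_width:
            raise ValueError("offset must be between 0 and the row width")
        keep = len(self.cells) - offset
        self.cells = [0] * offset + self.cells[:keep]

    def clear(self):
        # S540: reset the buffer after the last row has been shifted.
        self.cells = [0] * (2 * self.row_width)


buf = CyclicShiftBuffer(row_width=4)
buf.load([1, 2, 3, 4])
buf.shift(2)
print(buf.cells)
```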

In the step of reusing the filter weight row data (S550), the filter weight buffer unit may reuse the filter weight row data. For example, when the row width of the filter weight on-chip memory unit is larger than the channel size, a filter weight buffer unit of a given size may be applied repeatedly to the input feature map row data and thereby reused. When the filter weight row data is reused, moving data from the filter weight on-chip memory unit to the filter weight buffer unit, or shifting data in the shift buffer inside the filter weight buffer unit, may be omitted. Whether the filter weight row data is reused may be programmed through the API program unit.

For example, the filter weight row data stored in the filter weight buffer unit may be used to perform convolution operations on multiple sets of input feature map row data without repeatedly accessing the filter weight on-chip memory unit.
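The saving from filter weight reuse can be illustrated by counting on-chip memory reads. The sketch below uses hypothetical class and function names; it reads a weight row once into a local buffer and then applies it to several feature map rows.

```python
class FilterWeightMemory:
    """Stand-in for the filter weight on-chip memory that counts row reads."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def read_row(self, index):
        self.reads += 1
        return list(self.rows[index])


def convolve_with_reuse(memory, row_index, fmap_rows):
    """Read one weight row once, then reuse it for every feature map row."""
    weight_row = memory.read_row(row_index)  # the only on-chip memory access
    return [sum(f * w for f, w in zip(fmap_row, weight_row))
            for fmap_row in fmap_rows]


memory = FilterWeightMemory([[1, 2], [3, 4]])
sums = convolve_with_reuse(memory, 0, [[1, 1], [2, 0], [0, 3]])
print(sums, memory.reads)
```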

The steps of FIG. 8 are substantially the same as the operations of the row-wise operation neural processor of FIGS. 1 and 2, so repeated descriptions of the same names, operations, and effects are omitted.

The row-wise operation data processing method and the neural processor using it may also be extended to accelerate fully connected layers and RNNs (Recursive Neural Networks).

Although the invention has been described above with reference to embodiments, those skilled in the art will understand that the present invention may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

100: input unit
200: feature map on-chip memory unit
300: filter weight on-chip memory unit
400: feature map buffer unit
500: filter weight buffer unit
600: convolution calculation unit
700: addition tree unit
800: output buffer unit

Claims

A row-wise operation neural processor comprising:
an input unit for inputting data;
a feature map on-chip memory unit for storing input pixels of input feature map raw data for the data input through the input unit in adjacent rows on a channel basis;
a filter weight on-chip memory unit for storing filter pixels of filter weight raw data for the input data in adjacent rows on a channel basis;
a feature map buffer unit for generating input feature map row data by storing, row by row, the data stored in the feature map on-chip memory unit;
a filter weight buffer unit for generating filter weight row data by storing, row by row, the data stored in the filter weight on-chip memory unit;
a convolution calculation unit for generating convolution data by multiplying the input feature map row data and the filter weight row data element by element;
an addition tree unit for calculating a partial sum from the convolution data; and
an output buffer unit for receiving and storing the calculated partial sum through a pipeline forming a data path connected from the addition tree unit, and for outputting the stored partial sum.

The row-wise operation neural processor of claim 1, further comprising:
an activation function calculation unit for receiving the output of the output buffer unit and determining active and inactive states of the output using an activation function;
an output on-chip memory unit for storing the output of the activation function calculation unit; and
a pooling unit for pooling the data stored in the output on-chip memory unit.

The row-wise operation neural processor of claim 2,
wherein, in the pipeline, the outputs and inputs of the addition tree unit, the output buffer unit, the activation function calculation unit, the output on-chip memory unit, and the pooling unit are connected sequentially in a point-to-point (P2P) manner to form a single data transfer path.

The row-wise operation neural processor of claim 2,
further comprising an API program unit supporting a vector operation API (Application Programming Interface) for defining and controlling the processing method and order of the components of the row-wise operation neural processor.

The row-wise operation neural processor of claim 2,
further comprising an output re-input unit that re-inputs the output of the pooling unit into the feature map on-chip memory unit and includes a double buffer so that storing the output data and processing the re-input data are performed simultaneously.

The row-wise operation neural processor of claim 1,
wherein the feature map buffer unit reuses the input feature map row data, and
the filter weight buffer unit reuses the filter weight row data.

The row-wise operation neural processor of claim 1,
wherein the channel size of the input feature map raw data is a power of two,
the memory width of the feature map on-chip memory unit is a power of two and is equal to the channel size of the input feature map raw data, or is a divisor or multiple of the channel size, and
the filter weight on-chip memory unit performs a zero padding function that fills, with zeros up to the end of the row, the remaining portion of the last row storing the filter pixels in which no valid data is stored.

The row-wise operation neural processor of claim 1,
wherein the filter weight buffer unit is a shift buffer.

The row-wise operation neural processor of claim 1,
wherein the filter weight buffer unit is a cyclic shift buffer whose size is a multiple of the row width of the filter weight on-chip memory unit, and
the cyclic shift buffer shifts the filter weight row data by the offset of the start address of the input feature map row data, performs a zero padding function that fills with zeros the space from which the filter weight row data was shifted, and initializes all or part of the cyclic shift buffer to zero after shifting the last row of the filter weight row data.

The row-wise operation neural processor of claim 1,
further comprising an external memory connected outside the chip to provide additional space for storing the input feature map raw data and the filter weight raw data.

A row-wise operation data processing method performed in a system that processes data using a row-wise operation neural processor, the method comprising:
generating, by an input unit, input data by receiving data;
storing, by a feature map on-chip memory unit, input pixels of input feature map raw data in adjacent rows on a channel basis;
storing, by a filter weight on-chip memory unit, filter pixels of filter weight raw data in adjacent rows on a channel basis;
generating, by a feature map buffer unit, input feature map row data by storing, row by row, the data stored in the feature map on-chip memory unit;
generating, by a filter weight buffer unit, filter weight row data by storing, row by row, the data stored in the filter weight on-chip memory unit;
generating, by a convolution calculation unit, convolution data by multiplying the input feature map row data and the filter weight row data element by element;
calculating, by an addition tree unit, a partial sum from the convolution data; and
receiving and storing, by an output buffer unit, the calculated partial sum through a pipeline forming a data path connected from the addition tree unit, and outputting the stored partial sum,
wherein the storing of the input pixels, the storing of the filter pixels, the generating of the input feature map row data, and the generating of the filter weight row data are performed in parallel.

The method of claim 11, further comprising:
receiving, by an activation function calculation unit, the output of the output buffer unit and determining active and inactive states of the output using an activation function;
storing, by an output on-chip memory unit, the output of the activation function calculation unit; and
pooling, by a pooling unit, the data stored in the output on-chip memory unit.

The method of claim 12,
wherein, in the pipeline, the outputs and inputs of the addition tree unit, the output buffer unit, the activation function calculation unit, the output on-chip memory unit, and the pooling unit are connected sequentially in a point-to-point (P2P) manner to form a single data transfer path.

The method of claim 12,
further comprising defining and controlling, by an API program unit using a vector operation API (Application Programming Interface), the processing method and order of the components of the row-wise operation neural processor.

The method of claim 12,
further comprising re-inputting, by an output re-input unit, the output of the pooling unit into the feature map on-chip memory unit,
wherein, in the re-inputting, a double buffer of the output re-input unit simultaneously stores the output of the pooling unit and processes the re-input data.

The method of claim 11,
wherein the generating of the input feature map row data comprises reusing, by the feature map buffer unit, the input feature map row data, and
the generating of the filter weight row data comprises reusing, by the filter weight buffer unit, the filter weight row data.

The method of claim 11,
wherein the channel size of the input feature map raw data is a power of two,
the memory width of the feature map on-chip memory unit is a power of two and is equal to the channel size of the input feature map raw data, or is a divisor or multiple of the channel size, and
the storing of the filter pixels of the filter weight raw data in adjacent rows on a channel basis comprises performing, by the filter weight on-chip memory unit, a zero padding function that fills, with zeros up to the end of the row, the remaining portion of the last row storing the filter pixels in which no valid data is stored.

The method of claim 11,
wherein the filter weight row data is shifted by a shift buffer by the offset of the start address of the input feature map row data so that the input feature map row data is aligned with the output position.

The method of claim 11,
wherein the generating of the filter weight row data comprises, by a cyclic shift buffer of the filter weight buffer unit:
storing one row of the filter weight on-chip memory in the cyclic shift buffer;
shifting the data stored in the filter weight buffer unit by the offset of the start address of the input feature map row data;
performing a zero padding function that fills with zeros the space from which the filter weight row data was shifted; and
initializing all or part of the cyclic shift buffer to zero after shifting the last row of the filter weight row data, wherein the cyclic shift buffer of the filter weight buffer unit has a size that is a multiple of the row width of the filter weight on-chip memory unit.

The method of claim 11,
further comprising storing, by an external memory, the input feature map raw data and the filter weight raw data.