KR20190005043A

KR20190005043A - SIMD MAC unit with improved computation speed, Method for operation thereof, and Apparatus for Convolutional Neural Networks accelerator using the SIMD MAC array

Info

Publication number: KR20190005043A
Application number: KR1020170085579A
Authority: KR
Inventors: 이종은
Original assignee: 울산과학기술원
Priority date: 2017-07-05
Filing date: 2017-07-05
Publication date: 2019-01-15
Also published as: KR101981109B1

Abstract

The present invention relates to a SIMD MAC unit with improved computation speed, an operation method thereof, and a convolutional neural network accelerator using an array of SIMD MAC units. The SIMD MAC unit comprises: a plurality of DSPs, each including a multiplier that performs a multiplication operation on operands including a multiplier operand and multiplicands; a counter that obtains a count value by counting overflows generated as a result of the addition operation on the multiplication results produced in the DSPs; a sub accumulator that accumulates the multiplier operands input to the DSPs to obtain a cumulative multiplier value; and a main accumulator that corrects the addition result based on the count value and the cumulative multiplier value.

Description

SIMD MAC unit with improved computation speed, operation method thereof, and convolutional neural network accelerator using an array of SIMD MAC units

The present invention relates to a SIMD (Single Instruction Multiple Data) MAC (Multiply and Accumulate) unit with improved computation speed, an operation method thereof, and a convolutional neural network (CNN) accelerator using an array of SIMD MAC units. More specifically, digital signal processors (DSPs) used for multiply and accumulate operations in a field-programmable gate array (FPGA) are configured as units of a SIMD MAC array, and a plurality of multiply and accumulate operations are performed simultaneously through data packing and data correction, thereby improving computation speed.

A digital signal processor (DSP) is one of the processors mounted in electronic devices for data processing. A DSP is a processor for performing a specific digital signal processing function and is designed to compute efficiently by exploiting the characteristics of that function. Digital signal processing operations performed on a DSP generally involve repeatedly performing a large number of sequential operations.

An operation considered particularly important in digital signal processing is the multiply and accumulate (MAC) operation. Most DSPs include a multiplier, an adder, and an accumulator in order to perform MAC operations efficiently. A MAC operation is performed by multiplying two operands using the multiplier, adding the product to the value stored in the accumulator using the adder, and storing the sum back in the accumulator. Implementing a SIMD multiplier on such a conventional DSP is known to be very difficult.

Meanwhile, convolutional neural networks can be used to extract features from complex input data.

FIG. 1 is a diagram illustrating a floating-point MAC array of a conventional convolutional neural network accelerator.

The conventional convolutional neural network accelerator 100 shown in FIG. 1 arranges multipliers and adders in a two-dimensional array to perform multiplication and addition operations, with the adders arranged in a tree structure.

Because this configuration generates Tm outputs for Tn inputs, the buffer size required for the input/output feature maps is maximized, so the configuration cannot produce optimal output when operating within the on-chip memory of a single FPGA.

"Optimizing FPGA-based accelerator design for deep convolutional neural networks", Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '15, pages 161-170, New York, NY, USA, 2015. ACM.

An object of the present invention is to provide a SIMD MAC unit whose computation speed is improved by packing MAC operations into a single DSP block of an off-the-shelf FPGA, an operation method thereof, and a convolutional neural network accelerator using an array of SIMD MAC units.

A SIMD MAC unit with improved computation speed according to an embodiment of the present invention may include: a plurality of DSPs, each including a multiplier that performs a multiplication operation on operands including a multiplier operand and multiplicands; a counter that obtains a count value by counting overflows generated as a result of the addition operation on the multiplication results produced in the DSPs; a sub accumulator that accumulates the multiplier operands input to the plurality of DSPs to obtain a cumulative multiplier value; and a main accumulator that corrects the addition result based on the count value and the cumulative multiplier value.

The plurality of DSPs may be connected in a pipeline and may include: a topmost DSP of the pipeline, which includes a multiplier; and DSPs each including a multiplier and an adder that performs an addition operation on the multiplication results.

The plurality of DSPs may receive the multiplier operand, which is of unsigned type, and the multiplicands, which are of signed type.

The plurality of DSPs may generate a combined multiplicand by combining the multiplicands.

The plurality of DSPs may generate the combined multiplicand by combining the multiplicands with a plurality of bits inserted between them.

The plurality of DSPs may input the generated combined multiplicand to the multiplier of each DSP.

The multiplier may perform a multiplication operation on the multiplier operand and the combined multiplicand.

The adder may sequentially add the multiplication results of the subsequent DSPs, starting from the multiplication result of the topmost DSP.

The counter may count overflows occurring at a guard bit of the addition result, where the guard bit is the single bit at the (2N+1)-th position of the addition result and N is the number of bits of the operands.

The count value may have α bits, where α is obtained by an equation that is not reproduced in this text. In that equation, K is the size of the one-dimensional convolution filter, N is the number of bits of the operands, and Tn is the number of DSPs that receive the operands in one SIMD MAC array.

The sub accumulator may accumulate the multiplier operands input to the DSPs that received a negative multiplicand.

The sub accumulator may accumulate the multiplier operands input to the DSPs in which the multiplicand placed in the lower N bits of the combined multiplicand is negative, where N is the number of bits of the operands.

The main accumulator may correct the addition result by subtracting a second associated value, which is based on the cumulative multiplier value, from a first associated value obtained by concatenating the count value and the addition result.

The first associated value may be the result of left-shifting the count value by 2N bits plus the lower 2N bits of the addition result, and the second associated value may be the cumulative multiplier value left-shifted by N bits, where N is the number of bits of the operands.

The main accumulator may correct the addition result obtained by adding together all the multiplication results produced in the DSPs.

An operation method of a SIMD MAC unit with improved computation speed according to an embodiment of the present invention may include: multiplying operands, including a multiplier operand and multiplicands, in each of a plurality of DSPs to which the operands are input; adding the multiplication results; obtaining a count value by counting overflows that occur during the adding; obtaining a cumulative multiplier value by accumulating, in parallel with the multiplying, the values of the multiplier operands input to the DSPs; and correcting the addition result based on the count value and the cumulative multiplier value.

The multiplying may include generating a combined multiplicand by inserting a plurality of bits between the multiplicands.

The generating of the combined multiplicand may include generating the combined multiplicand by left-shifting a first multiplicand among the multiplicands by 2N+1 bits and adding a second multiplicand to the shifted value, where N is the number of bits of the operands.

The multiplying may further include multiplying the multiplier operand by the combined multiplicand.

The adding may include sequentially adding the multiplication results of the subsequent DSPs, starting from the multiplication result of the topmost DSP connected to the pipeline.

The obtaining of the count value may include counting overflows occurring at the guard bit located at the (2N+1)-th position of the addition result.

The obtaining of the cumulative multiplier value may include accumulating the values of the multiplier operands input to the DSPs that received a negative multiplicand.

The obtaining of the cumulative multiplier value may include accumulating the multiplier operands input to the DSPs in which the multiplicand placed in the lower N bits of the combined multiplicand is negative, where N is the number of bits of the operands.

The correcting may include correcting the addition result obtained by adding together all the multiplication results of the plurality of DSPs.

The correcting may include: generating a first associated value by concatenating the count value and the addition result; and generating a second associated value based on the cumulative multiplier value.

The generating of the first associated value may include generating the first associated value by left-shifting the count value by 2N bits and adding the lower 2N bits of the addition result, where N is the number of bits of the operands.

The generating of the second associated value may include generating the second associated value by left-shifting the cumulative multiplier value by N bits, where N is the number of bits of the operands.

A convolutional neural network accelerator using an array of SIMD MAC units according to an embodiment of the present invention may include an array in which a plurality of SIMD MAC units with improved computation speed are connected in parallel.

When a negative multiplier operand is input, it may be converted to unsigned type.

According to the SIMD MAC unit with improved computation speed, the operation method thereof, and the convolutional neural network accelerator using an array of SIMD MAC units according to an embodiment of the present invention, the adders are connected in a pipeline and the MAC operations are performed cumulatively, so computation speed can be improved.

In addition, because an array of Tm SIMD MAC units that receive the operands generates Tm/2 outputs as its result, memory can be used efficiently.

In addition, because the multiplication is performed between the multiplier operand and a value in which the multiplicands are concatenated, the number of multiplication operations is reduced, which improves computation speed.

FIG. 1 is a diagram illustrating a floating-point MAC array in a conventional convolutional neural network accelerator.
FIG. 2 is a block diagram schematically illustrating the components of a SIMD MAC unit according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the structure of a SIMD MAC unit according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the structure of a convolutional neural network accelerator using an array of SIMD MAC units according to an embodiment of the present invention.
FIG. 5 is a flowchart schematically illustrating an operation method of a SIMD MAC unit according to an embodiment of the present invention.
FIG. 6 is a diagram schematically illustrating the process of correcting the result of the addition operation in the operation method of a SIMD MAC unit according to an embodiment of the present invention.

In this specification, terms such as "first" and/or "second" are used only to distinguish one element from another. That is, the elements are not intended to be limited by these terms.

In this specification, elements, features, and steps referred to with the expression "include" indicate that those elements, features, and steps are present; the expression is not intended to exclude one or more other elements, features, steps, or their equivalents.

In this specification, the singular form includes the plural form unless specifically stated otherwise. That is, an element mentioned in this specification may imply the presence or addition of one or more other elements.

Unless otherwise defined, all terms used in this specification, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs.

That is, terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this specification.

FIG. 2 is a block diagram schematically illustrating the components of a SIMD MAC unit according to an embodiment of the present invention.

Referring to FIG. 2, the SIMD MAC unit 200 may include a DSP 201, a counter 207, a sub accumulator 209, and a main accumulator 211. The DSP 201 may include a multiplier 203 and an adder 205.

The DSP 201 may receive operands, each N bits wide, including a multiplier operand, a first multiplicand, and a second multiplicand. The multiplier operand may be of unsigned type, and the multiplicands may be of signed type.

An unsigned type is a non-negative data type that can represent 0 and positive integers, and a signed type is a data type that can represent negative numbers, 0, and positive numbers.

Before performing the multiplication operation, the DSP 201 may generate a combined multiplicand by combining the input first multiplicand and second multiplicand with a plurality of bits inserted between them.

Specifically, the combined multiplicand may be generated by left-shifting the first multiplicand by 2N+1 bits and adding the second multiplicand to the shifted value. The combined multiplicand may therefore be 3N+1 bits long.

The bit at the (2N+1)-th position of the combined multiplicand is a guard bit, which may initially be set to 0.
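The packing step described above can be sketched in Python as follows. This is an illustrative model only, not the patented circuit: the function name `pack_multiplicands` is hypothetical, and the sketch assumes that the second multiplicand is placed in the low N bits as its raw two's-complement bit pattern (zero-extended), which is the reading consistent with the later correction by the cumulative multiplier value shifted left by N bits.

```python
def pack_multiplicands(w1: int, w2: int, n: int) -> int:
    """Pack two signed n-bit multiplicands into one (3n+1)-bit word.

    Assumed layout (bit 0 = least significant):
      bits 0 .. n-1      second multiplicand, raw two's-complement pattern
      bits n .. 2n-1     zero padding
      bit  2n            guard bit (the "(2N+1)-th" bit), initially 0
      bits 2n+1 .. 3n    first multiplicand, raw two's-complement pattern
    """
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    if not (lo <= w1 <= hi and lo <= w2 <= hi):
        raise ValueError("multiplicands must fit in n signed bits")
    mask = (1 << n) - 1
    # Shift the first multiplicand left by 2n+1 bits and add the second one.
    return ((w1 & mask) << (2 * n + 1)) | (w2 & mask)


# Example with N = 8: the packed word is at most 3N+1 = 25 bits wide.
packed = pack_multiplicands(-5, -1, 8)
guard_bit = (packed >> 16) & 1
```

With N = 8, the guard bit sits at bit index 16, between the two fields, and starts out as 0.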

The multiplier 203 may perform a multiplication operation on the combined multiplicand and the multiplier operand. The result of the multiplication operation may be a value of 4N+1 bits.
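To see why a single multiplication yields two products, the following Python sketch multiplies an unsigned operand by a packed word and splits the result into its fields. The names `packed_product` and `split_fields` are hypothetical, Python's unbounded integers stand in for the (4N+1)-bit hardware result, and the second multiplicand is again assumed to occupy the low N bits as a raw bit pattern.

```python
def packed_product(x: int, w1: int, w2: int, n: int) -> int:
    """Multiply unsigned n-bit x by the combined multiplicand of w1 and w2."""
    mask = (1 << n) - 1
    # w1 keeps its sign as a Python int; w2 contributes its raw n-bit pattern.
    combined = (w1 << (2 * n + 1)) + (w2 & mask)
    return x * combined


def split_fields(p: int, n: int):
    """Return (upper field, guard bit, lower 2n bits) of a packed product."""
    lower = p & ((1 << (2 * n)) - 1)
    guard = (p >> (2 * n)) & 1
    upper = p >> (2 * n + 1)  # arithmetic shift keeps the sign of x * w1
    return upper, guard, lower


# Non-negative second multiplicand: both fields are the exact products.
upper_a, guard_a, lower_a = split_fields(packed_product(200, -5, 7, 8), 8)

# Negative second multiplicand: the lower field is off by x * 2**n.
upper_b, guard_b, lower_b = split_fields(packed_product(200, -5, -1, 8), 8)
```

In the second case the lower field holds x times the unsigned pattern of the second multiplicand, which exceeds the true signed product by x shifted left by N bits. That excess is what the sub accumulator and main accumulator later remove.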

The adder 205 performs an addition operation that accumulates the multiplication results of the DSPs 201, accumulating them in the order in which the DSPs 201 are connected in the pipeline, beginning with the topmost DSP 201. That is, the adder 205 may sequentially add the multiplication results of the subsequent DSPs 201, starting from the multiplication result of the topmost DSP 201 of the pipeline.

While the adder 205 accumulates the multiplication results of the DSPs 201 in order, an overflow may occur at the guard bit. The guard bit is a 1-bit value initially set to 0, and because an overflow beyond the guard bit could occur while the multiplication results are accumulated, a process for handling the overflow may be needed.

The counter 207 may obtain a count value by counting the overflows generated as a result of the addition operation on the multiplication results produced by the multiplier 203 of each DSP 201.

The counter 207 handles the overflow that occurs at the guard bit while the adder 205 accumulates the multiplication results, and may obtain the count value by counting the overflows that occur at the guard bit. Once an overflow has been counted, the guard bit may be reset to 0 so that an overflow arising from the addition with the multiplication result of the next DSP in the pipeline can be counted.

Because the DSPs 201 included in the SIMD MAC unit 200 are connected in a pipeline, the counter 207 counts the overflow at the guard bit at each step of the sequential addition and resets the guard bit to 0, so the overflow need not affect the result.
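The guard-bit handling described above can be modeled as follows. This is a behavioral sketch under the assumption that each addend's lower field is below 2^(2N), so that at most one carry reaches the guard bit per addition; `accumulate_with_guard` is a hypothetical name, and the sketch accumulates only lower-field values for clarity.

```python
def accumulate_with_guard(products, n: int):
    """Accumulate values, counting carries into the guard bit (bit index 2n).

    Returns (acc, count), where acc has its guard bit cleared after every
    step so a carry never propagates into the field above the guard bit.
    """
    guard = 1 << (2 * n)
    acc, count = 0, 0
    for p in products:
        acc += p
        if acc & guard:      # a carry out of the lower 2n bits reached the guard bit
            count += 1       # the counter records it
            acc &= ~guard    # and the guard bit is reset to 0
    return acc, count


# Three lower-field values for N = 8; each is below 2**16.
values = [51000, 51000, 51000]
acc, count = accumulate_with_guard(values, 8)
# The true sum is recovered by concatenating the count with the lower 2N bits.
recovered = (count << 16) + (acc & 0xFFFF)
```

Concatenating the count above the lower 2N bits reconstructs the full unsigned sum, which is the role the first associated value plays in the correction step.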

The number of bits (α) of the count value may be calculated by an equation that is not reproduced in this text.

Here, K is the size of the convolution filter in one dimension, N is the number of bits of the operands, and Tn is the number of DSPs 201 included in the SIMD MAC unit 200, that is, the number of DSPs 201 that receive the operands. In other words, the size of the count value may depend on the number of bits N of the operands and the number Tn of DSPs 201 that receive the operands.

In parallel with the multiplication operation performed in the multiplier 203, the sub accumulator 209 may store the multiplier operand when the second multiplicand is negative. That is, the sub accumulator 209 may accumulate the values of the multiplier operands input to the DSPs 201 whose input second multiplicand is negative.

Because the second multiplicand occupies the lower N bits of the combined multiplicand, the sub accumulator 209 may accumulate the multiplier operands input to the DSPs in which the multiplicand in the lower N bits of the combined multiplicand is negative.

The main accumulator 211 may correct the addition result obtained by adding together all the multiplication results of the DSPs 201 included in the SIMD MAC unit 200. That is, the main accumulator 211 may correct the addition result produced by the adder 205 of the bottommost DSP 201 of the pipeline.

In the addition result, the upper 2N bits above the guard bit may be the sum of the products of the multiplier operand and the first multiplicand input to each DSP 201.

However, when a negative second multiplicand is input to any of the DSPs 201 of the SIMD MAC unit 200, the lower 2N bits below the guard bit may not equal the sum of the products of the multiplier operand and the second multiplicand input to each DSP 201, so correction may be needed.

The main accumulator 211 may correct the addition result by subtracting a second associated value, which is based on the cumulative multiplier value obtained by the sub accumulator 209, from a first associated value obtained by concatenating the count value and the addition result. The corrected result may be the sum of the products of the multiplier operand and the second multiplicand input to each DSP 201 of the SIMD MAC unit 200.

The first associated value may be the result of left-shifting the count value by 2N bits and adding the lower 2N bits of the addition result.

Because the overflow at the guard bit is produced by the addition of the lower 2N bits during the addition operation, the count value is concatenated with the lower 2N bits of the addition result when the addition result is corrected. That is, the first associated value may have 2N+α bits.

The second associated value may be the cumulative multiplier value left-shifted by N bits. That is, the second associated value may have 2N+α bits.

Because the cumulative multiplier value, which the sub accumulator 209 obtains by accumulating the multiplier operands input to the DSPs 201 whose second multiplicand is negative, has at most N+α bits, the second associated value may have at most 2N+α bits. When no DSP 201 receives a negative second multiplicand, the cumulative multiplier value is 0, so the second associated value may be 0.
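The reason this subtraction recovers the correct sum can be written out as a short derivation. The symbols $x_i$, $b_i$, $\tilde{b}_i$, $s_i$, and $L$ are introduced here for illustration and are not from the source; $C_1$ and $C_2$ follow the labels used for the count value and the cumulative multiplier value in FIG. 3. The derivation assumes the second multiplicand occupies the lower N bits of the combined multiplicand as its two's-complement bit pattern.

```latex
% x_i : unsigned N-bit multiplier operand of the i-th DSP
% b_i : signed N-bit second multiplicand of the i-th DSP
% s_i : 1 if b_i < 0, else 0 (the sign bit of b_i)
% \tilde{b}_i : the N-bit pattern of b_i read as an unsigned number
\begin{align}
  \tilde{b}_i &= b_i + s_i \, 2^{N} \\
  \sum_i x_i \tilde{b}_i &= \sum_i x_i b_i + 2^{N} \sum_i s_i x_i \\
  \sum_i x_i \tilde{b}_i &= C_1 \, 2^{2N} + L
      && \text{($L$: lower $2N$ bits, $C_1$: guard-bit overflow count)} \\
  C_2 &= \sum_i s_i x_i
      && \text{(cumulative multiplier value)} \\
  \sum_i x_i b_i &= \underbrace{\left(C_1 \, 2^{2N} + L\right)}_{\text{first associated value}}
      - \underbrace{C_2 \, 2^{N}}_{\text{second associated value}}
\end{align}
```

When no second multiplicand is negative, every $s_i$ is 0, so $C_2 = 0$ and the first associated value is already the desired sum.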

In the addition result, which has 4N+1 bits, the upper 2N bits (from the (2N+2)-th bit to the (4N+1)-th bit) may be the sum of the products of the multiplier operand and the first multiplicand input to each DSP 201 included in the SIMD MAC unit 200.

The corrected result may be the sum of the products of the multiplier operand and the second multiplicand input to each DSP 201 included in the SIMD MAC unit 200.
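Putting the pieces together, the following Python sketch models one SIMD MAC unit end to end: packing, multiplication, pipelined accumulation with guard-bit counting, sub accumulation, and the final correction. It is a behavioral model under stated assumptions, not the hardware design: `simd_mac` is a hypothetical name, Python's unbounded integers replace the fixed-width registers (so the upper field never wraps), and the second multiplicand is assumed to be packed as its raw N-bit pattern.

```python
def simd_mac(xs, w1s, w2s, n: int):
    """Return (sum of x*w1, sum of x*w2) using one multiply per operand triple."""
    mask_n = (1 << n) - 1
    guard = 1 << (2 * n)
    acc = 0      # running addition result passed down the DSP pipeline
    count = 0    # counter: number of overflows seen at the guard bit
    sub_acc = 0  # sub accumulator: multipliers whose w2 is negative
    for x, w1, w2 in zip(xs, w1s, w2s):
        combined = (w1 << (2 * n + 1)) + (w2 & mask_n)
        acc += x * combined
        if acc & guard:
            count += 1
            acc &= ~guard  # reset the guard bit before the next addition
        if w2 < 0:
            sub_acc += x
    first = (count << (2 * n)) + (acc & (guard - 1))  # first associated value
    second = sub_acc << n                             # second associated value
    return acc >> (2 * n + 1), first - second


xs = [255, 255, 200, 3]      # unsigned 8-bit multiplier operands
w1s = [-5, 7, -128, 127]     # signed 8-bit first multiplicands
w2s = [-1, -1, 127, -128]    # signed 8-bit second multiplicands
sum_w1, sum_w2 = simd_mac(xs, w1s, w2s, 8)
```

Both returned values can be checked against the ordinary dot products of `xs` with `w1s` and with `w2s`; the example inputs are chosen so that the guard bit overflows during accumulation and the correction term is nonzero.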

FIG. 3 is a diagram illustrating the structure of the SIMD MAC unit 200 according to an embodiment of the present invention.

Referring to FIG. 3, the DSP 201 may receive as operands an N-bit multiplier operand (in), a first multiplicand, and a second multiplicand.

The correction indicator is the MSB (Most Significant Bit) of the second multiplicand. When the second multiplicand is negative, the MSB is 1, so the multiplier operand may be accumulated in the sub accumulator 209 through an AND (&) operation with the multiplier operand.

When the second multiplicand is 0 or positive, the MSB is 0, so 0 may be input to the sub accumulator 209 through the AND (&) operation with the multiplier operand.

Accordingly, the sub accumulator 209 may accumulate the multiplier operand when the second multiplicand is negative. The value accumulated through the sub accumulator 209 has N+α bits, and the cumulative multiplier value is shown as C₂.
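The gating by the correction indicator can be illustrated with a small bit-level sketch. The name `gated_multiplier` is hypothetical, and replicating the sign bit across N bits before the AND is one assumed way to realize the AND operation described above.

```python
def gated_multiplier(x: int, w2: int, n: int) -> int:
    """Return x if the sign bit of the n-bit second multiplicand is 1, else 0."""
    mask = (1 << n) - 1
    msb = ((w2 & mask) >> (n - 1)) & 1  # correction indicator
    gate = -msb & mask                  # all ones when msb == 1, otherwise 0
    return x & gate


# The sub accumulator sums the gated multiplier operands to form C2.
xs = [255, 255, 200, 3]
w2s = [-1, -1, 127, -128]
c2 = sum(gated_multiplier(x, w2, 8) for x, w2 in zip(xs, w2s))
```

Only the operands paired with a negative second multiplicand pass through the gate, so `c2` is the sum of exactly those multiplier operands.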

The adder 205 may add the multiplication results of the DSPs 201 in the order in which they are connected to the pipeline. When an addition causes an overflow into the guard bit, the counter 207 counts the overflow and may store a count value having α bits. The stored count value is shown as C₁.

The addition result of the adder 205 may be passed to the adder 205 of the next DSP 201 in the pipeline, where it may be added to the multiplication result of that DSP.
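A minimal software sketch of this adder chain with overflow counting is given below. The name `add_chain` is hypothetical, and the sketch assumes that every incoming product has a 0 at the guard-bit position (bit 2N+1 counted from 1) and a lower field that fits in 2N bits, so that a single addition can set the guard bit at most once.

```python
def add_chain(products, n_bits):
    """Add products in pipeline order and count guard-bit overflows.

    Returns (total, count), where `count` plays the role of C1 and the
    guard bit of `total` has been reset after every counted overflow.
    """
    guard = 1 << (2 * n_bits)  # guard bit: position 2N+1 counted from 1
    total = 0
    count = 0
    for p in products:
        total += p
        if total & guard:      # carry out of the lower 2N bits reached the guard bit
            count += 1
            total &= ~guard    # reset the guard bit so the upper field is untouched
    return total, count
```

With N = 4 and lower-field values 200, 100 and 250, the true sum 550 is represented as a count of 2 and a remaining lower field of 38, since 550 = 2 x 256 + 38.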

FIG. 4 is a diagram illustrating the structure of a convolutional neural network accelerator including SIMD MAC units according to an embodiment of the present invention.

Referring to FIG. 4, each SIMD MAC unit 200 included in the convolutional neural network accelerator 300 may include Tn DSPs 201. That is, Tn is the number of input values and may equal the number of DSPs 201 included in the SIMD MAC unit 200.

The convolutional neural network accelerator 300 may have Tm/2 output values. Here, Tm is the number of outputs produced by the conventional convolutional neural network accelerator 100 for the same Tn inputs and the same weight matrix.

That is, the convolutional neural network accelerator 300 using the array of SIMD MAC units 200 of the present invention can reduce the number of outputs compared with the conventional convolutional neural network accelerator 100, and therefore has the advantage of increasing the speed of multiply-accumulate operations.

In addition, the convolutional neural network accelerator 300 may have input and output feature maps capable of converting input and output values to an unsigned type, so that when a negative multiplier is input it can be converted to an unsigned value before the operation.

That is, a negative multiplier can be converted to a positive number before the operation. Because this conversion is performed during the training process for the FPGA, the unsigned conversion may incur no runtime overhead.

The topmost DSP 201 in the pipeline of the SIMD MAC unit 200 included in the convolutional neural network accelerator 300 may include only the multiplier 203 and not the adder 205.

The adder 205 adds the multiplication result of the input operands to the addition result of the preceding DSP 201. Because the topmost DSP 201 in the pipeline has no preceding addition result to receive, it may include only the multiplier 203. In this case, the pipeline starts at the topmost multiplier 203 and connects to the adder 205 of the next DSP 201 in order.

The adder 205 of the bottommost DSP 201 in the pipeline of the SIMD MAC unit 200 may be connected to the main accumulator 211. Because the addition result of the adder 205 of the bottommost DSP 201 is the sum of the multiplication results of all DSPs 201 of the SIMD MAC unit 200, it is passed to the main accumulator 211, where it is corrected to obtain the corrected result value.

For correcting the addition result, the count value counted by the counter 207 and the cumulative multiplier value accumulated by the sub accumulator 209 may be passed to the main accumulator 211.

The main accumulator 211 subtracts the second associated value, generated from the cumulative multiplier value, from the first associated value, which concatenates the count value with the lower 2N bits of the addition result. This corrects the sum of the products of the multiplier and the second multiplicand of each DSP 201 of the SIMD MAC unit 200 and yields the corrected result value.
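Why this subtraction recovers the signed sum can be sketched as follows. The derivation rests on an assumption that the text does not state explicitly: each second multiplicand occupies the lower N bits of the combined multiplicand as its N-bit two's-complement pattern. The symbols x_i (multiplier of the i-th DSP), w_i (its second multiplicand), \hat{w}_i (the unsigned reading of that pattern) and L (the lower 2N bits of the addition result) are introduced here for illustration only; C_1 and C_2 are the count value and the cumulative multiplier value.

```latex
\begin{align}
\hat{w}_i &= w_i + 2^{N}\,[\,w_i < 0\,] \\
\sum_i x_i \hat{w}_i &= \sum_i x_i w_i + 2^{N} \sum_{i:\,w_i<0} x_i
                      = \sum_i x_i w_i + 2^{N} C_2 \\
\sum_i x_i \hat{w}_i &= 2^{2N} C_1 + L \\
\Rightarrow \quad \sum_i x_i w_i &= \left(2^{2N} C_1 + L\right) - 2^{N} C_2
\end{align}
```

Under this reading, the first term on the right-hand side is the first associated value (the count value shifted left by 2N bits plus the lower 2N bits) and the second term is the second associated value (the cumulative multiplier value shifted left by N bits).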

FIG. 5 is a flowchart briefly illustrating a method of operating the SIMD MAC unit according to an embodiment of the present invention.

Referring to FIG. 5, the method of operating the SIMD MAC unit 200 may include multiplying the operands (S100), adding the multiplication results (S102), counting overflows occurring in the guard bit (S104), accumulating the values of the multipliers input to the DSPs (S106), and correcting the addition result (S108).

In the step of multiplying the operands (S100), each DSP 201 of the SIMD MAC unit 200 multiplies its input operands: the combined multiplicand, generated from the first multiplicand and the second multiplicand, is multiplied by the multiplier.

The combined multiplicand may be generated by combining the multiplicands with a plurality of bits inserted between them. Specifically, the combined multiplicand may be generated by left-shifting the first multiplicand by 2N+1 bits and adding the second multiplicand to the result. The combined multiplicand may therefore have 3N+1 bits.
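The packing step can be illustrated with a short sketch. The function name `pack_multiplicands` is hypothetical, and the sketch assumes that the second multiplicand is placed in the lower N bits as its N-bit two's-complement pattern; the text only says that it is added to the shifted first multiplicand, so this detail is an interpretation chosen to be consistent with the later correction step.

```python
def pack_multiplicands(w1, w2, n_bits):
    """Build a combined multiplicand: w1 shifted left by 2N+1 bits plus w2.

    ASSUMPTION: w2 enters as its N-bit two's-complement bit pattern.
    """
    mask = (1 << n_bits) - 1
    return (w1 << (2 * n_bits + 1)) + (w2 & mask)


# One multiplication by the multiplier x then yields both products at once.
N = 4
x = 7
product = x * pack_multiplicands(3, 5, N)
upper = product >> (2 * N + 1)         # x * w1
lower = product & ((1 << (2 * N)) - 1)  # x * w2 (w2 non-negative here)
```

Here the N zero bits and the guard bit between the two fields leave room for the 2N-bit product of the lower field, so the two products do not interfere within a single multiplication.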

In the step of adding the multiplication results (S102), the multiplication results of the DSPs 201 of the SIMD MAC unit 200 are added: starting from the multiplication result of the topmost DSP in the pipeline, the multiplication results of the subsequent DSPs are added sequentially.

The multiplication result of the topmost DSP 201 of the SIMD MAC unit 200 may be added to the multiplication result of the next DSP 201 connected through the pipeline. The addition result produced at that DSP 201 is then passed to the following DSP, where it is added to the multiplication result of that DSP.

That is, the multiplication results of the DSPs 201 are accumulated in pipeline order, starting from the topmost DSP 201.

The step of counting overflows occurring in the guard bit (S104) counts the overflows that occur in the guard bit during the step of adding the multiplication results (S102).

Because the guard bit is a 1-bit value initially set to 0, when an overflow occurs the counter 207 counts the overflow and, once the count is complete, the guard bit may be reset to 0.

The step of accumulating the values of the multipliers input to the DSPs (S106) may be performed in parallel with the step of multiplying the operands (S100).

The sub accumulator 209 may accumulate the values of the multipliers input to those DSPs 201 whose second multiplicand is negative.

The step of correcting the addition result (S108) corrects the result when a negative second multiplicand is present in at least one DSP 201 of the SIMD MAC unit 200.

The step of correcting the addition result (S108) may include generating a first associated value by concatenating the addition result of the adder 205 of the bottommost DSP in the pipeline with the count value of the counter 207.

It may also include generating a second associated value by left-shifting the cumulative multiplier value accumulated in the sub accumulator 209 by N bits.

The first associated value is obtained by left-shifting the count value by 2N bits and adding the lower 2N bits of the addition result.

An overflow into the guard bit during the addition arises from the computation in the lower 2N bits below the guard bit. By storing the overflow separately, the computation in the upper 2N bits above the guard bit is left unaffected.

The upper 2N bits of the addition result (bits 2N+2 through 4N+1) may be the sum of the products of the multiplier and the first multiplicand input to each DSP 201 of the SIMD MAC unit 200.

When at least one DSP 201 has a negative second multiplicand, subtracting the second associated value from the first associated value corrects the lower 2N bits (bits 1 through 2N) of the addition result.

The result value corrected by the main accumulator 211 may be the sum of the products of the multiplier and the second multiplicand input to each DSP 201 of the SIMD MAC unit 200.
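The correction performed by the main accumulator can be sketched as a small function. The name `correct_lower_sum` and its parameter names are hypothetical; `total` stands for the addition result after guard-bit resets, `count` for the count value, and `cum_multiplier` for the cumulative multiplier value.

```python
def correct_lower_sum(total, count, cum_multiplier, n_bits):
    """Recover the signed sum of multiplier x second-multiplicand products."""
    low = total & ((1 << (2 * n_bits)) - 1)   # lower 2N bits of the addition result
    first = (count << (2 * n_bits)) + low      # first associated value
    second = cum_multiplier << n_bits          # second associated value
    return first - second
```

As a worked case with N = 4, take multipliers 3 and 5 with second multiplicands -1 and 2. If -1 is carried as the 4-bit pattern 15 (an assumption about the packing), the lower field sums to 3 x 15 + 5 x 2 = 55 with no overflow, the cumulative multiplier value is 3, and the correction gives 55 - (3 << 4) = 7, which equals 3 x (-1) + 5 x 2.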

Because each DSP 201 of the SIMD MAC unit 200 of the convolutional neural network accelerator 300 operates on a combined multiplicand that joins the first multiplicand and the second multiplicand, Tm/2 results can be output.

FIG. 6 is a diagram briefly illustrating the process of correcting the result of the addition in the method of operating the SIMD MAC unit according to an embodiment of the present invention.

Referring to FIG. 6, the addition result value 401 of the DSPs 201 of the SIMD MAC unit 200 can be divided, around the guard bit, into an upper part and a lower part.

The upper part may be the sum of the products of the multiplier and the first multiplicand of each DSP 201 of the SIMD MAC unit 200.

The lower part may be the sum of the products of the multiplier and the second multiplicand of each DSP 201 of the SIMD MAC unit 200. However, it requires correction when at least one second multiplicand is negative.

The first associated value 403 may be generated by adding the count value C₁ counted by the counter 207, left-shifted by 2N bits, to the lower 2N bits of the addition result.

The second associated value 405 may be generated by left-shifting C₂, the cumulative multiplier value from the sub accumulator 209, by N bits.

The corrected result value 407 may be generated by subtracting the second associated value 405 from the first associated value 403.

The corrected result value 407 may be the sum of the products of the multiplier and the second multiplicand of each DSP 201 of the SIMD MAC unit 200.
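The whole flow of FIG. 5 and FIG. 6 can be checked with a short simulation. This is a sketch under stated assumptions rather than a model of the actual hardware: the function `simd_mac` is hypothetical, multipliers are unsigned N-bit values, multiplicands are signed N-bit values, the second multiplicand is packed as its N-bit two's-complement pattern, and Python integers stand in for the DSP datapath without any width limit on the upper field.

```python
def simd_mac(xs, w1s, w2s, n_bits):
    """Return (sum of x*w1, sum of x*w2) using one packed multiply per element."""
    mask = (1 << n_bits) - 1
    guard = 1 << (2 * n_bits)               # guard bit, position 2N+1 from 1
    total = count = cum = 0
    for x, w1, w2 in zip(xs, w1s, w2s):
        packed = (w1 << (2 * n_bits + 1)) + (w2 & mask)  # combined multiplicand
        total += x * packed                 # S100 multiply, S102 add
        if total & guard:                   # S104 count guard-bit overflow
            count += 1
            total &= ~guard
        if w2 < 0:                          # S106 accumulate multiplier
            cum += x
    upper = total >> (2 * n_bits + 1)       # sum of x * w1
    first = (count << (2 * n_bits)) + (total & (guard - 1))
    lower = first - (cum << n_bits)         # S108 corrected sum of x * w2
    return upper, lower


xs = [200, 17, 255, 3]
w1s = [-5, 100, 127, -128]
w2s = [-128, 127, -1, 0]
result = simd_mac(xs, w1s, w2s, 8)
expected = (sum(x * w for x, w in zip(xs, w1s)),
            sum(x * w for x, w in zip(xs, w2s)))
```

In this example one guard-bit overflow is counted and two multipliers (200 and 255) are accumulated, and `result` agrees with the two dot products computed directly.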

Although the description herein presents several exemplary aspects, various modifications and changes may be made within the scope defined by the claims below, and the technical scope of protection of the present invention shall be determined by the following claims.

100: conventional convolutional neural network accelerator
200: SIMD MAC unit
201: DSP
203: multiplier
205: adder
207: counter
209: sub accumulator
211: main accumulator
300: convolutional neural network accelerator
401: addition result value
403: first associated value
405: second associated value
407: corrected result value

Claims

A SIMD MAC unit with improved operation speed, comprising:
a plurality of DSPs, each including a multiplier that multiplies operands including a multiplier and multiplicands;
a counter that obtains a count value by counting overflows generated when the multiplication results of the DSPs are added;
a sub accumulator that accumulates the multipliers input to the plurality of DSPs to obtain a cumulative multiplier value; and
a main accumulator that corrects the addition result based on the count value and the cumulative multiplier value.

The SIMD MAC unit of claim 1,
wherein the plurality of DSPs are connected in a pipeline and comprise:
a topmost DSP of the pipeline including a multiplier; and
a DSP including an adder that performs addition of the multiplication results.

The SIMD MAC unit of claim 1,
wherein the plurality of DSPs receive the multiplier, which is of an unsigned type, and the multiplicands, which are of a signed type.

The SIMD MAC unit of claim 1,
wherein the plurality of DSPs generate a combined multiplicand by combining the multiplicands.

The SIMD MAC unit of claim 1,
wherein the plurality of DSPs generate a combined multiplicand by combining the multiplicands with a plurality of bits added between them.

The SIMD MAC unit of claim 4 or 5,
wherein the plurality of DSPs input the generated combined multiplicand to the multiplier of each DSP.

The SIMD MAC unit of claim 4,
wherein the multiplier of each DSP performs multiplication of the multiplier operand and the combined multiplicand.

The SIMD MAC unit of claim 2,
wherein the adder sequentially adds the multiplication results of the subsequent DSPs, starting from the multiplication result of the topmost DSP.

The SIMD MAC unit of claim 1,
wherein the counter counts overflows occurring in a guard bit of the addition result,
the guard bit is the one bit located at the (2N+1)-th position of the addition result, and
N is the number of bits of the operands.

The SIMD MAC unit of claim 1,
wherein the count value is a value of [expression missing from source] bits,
where K is the size of the one-dimensional convolution filter, N is the number of bits of the operands, and Tn is the number of DSPs receiving the operands in one SIMD MAC array.

The SIMD MAC unit of claim 1,
wherein the sub accumulator accumulates the multipliers input to the DSPs that receive negative multiplicands.

The SIMD MAC unit of claim 4,
wherein the sub accumulator accumulates the multipliers input to the DSPs that receive a multiplicand which is negative and which occupies the lower N bits when the combined multiplicand is generated, and
N is the number of bits of the operands.

The SIMD MAC unit of claim 1,
wherein the main accumulator corrects the addition result by subtracting a second associated value, based on the cumulative multiplier value, from a first associated value that associates the count value with the addition result.

The SIMD MAC unit of claim 13,
wherein the first associated value is obtained by left-shifting the count value by 2N bits and adding the lower 2N bits of the addition result,
the second associated value is obtained by left-shifting the cumulative multiplier value by N bits, and
N is the number of bits of the operands.

The SIMD MAC unit of claim 1,
wherein the main accumulator corrects the addition result obtained by adding all the multiplication results of the DSPs.

A method of operating a SIMD MAC unit with improved operation speed, the method comprising:
multiplying operands in each of a plurality of DSPs to which operands including a multiplier and multiplicands are input;
adding the multiplication results;
obtaining a count value by counting overflows occurring in the addition;
obtaining a cumulative multiplier value by accumulating, in parallel with the multiplication, the values of the multipliers input to the DSPs; and
correcting the addition result based on the count value and the cumulative multiplier value.

The method of claim 16,
wherein the multiplying includes generating a combined multiplicand by combining the multiplicands with a plurality of bits added between them.

The method of claim 17,
wherein generating the combined multiplicand includes left-shifting a first multiplicand of the multiplicands by 2N+1 bits and adding a second multiplicand to the result, and
N is the number of bits of the operands.

The method of claim 17,
wherein the multiplying further includes multiplying the multiplier by the multiplicand.

The method of claim 16,
wherein the adding includes sequentially adding the multiplication results of the subsequent DSPs, starting from the multiplication result of the topmost DSP connected to the pipeline.

The method of claim 16,
wherein obtaining the count value includes counting overflows occurring in the guard bit located at the (2N+1)-th position of the addition result.

The method of claim 16,
wherein obtaining the cumulative multiplier value includes accumulating the values of the multipliers input to the DSPs that receive negative multiplicands.

The method of claim 17,
wherein obtaining the cumulative multiplier value includes accumulating the multipliers input to the DSPs that receive a negative multiplicand occupying the lower N bits of the combined multiplicand, and
N is the number of bits of the operands.

The method of claim 16,
wherein the correcting includes correcting the addition result obtained by adding all the multiplication results of the plurality of DSPs.

The method of claim 16,
wherein the correcting includes:
generating a first associated value by concatenating the count value with the addition result; and
generating a second associated value based on the cumulative multiplier value.

The method of claim 25,
wherein generating the first associated value includes left-shifting the count value by 2N bits and adding the lower 2N bits of the addition result, and
N is the number of bits of the operands.

The method of claim 25,
wherein generating the second associated value includes left-shifting the cumulative multiplier value by N bits, and
N is the number of bits of the operands.

A convolutional neural network accelerator using an array of SIMD MAC units, comprising a plurality of SIMD MAC units.

The convolutional neural network accelerator of claim 28,
wherein, when a negative multiplier is input, the negative multiplier is converted to an unsigned value.