KR20210126506A - Supporting floating point 16 (fp16) in dot product architecture

Info

Publication number: KR20210126506A
Application number: KR1020210042233A
Authority: KR
Inventors: Hamzah Ahmed Ali Abdelaziz; Ali Shafiee Ardestani; Joseph H. Hassoun
Original assignee: Samsung Electronics Co., Ltd.
Priority date: 2020-04-10
Filing date: 2021-03-31
Publication date: 2021-10-20
Also published as: EP3893105A1; EP3893105B1; US20210319079A1; CN113515261A

Abstract

Disclosed are a dot-product architecture for calculating floating-point dot products of two vectors and a method thereof. The architecture includes an array of multiplier units, each of which includes: integer logic that multiplies integer values of corresponding elements of the two vectors to form a product-integer value; exponent logic that adds exponent values of the corresponding elements of the two vectors to form an unbiased exponent value; and a local shifter that forms a first shifted value by shifting the product-integer value by a number of bits in a predetermined direction, based on the difference between the unbiased exponent value corresponding to the product-integer value and a maximum unbiased exponent value for the array of multiplier units. An adder tree adds the shifted values output from the local shifters of the array of multiplier units to form an output, and an accumulator accumulates the output of the adder tree.

Description

Computing unit architecture and method for calculating a floating-point dot product {SUPPORTING FLOATING POINT 16 (FP16) IN DOT PRODUCT ARCHITECTURE}

The present disclosure relates to a computing unit, and more particularly, to a computing unit architecture and a method for calculating a dot product of floating-point values.

The dot product of activation values and weight values is an operation commonly computed by deep neural network (DNN) accelerators. The activation values and the weight values may be represented as 16-bit half-precision floating-point (FP16) values. An FP16 value may be represented by sign, exponent, and fraction (mantissa) bits. For example, FIG. 1A shows a half-precision FP16 representation containing 1 sign bit, 5 exponent bits (bias = 15), and 10 mantissa bits (plus 1 hidden bit). FIG. 1B shows a table of example representations for different types of FP numbers. The representations in FIG. 1B show an exponent not equal to 0 or infinity (1...1), and a bias of 15 for FP16 (127 for FP32). The abbreviation NaN stands for Not a Number.
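The field layout described above can be made concrete with a short sketch. It is only an illustration of the standard half-precision layout (1 sign bit, 5 exponent bits with bias 15, 10 fraction bits plus a hidden bit); the function names are hypothetical and do not come from the disclosure.

```python
import struct

FP16_BIAS = 15  # exponent bias stated in the text


def float_to_fp16_bits(x: float) -> int:
    """Return the 16-bit pattern of x rounded to half precision."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def decode_fp16(bits: int):
    """Split a 16-bit pattern into (sign, biased exponent, fraction) fields."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF
    return sign, exponent, fraction


def classify(bits: int) -> str:
    """Name the kind of FP number encoded by the bit pattern."""
    _, exponent, fraction = decode_fp16(bits)
    if exponent == 0x1F:                      # exponent field is all ones
        return "inf" if fraction == 0 else "nan"
    if exponent == 0:                         # exponent field is all zeros
        return "zero" if fraction == 0 else "subnormal"
    return "normal"


def normal_value(bits: int) -> float:
    """Value of a normalized FP16 number, with the hidden bit made explicit."""
    sign, exponent, fraction = decode_fp16(bits)
    if classify(bits) != "normal":
        raise ValueError("not a normalized FP16 value")
    significand = (1 << 10) | fraction        # 11-bit integer: hidden bit + fraction
    return (-1) ** sign * significand * 2.0 ** (exponent - FP16_BIAS - 10)
```

For instance, `decode_fp16(float_to_fp16_bits(1.0))` yields a biased exponent of 15, that is, an unbiased exponent of 0.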

An object of the present disclosure is to provide a computing unit architecture and a method for calculating a dot product of floating-point values.

An embodiment of the present disclosure provides an apparatus for calculating a dot product of a first vector and a second vector. The apparatus may include a multiplier array, a maximum tree unit, an adder tree, and an accumulator. The first vector may contain activation values and the second vector may contain weight values. Each multiplier of the multiplier array may include integer logic, exponent logic, and a local shifter. The integer logic may multiply integer values of corresponding elements of the first vector and the second vector to form a product-integer value, where the first vector and the second vector may contain floating-point values. The exponent logic may add the exponent values corresponding to the integer values of the corresponding elements of the two vectors to form an unbiased exponent value corresponding to the product-integer value. The local shifter may form a first shifted value by shifting the product-integer value by a number of bits in a predetermined direction, based on the difference between the unbiased exponent value corresponding to the product-integer value and a maximum unbiased exponent value for the multiplier array being less than or equal to a predetermined maximum bit-shift capacity of the local shifter. The maximum tree unit may determine the maximum unbiased exponent value for the multiplier array. The adder tree may add the first shifted values output from the local shifters of the multiplier array to form a first output, and the accumulator may accumulate the first output of the adder tree.

In one embodiment, the apparatus may further include a mask generator that generates a first mask coupling a first shifted value to the adder tree, based on the difference between the unbiased exponent value of the corresponding product-integer value and the maximum unbiased exponent value being less than or equal to the predetermined maximum bit-shift capacity of the local shifter. The adder tree may add the first shifted values that are output from the local shifters of the multiplier array and coupled to the adder tree by the first mask to form the first output. The mask generator may generate the first mask during a first cycle.

In another embodiment, the mask generator may generate a second mask coupling a first shifted value to the adder tree, based on the difference between the unbiased exponent value of the corresponding product-integer value and the maximum unbiased exponent value being greater than the predetermined maximum bit-shift capacity of the local shifter. The adder tree may add the first shifted values that are output from the local shifters of the multiplier array and coupled to the adder tree by the second mask to form a second output. The apparatus may further include an auxiliary shifter that is coupled to the adder tree and that shifts the second output from the adder tree by the predetermined maximum bit-shift capacity of the local shifter to form a second shifted value, and the accumulator may further accumulate the second output of the adder tree. The mask generator may generate the second mask during a second cycle. In one embodiment, the activation values and the weight values may include 16-bit floating-point (FP16) values. In another embodiment, the activation values and the weight values may be 32-bit floating-point (FP32) values.

An embodiment of the present disclosure provides a multiplier that may include integer logic, exponent logic, and a local shifter. The integer logic may multiply integer values of elements of a first vector by integer values of corresponding elements of a second vector to form a product-integer value. The first vector may contain activation values and the second vector may contain weight values. The exponent logic may add the exponent values corresponding to the integer values of the corresponding elements of the two vectors to form an unbiased exponent value corresponding to the product-integer value. The local shifter may form a first shifted value by shifting the product-integer value by a number of bits in a predetermined direction, based on the difference between the unbiased exponent value corresponding to the product-integer value and a predetermined value being less than or equal to a predetermined maximum bit-shift capacity of the local shifter. The multiplier may be part of a multiplier array, together with a maximum tree unit, an adder tree, and an accumulator. The maximum tree unit may determine the predetermined value, which includes the maximum unbiased exponent value for the multiplier array. The adder tree may add the first shifted values output from the local shifters of the multiplier array to form a first output. The accumulator may accumulate the first output of the adder tree.

A mask generator may generate a first mask coupling a first shifted value to the adder tree, based on the difference between the unbiased exponent value of the corresponding product-integer value and the maximum unbiased exponent value being less than or equal to the predetermined maximum bit-shift capacity of the local shifter. The adder tree may add the first shifted values that are output from the local shifters of the multiplier array and coupled to the adder tree by the first mask to form the first output. The mask generator may generate the first mask during a first cycle.

The mask generator may also generate a second mask coupling a first shifted value to the adder tree, based on the difference between the unbiased exponent value of the corresponding product-integer value and the maximum unbiased exponent value being greater than the predetermined maximum bit-shift capacity of the local shifter. The mask generator may generate the second mask during a second cycle. The adder tree may add the first shifted values that are output from the local shifters of the multiplier array and coupled to the adder tree by the second mask to form a second output. The multiplier array may further include an auxiliary shifter that is coupled to the adder tree and that shifts the second output from the adder tree by the predetermined maximum bit-shift capacity of the local shifter to form a second shifted value, and the accumulator may further accumulate the second output of the adder tree. In one embodiment, the activation values and the weight values may include 16-bit floating-point (FP16) values. In another embodiment, the activation values and the weight values may be 32-bit floating-point (FP32) values.

An embodiment of the present disclosure provides a method of calculating a dot product of floating-point values. The method may include: multiplying element-wise, by integer logic of a multiplier array, integer values of elements of a first vector by integer values of corresponding elements of a second vector to form integer product values, the first vector containing n elements of 16-bit floating-point values and the second vector containing n elements of 16-bit floating-point values, where n is an integer greater than 1; adding element-wise, by exponent logic of the multiplier array, exponent values of the elements of the first vector to exponent values of the corresponding elements of the second vector to form exponent-sum values respectively corresponding to the integer product values; determining a maximum exponent-sum value of the exponent-sum values; subtracting, by the exponent logic, the maximum exponent-sum value from each of the exponent-sum values to form relative exponent values respectively corresponding to the integer product values; right-shifting, by first local shifters of the multiplier array, first integer product values by the corresponding relative exponent values to form integer product values aligned with the integer product value corresponding to the maximum exponent-sum value, each first local shifter having a first predetermined maximum number of bit shifts that is less than the full bit range of the exponent-sum values of the first vector and the second vector, and the first integer product values corresponding to relative exponent values that are less than or equal to the first predetermined maximum number of bit shifts; and adding the first integer product values aligned with the integer product value corresponding to the maximum exponent-sum value to form the dot product of the first vector and the second vector.

In one embodiment, the method may further include right-shifting, by an auxiliary shifter, second integer product values by a second predetermined number of bit shifts to form second integer product values aligned with the integer product value corresponding to the maximum exponent-sum value, where the second predetermined number of bit shifts may include the first predetermined maximum number of bit shifts. The full bit range of the exponent values of the first vector and the second vector may include 58 bits, and the sum of the first predetermined maximum number of bit shifts and the second predetermined number of bit shifts may be less than or equal to 58 bits. In one embodiment, the elements of the first vector and the elements of the second vector may include 16-bit floating-point (FP16) values. In another embodiment, the elements of the first vector and the elements of the second vector may include 32-bit floating-point (FP32) values.
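The method steps above, together with the two masks and the auxiliary shifter from the apparatus embodiments, can be sketched in software as follows. This is a behavioral sketch under stated assumptions, not the disclosed circuit: the function and variable names are hypothetical, the relative exponent is taken as a non-negative difference from the maximum, the second cycle is assumed to shift locally by the difference minus the local capacity, and products whose difference exceeds twice the local capacity are assumed to be ignored.

```python
def two_cycle_dot(products, exponents, local_max_shift):
    """Align and add integer mantissa products using a limited local shifter.

    products  : signed integer mantissa products (output of the integer logic)
    exponents : unbiased exponent sums, one per product (output of the exponent logic)
    Returns (accumulated integer, maximum exponent); the represented value is
    accumulated * 2**(maximum exponent) times the fixed mantissa scale.
    """
    e_max = max(exponents)                        # maximum tree
    diffs = [e_max - e for e in exponents]        # relative exponents, all >= 0
    cap = local_max_shift

    # Cycle 1: the first mask selects products the local shifter can align directly.
    mask1 = [d <= cap for d in diffs]
    first_output = sum(p >> d for p, d, m in zip(products, diffs, mask1) if m)

    # Cycle 2 (assumed behavior): the second mask selects products that are out of
    # local range; they are shifted locally by (d - cap), summed, and the sum is
    # shifted by cap in the auxiliary shifter.
    mask2 = [cap < d <= 2 * cap for d in diffs]
    second_output = sum(p >> (d - cap) for p, d, m in zip(products, diffs, mask2) if m)
    second_shifted = second_output >> cap         # auxiliary shifter

    accumulator = first_output + second_shifted   # accumulator collects both cycles
    return accumulator, e_max
```

Shifting the partial sum once in the auxiliary shifter, rather than giving every multiplier a full-range shifter, is the area saving the text describes; the price is that the low bits of the second-cycle sum are truncated together.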

According to embodiments of the present disclosure, a computing unit architecture and a method for calculating a dot product of floating-point values are provided.

FIG. 1A shows a half-precision FP16 representation comprising a sign bit, exponent bits, and mantissa bits.
FIG. 1B is a table showing example representations for different types of FP numbers.
FIG. 2 shows a first embodiment of a dot-product calculation architecture 200 that, while being optimized for area and power, aligns the exponents of the mantissa products so that the mantissas can be added, according to an embodiment of the present disclosure.
FIG. 3A shows a histogram of the distribution of maximum exponent values for deep-learning FP16 data.
FIG. 3B shows a histogram of the distribution of maximum minus exponent values for deep-learning FP16 data.
FIG. 3C shows a histogram of the distribution of product exponent values for deep-learning FP16 data.
FIGS. 4A and 4B illustrate an embodiment of a dot-product calculation architecture that may be optimized to calculate a dot product for a deep-learning network, according to an embodiment of the present disclosure.
FIGS. 5A and 5B illustrate additional details of another embodiment of a dot-product calculation architecture that provides multi-cycle operation, according to an embodiment of the present disclosure.
FIG. 6 is a flowchart of a method for calculating a dot product for a deep-learning network, according to an embodiment of the present disclosure.
FIG. 7 illustrates an electronic device including a CNN accelerator having one or more of the dot-product calculation architectures disclosed herein.

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. However, it will be understood by those of ordinary skill in the art that the disclosed aspects may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure the subject matter of the present disclosure.

Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment disclosed herein. Thus, the appearances of the phrases "in one embodiment" or "according to an embodiment" (or other phrases having a similar meaning) in various places throughout this specification may not necessarily all refer to the same embodiment. Moreover, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In this regard, as used herein, the word "exemplary" means "serving as an example, instance, or illustration." Any embodiment described herein as "exemplary" should not be construed as necessarily preferred or advantageous over other embodiments. Also, depending on the context of the discussion herein, singular terms may include the corresponding plural forms, and plural terms may include the corresponding singular forms. Similarly, hyphenated terms (e.g., "2-dimensional", "pre-set", "pixel-specific", etc.) may occasionally be used interchangeably with the corresponding non-hyphenated versions (e.g., "2 dimensional", "pre set", "pixel specific", etc.), and capitalized entries (e.g., "Counter Clock", "Select", "PIXOUT", etc.) may be used interchangeably with the corresponding non-capitalized versions (e.g., "counter clock", "select", "pixout", etc.). Such occasional interchangeable uses may be regarded as consistent with each other.

It should be noted that the various figures (including component diagrams) shown and discussed herein are for illustrative purposes only and are not drawn to scale. Likewise, the various waveforms and timing diagrams are shown for illustrative purposes only. For example, the dimensions of some elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals are repeated among the figures to indicate corresponding and/or analogous elements.

The terminology used herein is for the purpose of describing some embodiments only and is not intended to limit the claimed subject matter. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The terms "first", "second", etc., as used herein, are used as labels for the nouns that they precede and do not imply any type of ordering (e.g., spatial, temporal, logical, etc.) unless explicitly defined as such. Furthermore, the same reference numerals may be used across two or more figures to refer to parts, components, blocks, circuits, units, or modules having the same or similar functionality. Such usage is, however, for simplicity of illustration and ease of discussion only; it does not imply that the construction or architectural details of such components or units are the same across all embodiments, or that such commonly referenced parts/modules are the only way to implement some of the embodiments disclosed herein.

It will be understood that when an element or layer is referred to as being "on", "connected to", or "coupled to" another element or layer, it can be directly on, connected to, or coupled to the other element or layer, or intervening elements or layers may be present. In contrast, when an element is referred to as being "directly on", "directly connected to", or "directly coupled to" another element or layer, no intervening elements or layers are present. Like reference numerals refer to like elements throughout. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and should not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

As used herein, the term "module" refers to any combination of software, firmware, and/or hardware configured to provide the functionality described herein in connection with a module. For example, software may be embodied as a software package, code, and/or an instruction set or instructions, and the term "hardware", as used in any implementation described herein, may include, for example, singly or in any combination, an assembly, hardwired circuitry, programmable circuitry, state-machine circuitry, and/or firmware that stores instructions executed by programmable circuitry. The modules may, collectively or individually, be embodied as circuitry that forms part of a larger system, such as, but not limited to, an integrated circuit (IC), a system on a chip (SoC), an assembly, and so forth.

The subject matter disclosed herein provides an architecture for calculating the dot product of floating-point values of deep-learning data that is optimized in area and power for the commonly occurring cases. The optimization of the architecture may be based on the distribution of the range of exponent values that are expected to be processed. Depending on the distribution of the exponent range, the architecture may be optimized to align the mantissas and calculate the dot product in one cycle. In other embodiments, the architecture may be optimized to align the mantissas and calculate the dot product in two or three cycles. In one embodiment, the floating-point values may be FP16 values. In another embodiment, the floating-point values may be FP32 values or bfloat16 values. The area-optimization aspect of the architecture may be realized in a space that is relatively smaller than the space used by an architecture that covers the full range of exponent values associated with a particular floating-point format.

In some cases, the mantissa alignment of relatively small floating-point values with relatively large floating-point values can be ignored, or approximated by truncation, without significant performance degradation. This is because adding a relatively small floating-point value to a large floating-point value will not significantly affect the dot-product calculation. In one embodiment, the range of the alignment capability of the architecture disclosed herein may be smaller than the range of exponent values expected to be processed, and relatively small floating-point values may be ignored or partially truncated while still providing sufficiently accurate results. In the case of partial truncation, correct alignment should still occur, but only part of the aligned product is added to the other values.
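The claim that small terms can be ignored or truncated can be checked numerically. The sketch below is an illustration only, with hypothetical names and made-up operand values: it compares an exact rational sum of products against a sum in which products beyond the shifter range are dropped and in-range products lose their low bits to the right shift.

```python
from fractions import Fraction


def exact_sum(products, exponents):
    """Exact value of sum(p * 2**e), computed with rational arithmetic."""
    return sum(Fraction(p) * Fraction(2) ** e for p, e in zip(products, exponents))


def truncated_sum(products, exponents, max_shift):
    """Align to the maximum exponent; drop products beyond the shifter range."""
    e_max = max(exponents)
    total = 0
    for p, e in zip(products, exponents):
        shift = e_max - e
        if shift <= max_shift:
            total += p >> shift        # partial truncation: low bits fall off
        # else: the product is ignored entirely
    return Fraction(total) * Fraction(2) ** e_max


def relative_error(products, exponents, max_shift):
    """Relative error introduced by ignoring or truncating small products."""
    exact = exact_sum(products, exponents)
    approx = truncated_sum(products, exponents, max_shift)
    return abs(exact - approx) / abs(exact)
```

With a product of `1 << 21` at exponent 0 and a product of `3 << 20` at exponent -30, an 8-bit shifter ignores the second term, and the relative error is on the order of 1e-9.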

A DNN accelerator commonly computes the dot product of FP16 activation values and corresponding FP16 weight values. Considering only normalized FP16 numbers, X = [x_0, ..., x_{n-1}] may be defined as a vector of FP16 activation values, and W = [w_0, ..., w_{n-1}] may be defined as a vector of FP16 weight values. Computing a dot product of FP16 values involves an element-wise multiplication of the signed mantissas of the two vectors and an addition of the exponents to compute the product exponents. For example, the dot product p of X and W may be determined as follows.
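A form of this sum that is consistent with the symbols defined in the following paragraph can be sketched as below. It assumes normalized operands, so each mantissa carries a hidden leading 1, and it assumes that e_xi and e_wi are the stored (biased) exponent fields, so the FP16 bias of 15 is removed from each; both points are assumptions of this sketch rather than statements taken from the equation itself.

```latex
\begin{aligned}
p &= \sum_{i=0}^{n-1} x_i \, w_i \\
  &= \sum_{i=0}^{n-1} (-1)^{s_{xi}} \, 2^{e_{xi}-15} \, (1.m_{xi}) \cdot (-1)^{s_{wi}} \, 2^{e_{wi}-15} \, (1.m_{wi}) \\
  &= \sum_{i=0}^{n-1} (-1)^{s_i} \, 2^{e_i} \, m_i ,
\qquad
s_i = s_{xi} \oplus s_{wi}, \quad
e_i = e_{xi} + e_{wi} - 30, \quad
m_i = (1.m_{xi})(1.m_{wi}) .
\end{aligned}
```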

Here, i is an index, s_wi is the sign of weight value i, e_wi is the exponent of weight value i, m_wi is the mantissa value of weight value i, s_xi is the sign of activation value i, e_xi is the exponent of activation value i, m_xi is the mantissa value of activation value i, s_i is the sign of product i, e_i is the exponent of product i, and m_i is the mantissa of product i.

In an FP16 dot-product calculation, the products (mantissas) may be added once the exponents of the products have been aligned, typically to the exponent having the maximum value. FIG. 2 depicts a first example embodiment of a dot-product computation architecture 200 that aligns the exponents of the mantissa products so that the mantissas may be added, while being optimized for area and power, according to an embodiment of the present disclosure. In one embodiment, the architecture 200 may include an array of multiplier units 201_0 through 201_{n-1} (in which n is an integer), an addition unit (i.e., an adder tree) 202, and an accumulation logic unit 203. A max-tree logic unit 207 may be coupled to the exponent logic section 205 of each multiplier unit 201. A multiplier unit 201 may be configured as a module that may include hardware configured to provide the functionality described herein in connection with a multiplier unit 201.

Each multiplier unit 201 includes an integer (mantissa) logic section 204, an exponent logic section 205, and a shifter 206. The integer logic section 204, the exponent logic section 205, and the shifter 206 may be formed from discrete components, such as transistors, interconnecting conductors, biasing components, and/or discrete logic components. Each multiplier unit 201 receives the mantissa and the exponent of an element of an X input vector and of the corresponding element of a W input vector. In one embodiment, the X vector may include elements that are activation values, and the W vector may include elements that are weight values. The integer logic section 204 receives, for example, the value 1.m_x for an activation value and the value 1.m_w for the corresponding weight value, and multiplies the two mantissa values to form a product mantissa. The product-mantissa value is output to the shifter 206. Although the activation value and the corresponding weight value are given here as normalized numbers, subnormal values (0.m_x and/or 0.m_w) are also supported by the multiplier unit 201.
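The integer multiplication performed by the integer logic section can be illustrated with a small Python sketch. It is not the circuit of the disclosure: the function names are hypothetical, and the only facts relied on are the FP16 field layout (1 sign bit, 5 exponent bits, 10 fraction bits) and the hidden bit being 1 for normalized and 0 for subnormal values.

```python
def fp16_fields(bits: int):
    """Split a 16-bit FP16 pattern into (sign, exponent field, 10-bit fraction)."""
    return (bits >> 15) & 0x1, (bits >> 10) & 0x1F, bits & 0x3FF


def significand(exp_field: int, frac: int) -> int:
    """11-bit integer significand: hidden bit 1 (normal) or 0 (subnormal)."""
    hidden = 1 if exp_field != 0 else 0
    return (hidden << 10) | frac


def mantissa_product(x_bits: int, w_bits: int) -> int:
    """Integer product of the two significands (1.m_x * 1.m_w scaled by 2**20)."""
    _, e_x, f_x = fp16_fields(x_bits)
    _, e_w, f_w = fp16_fields(w_bits)
    return significand(e_x, f_x) * significand(e_w, f_w)


# 0x3E00 encodes 1.5 in FP16, so the product represents 1.5 * 1.5 = 2.25.
prod = mantissa_product(0x3E00, 0x3E00)
print(prod, prod / 2**20)
```

Because each significand fits in 11 bits, the product fits in 22 bits, which is the value handed to the shifter for alignment.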

Alignment of the mantissas may be accomplished by the shifter 206 shifting each product-mantissa value by the difference between the exponent value of that product and the maximum exponent value among the products in the array of multiplier units 201. For FP16, the exponent range of a product is [-28, 30], so in the extreme case the alignment shift for FP16 may be as large as 58 bits. The alignment overhead of a dot-product computation architecture that covers the full range of FP16 exponents may therefore be significant (shifts of up to 58 bits), and the addition logic may include, for example, an 80-bit adder tree.
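The [-28, 30] range and the 58-bit worst-case shift follow from the normalized FP16 exponent range. A minimal arithmetic sketch, assuming the standard FP16 bias of 15 and exponent fields 1 through 30 for normalized numbers (constant and function names are illustrative):

```python
BIAS = 15
MIN_NORMAL_FIELD = 1    # exponent field 0 is reserved for zero/subnormals
MAX_NORMAL_FIELD = 30   # exponent field 31 is reserved for infinity/NaN


def product_exponent_range():
    """Range of the unbiased exponent of a product of two normalized FP16 values."""
    lo = 2 * (MIN_NORMAL_FIELD - BIAS)
    hi = 2 * (MAX_NORMAL_FIELD - BIAS)
    return lo, hi


lo, hi = product_exponent_range()
max_alignment_shift = hi - lo
print(lo, hi, max_alignment_shift)
```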

The distribution of differences between exponent values and the maximum exponent value for FP16 data associated with deep learning generally does not span the full 58-bit range for FP16. Instead, it tends to cover a much smaller range of values. For example, FIG. 3A shows a histogram of the distribution of maximum exponent values for deep-learning FP16 data. FIG. 3B shows a histogram of the distribution of maximum-minus-exponent values (the maximum exponent value minus each exponent value) for deep-learning FP16 data. Finally, FIG. 3C shows a histogram of the distribution of product exponent values for deep-learning FP16 data.

The relatively limited range of the bell-shaped distributions of the deep-learning FP16 data shown in FIGS. 3A-3C may be used by the subject matter disclosed herein to provide a dot-product architecture that is optimized in area and power for the commonly occurring cases of FP16 operations on deep-learning data. In particular, the characteristics of the deep-learning FP16 data shown in FIGS. 3A-3C may be used to provide a dot-product architecture having a right-shift capability far smaller than the 58-bit shift that may be needed to accommodate the full range of exponent values for normalized FP16 data.

Returning to FIG. 2, the exponent logic section 205 adds the exponents of corresponding elements of the X and W vectors as part of the multiplication operation provided by a multiplier unit 201. The exponent logic section 205 may include a first adder 209 and a second adder 210. The first adder 209 determines an unbiased exponent value e based on the sum of an FP16 activation exponent value e_x and the corresponding FP16 weight exponent value e_w. The second adder 210 subtracts a maximum exponent value e_m from the unbiased exponent value e to form a relative exponent value e'. The maximum exponent value e_m may be determined by the max-tree logic unit 207. The relative exponent value e' may be used to control the amount of right shift applied to the mantissa product value by the shifter 206.

Together, the exponent logic sections 205 of the multiplier units 201 and the max-tree logic unit 207 may form an exponent handling unit 208 for the architecture 200. The max-tree logic unit 207 determines the maximum exponent value e_m. The max-tree logic unit 207 is coupled to each of the multiplier units 201_0 through 201_{n-1} and receives the unbiased exponent value e from each exponent logic section 205 of the array. The max-tree logic unit 207 determines the maximum exponent value e_m among the received unbiased exponent values e, and outputs the maximum exponent value e_m to an input of the second adder 210 in each exponent logic section 205.
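The exponent path (first adder, max tree, second adder) can be sketched in Python as follows. The sketch assumes that e_x and e_w are biased FP16 exponent fields with a bias of 15, and it models the max tree as a pairwise reduction; the function names are hypothetical and the real hardware structure may differ.

```python
BIAS = 15


def unbiased_product_exponent(e_x: int, e_w: int) -> int:
    """First adder: sum of the two exponent fields with both biases removed."""
    return (e_x - BIAS) + (e_w - BIAS)


def max_tree(values):
    """Pairwise (tree-shaped) reduction that returns the maximum value."""
    level = list(values)
    while len(level) > 1:
        nxt = [max(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]


def relative_exponents(e_x_fields, e_w_fields):
    """Second adder: e' = e - e_m for every product in the array."""
    e = [unbiased_product_exponent(a, b) for a, b in zip(e_x_fields, e_w_fields)]
    e_m = max_tree(e)
    return e_m, [e_i - e_m for e_i in e]


# Exponent fields chosen so that the product exponents are 10, 2, 3 and 8.
e_m, rel = relative_exponents([20, 16, 17, 19], [20, 16, 16, 19])
print(e_m, rel)
```

Every relative exponent is zero or negative, so its magnitude can be used directly as a right-shift amount.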

The shifter 206 may be configured to right-shift the mantissa product value by up to a maximum right bit shift R-Shift_max, in which R-Shift_max may be selected based on the distribution of the range of exponent values of the deep-learning FP16 data expected to be processed by the dot-product computation architecture 200. (In a situation in which an exponent difference greater than R-Shift_max is encountered, the products may be aligned using the multi-cycle technique described below.) The relative exponent value e' output from the exponent logic section 205 is used to control the number of right bit shifts provided by the shifter 206 for a given dot-product computation. In one embodiment, R-Shift_max may be selected to be 8 bits to account for an example distribution of the range of exponent values of the deep-learning FP16 data to be processed by the architecture 200. Limiting the shifter 206 to 8 bits, for example, allows the dot-product architecture 200 to provide a dot-product computation operation that is optimized in area and power. It should be understood that R-Shift_max may be selected to be any integer value.
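A shifter with a bounded shift range can be sketched as below. Returning `None` for an out-of-range shift is purely an illustrative choice made here; in the disclosure such products are either dropped, truncated, or handled by the multi-cycle technique. The function name and the non-negative mantissa product are assumptions of the sketch.

```python
def local_right_shift(mantissa_product: int, rel_exp: int, r_shift_max: int = 8):
    """Right-shift a non-negative mantissa product by |e'| if the shifter can do it.

    rel_exp is e' = e - e_m, which is zero or negative.
    Returns None when the required shift exceeds r_shift_max.
    """
    shift = -rel_exp
    if shift < 0:
        raise ValueError("relative exponent must not be positive")
    if shift > r_shift_max:
        return None                      # outside the shifter's capability
    return mantissa_product >> shift     # low-order bits are truncated


print(local_right_shift(192, -2))   # within range
print(local_right_shift(192, -9))   # exceeds an 8-bit maximum shift
```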

The aligned product values output from the shifters 206 are input to the addition unit 202 and added. The output of the addition unit 202 is accumulated in the accumulation logic unit 203. The maximum exponent value e_m is also input to the accumulation logic unit 203 because the exponent of the sum produced by the adder tree 202 is the maximum exponent value e_m. The sum is then added to (i.e., accumulated with) the value stored in the accumulation logic unit 203. For example, suppose the accumulation logic unit 203 is storing a previous sum. That sum is a floating-point number having an integer value (the adder-tree output) and an exponent value (the maximum exponent). The next sum may have a different exponent (i.e., another maximum exponent value e_m). The maximum exponent value e_m is therefore input to the accumulation logic unit 203 so that the new sum from the adder tree can be aligned with, and added to, the value stored in the accumulator.
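The align/add step in the accumulator can be sketched with (integer value, exponent) pairs. The text does not say which operand is shifted, so this sketch assumes that the operand with the smaller exponent is right-shifted to match the larger one; accumulator width and rounding are not modeled, and the function name is hypothetical.

```python
def accumulate(acc, new):
    """Align two (integer value, exponent) pairs and add them.

    Assumption: the pair with the smaller exponent is right-shifted,
    so low-order bits of that operand may be truncated.
    """
    acc_val, acc_exp = acc
    new_val, new_exp = new
    if new_exp >= acc_exp:
        return (acc_val >> (new_exp - acc_exp)) + new_val, new_exp
    return acc_val + (new_val >> (acc_exp - new_exp)), acc_exp


# 40 * 2**3 + 6 * 2**5 = 320 + 192 = 512 = 16 * 2**5
print(accumulate((40, 3), (6, 5)))
```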

FIGS. 4A and 4B depict a second example embodiment of a dot-product computation architecture 400 that uses a multi-cycle technique to align the exponents of the mantissa products so that the mantissas may be added, while also being optimized for area and power, according to the subject matter disclosed herein. The architecture 400 may include an array of multiplier units 401_0 through 401_{n-1} (in which n is an integer), an addition unit (i.e., an adder tree) 402, an auxiliary shifter 411, and an accumulation logic unit 403. A max-tree logic unit 407 may be coupled to the exponent logic section 405 of each multiplier unit 401. A multiplier unit 401 may be configured as a module that may include hardware configured to provide the functionality described herein in connection with a multiplier unit 401.

Each multiplier unit 401 may include an integer logic section 404, an exponent logic section 405, and a local shifter 406. The integer logic section 404, the exponent logic section 405, and the local shifter 406 may be formed from discrete components, such as transistors, interconnecting conductors, biasing components, and/or discrete logic components. Each multiplier unit 401 receives the mantissa and the exponent of an element of an X input vector and of the corresponding element of a W input vector. The integer logic section 404 receives, for example, the value 1.m_x for an activation value and the value 1.m_w for the corresponding weight value, and multiplies the two mantissa values to form a product-mantissa value. The product-mantissa value is output to the local shifter 406. A sign multiplier 413 also inputs a sign signal to the local shifter 406. As in FIG. 2, the example activation value and corresponding weight value in FIGS. 4A and 4B are given as normalized numbers, and subnormal values (0.m_x and/or 0.m_w) are also supported by the multiplier unit 401.

The exponent logic section 405 may include a first adder 409 and a second adder 410. The first adder 409 determines an unbiased exponent value e based on the sum of an FP16 activation exponent value e_x and the corresponding FP16 weight exponent value e_w. The second adder 410 subtracts a maximum exponent value e_m from the unbiased exponent value e to form a relative exponent value e'. The maximum exponent value e_m may be determined by the max-tree logic unit 407. The relative exponent value e' is used to generate a local shift amount that controls the amount of right shift applied to the mantissa product by the local shifter 406.

The local shifter 406 is configured to right-shift the mantissa product value by up to a maximum right bit shift R-Shift_max, in which R-Shift_max may be selected based on the distribution of the range of exponent values of the deep-learning FP16 data expected to be processed by the dot-product computation architecture 400. The architecture 400 can handle a situation in which an exponent difference greater than R-Shift_max is encountered by using a multi-cycle technique to align the product values. In one embodiment, R-Shift_max may be selected to be 8 bits to account for an example distribution of the range of exponent values of the deep-learning FP16 data to be processed by the architecture 400. It should be understood that R-Shift_max may be selected to be any integer value.

Together, the exponent logic sections 405 of the multiplier units 401 and the max-tree logic unit 407 may form an exponent handling unit 408 for the architecture 400. The max-tree logic unit 407 determines the maximum exponent value e_m. The max-tree logic unit 407 is coupled to each of the multiplier units 401_0 through 401_{n-1} and receives the unbiased exponent value e from each exponent logic section 405 of the array. The max-tree logic unit 407 determines the maximum exponent value e_m among the received unbiased exponent values e, and outputs the maximum exponent value e_m to an input of the second adder 410 in each exponent logic section 405.

The aligned mantissa product values output from the local shifters 406 are input to the addition unit 402 and added. The output of the addition unit 402 is accumulated in the accumulation logic unit 403 after any additional shifting that may be provided by the auxiliary shifter 411 in the multi-cycle technique described below. The auxiliary shifter 411 extends the range of exponent difference values that the architecture 400 can handle, while keeping the physical area dedicated to exponent shifting relatively small compared with 58-bit shifters that would cover the full range of exponents for FP16 values. For example, if the local shifters 406 and the auxiliary shifter 411 are 8-bit shifters, the total physical area dedicated to shifters in the architecture 400 is n times the area of an 8-bit local shifter 406 plus the area of one 8-bit auxiliary shifter 411, that is, (n + 1) * (area of an 8-bit shifter). By contrast, an architecture that covers the full FP16 exponent range with a 58-bit shifter in each multiplier unit would dedicate n * (area of a 58-bit shifter) to shifters. It should be understood that the auxiliary shifter 411 is not limited to an 8-bit shifter and may be a shifter of any bit-shifting size. For example, in one embodiment the auxiliary shifter 411 may be a 32-bit shifter.
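The area comparison can be made concrete with a toy calculation. The sketch below assumes, purely for illustration, that shifter area is proportional to the maximum shift width; real shifter area does not scale this simply, and the array size n = 16 is a hypothetical value.

```python
def shifter_area(n_units: int, local_bits: int, aux_bits: int = 0) -> int:
    """Total shifter area in arbitrary units.

    ASSUMPTION: area is proportional to the maximum shift width in bits.
    """
    return n_units * local_bits + aux_bits


n = 16  # hypothetical number of multiplier units
multi_cycle_area = shifter_area(n, local_bits=8, aux_bits=8)   # (n + 1) * 8
full_range_area = shifter_area(n, local_bits=58)               # n * 58
print(multi_cycle_area, full_range_area)
```

Under this crude model, the multi-cycle arrangement uses (n + 1) * 8 units of area against n * 58 for per-unit full-range shifters.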

The output of the auxiliary shifter 411 is accumulated in the accumulation logic unit 403. The maximum exponent value e_m is also input to the accumulation logic unit 403 because the exponent of the sum produced by the addition unit 402 and output from the auxiliary shifter 411 is the maximum exponent value e_m. The sum is then added to (i.e., accumulated with) the value stored in the accumulation logic unit 403. For example, suppose the accumulation logic unit 403 is storing a previous sum. That sum is a floating-point number having an integer value (the adder-tree output) and an exponent value (the maximum exponent). The next sum may have a different exponent (i.e., another maximum exponent value e_m). The maximum exponent value e_m is therefore input to the accumulation logic unit 403 so that the new sum from the adder tree can be aligned with, and added to, the value stored in the accumulator.

FIG. 4B depicts additional details of the second example embodiment of the dot-product computation architecture 400 that are not shown in FIG. 4A. More specifically, FIG. 4B depicts details of an example embodiment of the exponent handling unit (EHU) 408 and a mask unit 414 according to the subject matter disclosed herein. As shown in FIG. 4B, the EHU 408 includes a mask generator/cycle counter 412 that is coupled to the maximum exponent value e_m and to each of the relative exponent value e' outputs. The mask generator/cycle counter 412 uses the maximum exponent value e_m and each relative exponent value e' to determine a mask and a local shift amount for each product. The mask generator/cycle counter 412 also generates a cycle # signal. The mask unit 414 may include n AND gates 415_0 through 415_{n-1}. The mask unit 414 receives the mask_i signals output from the mask generator/cycle counter 412.

In operation, during cycle # k, the products having a relative exponent value e' in the range between k * R-Shift_max and (k + 1) * R-Shift_max can be aligned, and the mask generator/cycle counter 412 outputs a mask_i signal having a value of 1 for each such product. The mask_i signal is input to one input of the AND gate 415_i. Products having a relative exponent value e' outside that range during cycle # k are masked out of the dot-product computation for that cycle, and the mask generator/cycle counter 412 outputs a mask_i signal having a value of 0 for those products. For an unmasked product, the mask generator/cycle counter 412 determines the local shift amount as e' - k * R-Shift_max. The value of cycle # k is used by the auxiliary shifter 411 to apply the remaining shift of k * R-Shift_max.
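The per-cycle mask and shift split can be sketched in Python. The text does not state which end of each range is inclusive; this sketch assumes the upper end is inclusive (a difference of exactly R-Shift_max is handled in cycle 0), and it works with the magnitude of e'. The function names are hypothetical.

```python
def schedule(rel_exp: int, r_shift_max: int):
    """Return (cycle k, local shift, auxiliary shift) for one product.

    Assumption: differences d with k*R < d <= (k+1)*R go to cycle k,
    and d == 0 goes to cycle 0.
    """
    d = abs(rel_exp)
    k = 0 if d == 0 else (d - 1) // r_shift_max
    return k, d - k * r_shift_max, k * r_shift_max


def masks_for_cycle(rel_exps, r_shift_max: int, k: int):
    """mask_i is 1 for products aligned in cycle k and 0 otherwise."""
    return [1 if schedule(e, r_shift_max)[0] == k else 0 for e in rel_exps]


rel = [0, -8, -7, -2]
print(schedule(-8, 5))              # cycle, local shift, auxiliary shift
print(masks_for_cycle(rel, 5, 0))
print(masks_for_cycle(rel, 5, 1))
```

With this split, the local shift never exceeds R-Shift_max, and the auxiliary shift is always a multiple of R-Shift_max selected by the cycle number.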

In one embodiment, a mask signal may be generated to mask out very small floating-point values, because adding a relatively small floating-point value to a large floating-point value does not significantly affect the dot-product computation.

FIGS. 5A and 5B depict an example of a two-cycle alignment process for a dot-product computation architecture 500 that processes a pair of four-element vectors according to the subject matter disclosed herein. Although the architecture 500 is depicted as processing a pair of four-element vectors, it should be understood that the operational details described in connection with FIGS. 5A and 5B are generally the same regardless of the number of vector elements processed by the architecture 500.

The architecture 500 includes four multiplier units (not shown), an adder tree 502, an accumulation logic unit 503, an exponent handling unit (EHU) 508, and an auxiliary shifter 511. A mask unit is not explicitly shown in FIGS. 5A and 5B. Each multiplier unit may include a local shifter 506 and an AND gate 515. In this example, the local shifters 506 and the auxiliary shifter 511 may be configured to have an R-Shift_max equal to 5 bits. The EHU 508 may include adders, a max tree, and a mask generator/cycle counter, similar to what is depicted for the EHU 408 in FIG. 4B.

Referring to FIGS. 5A and 5B, consider a situation in which the four multiplier units generate products A, B, C, and D having exponents of 10, 2, 3, and 8, respectively. The products A through D are respectively input to local shifters 506_0 through 506_3. During a first cycle (i.e., cycle # 0), shown in FIG. 5A, the max tree of the EHU 508 determines that the exponent value 10 of product A is the maximum exponent e_m. The magnitudes of the relative exponents e' for the four products are 0, 8, 7, and 2, respectively. That is, for product A, the relative exponent e'_A is e_A - e_m = 10 - 10 = 0. For product B, the relative exponent e'_B is e_B - e_m = 2 - 10 = -8. For product C, the relative exponent e'_C is e_C - e_m = 3 - 10 = -7, and for product D, the relative exponent e'_D is e_D - e_m = 8 - 10 = -2. The sign computed for a relative exponent is negative because the exponents are aligned to e_m and the mantissas are therefore always shifted to the right. The absolute value of each relative exponent e' is input to the corresponding local shifter 506.

Still during the first cycle, the mask signals mask_0 through mask_3 are generated by the mask generator/cycle counter of the EHU 508 based on the relative exponent value e' of each product. The mask signals mask_0 through mask_3 are applied to inputs of the AND gates 515_0 through 515_3, respectively. If a relative exponent e' has an absolute value greater than the R-Shift_max of the local shifter 506, a mask signal value of 0 is generated. If a relative exponent e' has an absolute value less than or equal to the R-Shift_max of the local shifter 506, a mask signal value of 1 is generated. In this example, the local shifters 506 have a maximum shift capability of 5 bits, and products A and D require shifts of 0 and 2 bits, respectively, which are within the 5-bit capability, so a mask signal value of 1 is generated for these two products. Products B and C require shifts of 8 and 7 bits, respectively, both of which exceed 5 bits, so a mask signal value of 0 is generated for products B and C.

The mask signal values for products A and D allow the outputs of local shifters 506_0 (output A >> 0) and 506_3 (output D >> 2) to be passed to the adder tree 502. No additional shifting by the auxiliary shifter 511 is needed for alignment.

During a second cycle (i.e., cycle # 1), shown in FIG. 5B, the EHU 508 determines that the remaining two products B and C can be processed using the local shifters 506 in combination with the auxiliary shifter 511. Products B and C are shifted right by 3 bits and 2 bits by local shifters 506_1 and 506_2, respectively. The auxiliary shifter 511 shifts the outputs of the two shifters 506_1 and 506_2 right by 5 more bits so that the mantissas are aligned. The mask generator/cycle counter of the EHU 508 outputs a cycle # signal equal to 1, which may be used to control the auxiliary shifter 511.

Accordingly, during the second cycle, the mask signals mask_1 and mask_2 are generated with a value of 1 for products B and C by the mask generator/cycle counter of the EHU 508, because for both products the absolute value of the relative exponent e' minus the R-Shift_max of the local shifter 506 is less than or equal to R-Shift_max. That is, for product B the value is 8 - 5 = 3, which is less than 5 bits, and for product C the value is 7 - 5 = 2, which is less than 5 bits. The mask signals mask_1 and mask_2 are applied to inputs of AND gates 515_1 and 515_2 so that products B and C are output to the adder tree 502 during the second cycle. The mask signals mask_0 and mask_3 are generated with a value of 0 for products A and D because these products were already output to the adder tree 502 during the first cycle. The mask signals mask_0 and mask_3 are applied to inputs of AND gates 515_0 and 515_3.
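The two-cycle example can be replayed with a short simulation. The exponents (10, 2, 3, 8) and R-Shift_max = 5 come from the example; the integer mantissa values are made up for illustration, and the sketch assumes that the auxiliary shift is applied to the adder-tree sum before accumulation. Function names are hypothetical.

```python
def cycle_of(diff: int, r_shift_max: int) -> int:
    """Cycle in which a product with exponent difference diff is aligned."""
    return 0 if diff == 0 else (diff - 1) // r_shift_max


def multi_cycle_align(mantissas, exponents, r_shift_max):
    """Return (accumulated integer sum, e_m, trace of (cycle, index, local shift))."""
    e_m = max(exponents)
    diffs = [e_m - e for e in exponents]
    n_cycles = max(cycle_of(d, r_shift_max) for d in diffs) + 1
    total, trace = 0, []
    for k in range(n_cycles):
        partial = 0
        for i, (m, d) in enumerate(zip(mantissas, diffs)):
            if cycle_of(d, r_shift_max) == k:        # mask_i == 1 in cycle k
                local = d - k * r_shift_max
                trace.append((k, i, local))
                partial += m >> local                # local shifter
        total += partial >> (k * r_shift_max)        # auxiliary shifter
    return total, e_m, trace


# Products A, B, C, D with hypothetical mantissas and the exponents of the example.
total, e_m, trace = multi_cycle_align([1024, 768, 128, 4], [10, 2, 3, 8], 5)
print(total, e_m, trace)
```

The trace reproduces the shifts of the example: A >> 0 and D >> 2 in cycle 0, then B >> 3 and C >> 2 in cycle 1, followed by the extra 5-bit auxiliary shift.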

The example operation described in FIGS. 5A and 5B completes the dot-product calculation within two cycles. The sequence disclosed herein is based on the range of exponent values of the products expected to be processed, the R-Shift_max provided by the local shifters 506, and the R-Shift_max provided by the auxiliary shifter 511. At one end of the spectrum, the number of shifts provided by the local shifters 506 and the auxiliary shifter 511 may be selected to cover the entire range of exponent values expected to be processed for a given deep-learning data set, so that the dot-product calculation can occur within one cycle. At the other end of the spectrum, the R-Shift_max provided by the local shifters 506 and the auxiliary shifter 511 may be selected in view of the distribution of the range of exponent values expected to be processed for a given deep-learning data set, so that the dot-product calculation can occur within a given number of cycles or fewer. Other cycle configurations may be selected. It should also be understood that the R-Shift_max for the local shifters 506 may be different from the R-Shift_max of the auxiliary shifter 511.
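The trade-off between shifter width and cycle count can be illustrated with a short Python sketch. It is a simplification: it assumes the local and auxiliary shifters share a single R-Shift_max step, although the text notes that they may differ. The function name and the list of alignment shifts are hypothetical.

```python
def cycles_needed(rel_shifts, r_shift_max):
    """Number of cycles to add all products for a given R-Shift_max.

    Assumes a product with alignment shift s is added in cycle
    max(0, (s - 1) // r_shift_max), counting cycles from 0.
    """
    return max(max(0, (s - 1) // r_shift_max) for s in rel_shifts) + 1


# Hypothetical alignment shifts for four products.
rel_shifts = [0, 8, 7, 2]

# Cycle count for several candidate R-Shift_max values.
cycles_by_width = {r: cycles_needed(rel_shifts, r) for r in (3, 4, 5, 8)}
```

A wider shifter covers the whole set of shifts in one cycle, while a narrower one spreads the same products over more cycles.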

Additionally, different groups or clusters of the dot-product computation architecture disclosed herein may be formed based on the different ranges of exponent values expected to be processed, so that any stalling between groups or clusters that may occur during multi-cycle processing can be minimized or optimally utilized in the overall design of a CNN accelerator.

FIG. 6 is a flowchart of a method 600 for calculating a dot product for a deep-learning network according to the subject matter disclosed herein. Referring to FIGS. 4 to 6, the method starts at 601 in FIG. 6. At 602, the Cycle # value is set to 0. At 603, the integer logic sections 404 of the array of multiplier units 401 multiply, element by element, the mantissa values of the elements of a first vector X by the mantissa values of the corresponding elements of a second vector W to form mantissa product values. In one embodiment, the first vector X may include n elements of 16-bit floating-point values and the second vector W may include n elements of 16-bit floating-point values, where n is an integer greater than 1. The first vector and the second vector may be vectors of a deep-learning data set.

At 604, the exponent logic sections 405 of the array of multiplier units 401 add, element by element, the exponent values of the elements of the first vector X to the exponent values of the corresponding elements of the second vector W to form exponent sum values e_i that respectively correspond to the mantissa product values.

At 605, the maximum tree logic unit 407 determines a maximum exponent sum value e_m.

At 606, the exponent logic sections 405 subtract the maximum exponent sum value e_m from each of the exponent sum values e_i to form relative exponent values e′_i that respectively correspond to the mantissa product values.
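Steps 603 to 606 can be sketched in Python as follows. This is an illustrative software model under stated assumptions, not the hardware described above: the function names are hypothetical, subnormal inputs are given the exponent −14 with no hidden bit (the usual IEEE 754 convention), and infinities and NaNs are rejected. Each signed mantissa product p with exponent sum e represents the value p × 2^(e − 20), because each 11-bit mantissa carries 10 fraction bits.

```python
import struct


def fp16_fields(x):
    """Return (sign, unbiased_exponent, integer_mantissa) of x as FP16."""
    bits = struct.unpack("<H", struct.pack("<e", x))[0]
    sign = bits >> 15
    exp = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    if exp == 0x1F:
        raise ValueError("infinity and NaN are not handled in this sketch")
    if exp == 0:  # zero or subnormal: no hidden bit
        return sign, -14, frac
    return sign, exp - 15, frac | 0x400  # restore the hidden bit


def multiply_stage(xs, ws):
    """Model of steps 603-606 for two equal-length vectors."""
    products, exp_sums = [], []
    for x, w in zip(xs, ws):
        sx, ex, mx = fp16_fields(x)
        sw, ew, mw = fp16_fields(w)
        p = mx * mw                                # 603: mantissa product
        products.append(-p if sx ^ sw else p)
        exp_sums.append(ex + ew)                   # 604: exponent sum e_i
    e_max = max(exp_sums)                          # 605: maximum e_m
    rel = [e - e_max for e in exp_sums]            # 606: relative exponents
    return products, exp_sums, e_max, rel


products, exp_sums, e_max, rel = multiply_stage([1.5, 2.0], [2.0, 0.25])
```

The relative exponents are zero or negative; their magnitudes are the right-shift amounts used in the alignment step.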

At 607, the local shifters 406 of the array of multiplier units 401 shift first mantissa product values by corresponding shift amounts m_i to form mantissa product values that are aligned with the mantissa product value corresponding to the maximum exponent sum value. Each local shifter 406 may be configured to have a maximum number of bit shifts R-Shift_max that is less than the entire bit range of the exponent sum values of the first vector and the second vector. Mantissa product values having a corresponding shift amount m_i less than or equal to R-Shift_max are left unmasked, and mantissa product values having a corresponding shift amount m_i greater than R-Shift_max are masked. Additionally, the first mantissa product values, which correspond to relative exponent values less than or equal to the maximum number of bit shifts of the local shifters 406, are unmasked.

At 608, the addition unit 402 adds the mantissa product value corresponding to the maximum exponent sum value and the aligned mantissa product values to form a partial dot product of the first vector X and the second vector W. Masked mantissa product values are not added to the partial dot product at this point. All mantissa product values having a corresponding relative exponent value e′ less than or equal to the maximum number of bit shifts R-Shift_max of the local shifters 406 are aligned and added in the first cycle. Mantissa product values having a corresponding relative exponent value e′ greater than the maximum number of bit shifts R-Shift_max of the local shifters 406 are shifted further so that they are aligned and added in subsequent cycles, as follows.

At 609, it is determined whether all of the mantissa product values have been added and accumulated by the accumulation logic unit 403. If so, the method ends at 610. If not, flow continues to 611, where Cycle # is incremented by 1. At 612, each mantissa product value that was masked in the previous cycle is right bit-shifted by a shift amount m_i equal to the relative shift amount e′ − (Cycle # × R-Shift_max). Mantissa product values that have already been added are masked. Mantissa product values having a relative shift amount e′ greater than (current Cycle # + 1) × R-Shift_max are also masked.

At 613, the addition unit 402 adds the mantissa product value corresponding to the maximum exponent sum value and the currently aligned mantissa product values to form a partial dot product of the first vector X and the second vector W. Masked mantissa product values are not added to the partial dot product. Flow returns to 609, where it is determined whether all of the mantissa product values have been added and accumulated by the accumulation logic unit 403. If so, the method ends at 610. If not, flow continues to 611.
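The loop formed by steps 607 to 613 can be modeled with the following Python sketch. It is a software approximation under several assumptions that are not taken from the text: the function name is hypothetical, products are signed Python integers, right shifts truncate as Python's `>>` does, and the auxiliary shift of Cycle # × R-Shift_max is applied to the adder-tree output before accumulation.

```python
def multicycle_dot(products, exp_sums, r_shift_max):
    """Align and accumulate integer products over one or more cycles.

    Returns (accumulated_integer, e_max, cycles_used). The accumulated
    integer is scaled to the maximum exponent sum e_max.
    """
    e_max = max(exp_sums)
    shifts = [e_max - e for e in exp_sums]  # magnitude of each e'
    done = [False] * len(products)
    acc = 0
    cycle = 0                               # 602: Cycle # starts at 0
    while not all(done):                    # 609: anything left to add?
        partial = 0
        for i, (p, s) in enumerate(zip(products, shifts)):
            # Masked: already added, or beyond this cycle's shift window.
            if done[i] or s > (cycle + 1) * r_shift_max:
                continue
            local = s - cycle * r_shift_max  # 607 / 612: local shift m_i
            partial += p >> local            # 608 / 613: adder tree
            done[i] = True
        acc += partial >> (cycle * r_shift_max)  # auxiliary shift, accumulate
        cycle += 1                           # 611: next cycle
    return acc, e_max, cycle


# Four equal products whose exponent sums give alignment shifts 0, 8, 7, 2.
acc, e_max, cycles = multicycle_dot([1 << 20] * 4, [0, -8, -7, -2], 5)
```

With an R-Shift_max of 5, the products with shifts 0 and 2 are added in the first cycle and those with shifts 8 and 7 in the second, so `cycles` is 2. Because every shifted bit in this example is zero, the truncating shifts lose nothing and `acc` equals the exactly aligned sum.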

FIG. 7 depicts an electronic device 700 that includes a CNN accelerator having one or more of the dot-product computation architectures disclosed herein. The electronic device 700 may be used in, but is not limited to, a computing device, a personal digital assistant (PDA), a laptop computer, a mobile computer, a web tablet, a wireless phone, a cell phone, a smartphone, a digital music player, or a wireline or wireless electronic device. The electronic device 700 may include, but is not limited to, a controller 710; an input/output device 720 such as a keypad, a keyboard, a display, a touch-screen display, a camera, and/or an image sensor; a memory 730; an interface 740; a GPU 750; and an image processing unit 760. These components are connected to each other through a bus 770. The controller 710 may include, for example, at least one microprocessor, at least one digital signal processor, at least one microcontroller, or the like. The memory 730 may be configured to store command code to be used by the controller 710 or user data. One or both of the GPU 750 and the image processing unit 760 may include one or more of the dot-product computation architectures disclosed herein.

The electronic device 700 and the various system components of the electronic device 700 may include the image processing unit 760. The interface 740 may be configured to include a wireless interface that transmits data to, or receives data from, a wireless communication network using an RF signal. The wireless interface 740 may include, for example, an antenna. The electronic device 700 may also be used with a communication interface protocol of a communication system such as, but not limited to, Code Division Multiple Access (CDMA), Global System for Mobile Communications (GSM), North American Digital Communications (NADC), Extended Time Division Multiple Access (E-TDMA), Wideband CDMA (WCDMA), CDMA2000, Wi-Fi, Municipal Wi-Fi (Muni Wi-Fi), Bluetooth, Digital Enhanced Cordless Telecommunications (DECT), Wireless Universal Serial Bus (Wireless USB), Fast low-latency access with seamless handoff Orthogonal Frequency Division Multiplexing (Flash-OFDM), IEEE 802.20, General Packet Radio Service (GPRS), iBurst, Wireless Broadband (WiBro), WiMAX, WiMAX-Advanced, Universal Mobile Telecommunication Service - Time Division Duplex (UMTS-TDD), High Speed Packet Access (HSPA), Evolution Data Optimized (EVDO), Long Term Evolution - Advanced (LTE-Advanced), Multichannel Multipoint Distribution Service (MMDS), Fifth-Generation Wireless (5G), and so forth.

Embodiments of the subject matter and the operations described in this specification may be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification may be implemented as one or more computer programs, i.e., one or more modules of computer-program instructions, encoded on a computer storage medium for execution by, or to control the operation of, a data-processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to a suitable receiver apparatus for execution by a data-processing apparatus. A computer storage medium may be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial-access memory array or device, or a combination thereof. Moreover, while a computer storage medium is not a propagated signal, a computer storage medium may be a source or destination of computer-program instructions encoded in an artificially generated propagated signal. The computer storage medium may also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices). Additionally, the operations described in this specification may be implemented as operations performed by a data-processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.

While this specification may contain many specific implementation details, the implementation details should not be construed as limitations on the scope of any claimed subject matter, but rather as descriptions of features specific to particular embodiments. Certain features that are described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, particular embodiments of the subject matter have been described herein. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims may be performed in a different order and still achieve desirable results. Additionally, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

As will be recognized by those skilled in the art, the innovative concepts described herein may be modified and varied over a wide range of applications. Accordingly, the scope of the claimed subject matter should not be limited to any of the specific exemplary teachings discussed above, but is instead defined by the following claims.

Claims

1. An apparatus for calculating a dot product of a first vector and a second vector, the apparatus comprising:
a multiplier array, a multiplier of the multiplier array comprising:
integer logic that multiplies integer values of corresponding elements of the first vector and the second vector to form a product integer value, the first vector and the second vector comprising floating-point values;
exponent logic that adds exponent values corresponding to the integer values of the corresponding elements of the first vector and the second vector to form an unbiased exponent value corresponding to the product integer value; and
a local shifter that forms a first shifted value by shifting the product integer value by a number of bits in a preset direction, based on a difference value between the unbiased exponent value corresponding to the product integer value and a maximum unbiased exponent value for the multiplier array being less than or equal to a preset maximum bit-shift capacity of the local shifter;
a maximum tree unit that determines the maximum unbiased exponent value for the multiplier array;
an adder tree that adds first shifted values output from the local shifters of the multiplier array to form a first output; and
an accumulator that accumulates the first output of the adder tree.

2. The apparatus of claim 1,
further comprising a mask generator that generates a first mask coupling the first shifted value to the adder tree, based on the difference value between the unbiased exponent value corresponding to the product integer value and the maximum unbiased exponent value being less than or equal to the preset maximum bit-shift capacity of the local shifter corresponding to the first shifted value,
wherein the adder tree adds the first shifted values that are output from the local shifters of the multiplier array and coupled to the adder tree by the first mask to form the first output.

3. The apparatus of claim 2,
wherein the mask generator generates the first mask during a first cycle.

4. The apparatus of claim 2,
wherein the mask generator further generates a second mask coupling the first shifted value to the adder tree, based on the difference value between the unbiased exponent value corresponding to the product integer value and the maximum unbiased exponent value being greater than the preset maximum bit-shift capacity of the local shifter corresponding to the first shifted value,
wherein the adder tree further adds the first shifted values that are output from the local shifters of the multiplier array and coupled to the adder tree by the second mask to form a second output,
the apparatus further comprising an auxiliary shifter that is coupled to the adder tree and that forms a second shifted value by shifting the second output from the adder tree by the preset maximum bit-shift capacity of the local shifter,
and wherein the accumulator further accumulates the second output of the adder tree.

5. The apparatus of claim 4,
wherein the mask generator generates the second mask during a second cycle.

6. The apparatus of claim 1,
wherein the first vector comprises an activation value and the second vector comprises a weight value.

7. The apparatus of claim 6,
wherein the activation value and the weight value comprise 16-bit floating-point (FP16) values.

8. The apparatus of claim 6,
wherein the activation value and the weight value comprise 32-bit floating-point (FP32) values.

9. A multiplier, comprising:
integer logic that multiplies integer values of elements of a first vector by integer values of corresponding elements of a second vector to form a product integer value;
exponent logic that adds exponent values corresponding to the integer values of the corresponding elements of the first vector and the second vector to form an unbiased exponent value corresponding to the product integer value; and
a local shifter that forms a first shifted value by shifting the product integer value by a number of bits in a preset direction, based on a difference value between the unbiased exponent value corresponding to the product integer value and a preset value being less than or equal to a preset maximum bit-shift capacity of the local shifter.

10. The multiplier of claim 9,
wherein the multiplier is part of a multiplier array, the multiplier array comprising:
a maximum tree unit that determines the preset value, the preset value comprising a maximum unbiased exponent value for the multiplier array;
an adder tree that adds first shifted values output from the local shifters of the multiplier array to form a first output; and
an accumulator that accumulates the first output of the adder tree.

11. The multiplier of claim 10,
further comprising a mask generator that generates a first mask coupling the first shifted value to the adder tree, based on the difference value between the unbiased exponent value corresponding to the product integer value and the maximum unbiased exponent value being less than or equal to the preset maximum bit-shift capacity of the local shifter corresponding to the first shifted value,
wherein the adder tree adds the first shifted values that are output from the local shifters of the multiplier array and coupled to the adder tree by the first mask to form the first output, and
wherein the mask generator generates the first mask during a first cycle.

12. The multiplier of claim 11,
wherein the mask generator further generates a second mask coupling the first shifted value to the adder tree, based on the difference value between the unbiased exponent value corresponding to the product integer value and the maximum unbiased exponent value being greater than the preset maximum bit-shift capacity of the local shifter corresponding to the first shifted value,
wherein the mask generator generates the second mask during a second cycle,
wherein the adder tree further adds the first shifted values that are output from the local shifters of the multiplier array and coupled to the adder tree by the second mask to form a second output,
wherein the multiplier array further comprises an auxiliary shifter that is coupled to the adder tree and that forms a second shifted value by shifting the second output from the adder tree by the preset maximum bit-shift capacity of the local shifter,
and wherein the accumulator further accumulates the second output of the adder tree.

13. The multiplier of claim 9,
wherein the first vector comprises an activation value and the second vector comprises a weight value.

14. The multiplier of claim 13,
wherein the activation value and the weight value comprise 16-bit floating-point (FP16) values.

15. The multiplier of claim 13,
wherein the activation value and the weight value comprise 32-bit floating-point (FP32) values.

16. A method for calculating a dot product of floating-point values, the method comprising:
multiplying element-wise, by integer logic of a multiplier array, integer values of elements of a first vector by integer values of corresponding elements of a second vector to form integer product values, the first vector comprising n elements of 16-bit floating-point values and the second vector comprising n elements of 16-bit floating-point values, where n is an integer greater than one;
adding, by exponent logic of the multiplier array, exponent values of the elements of the first vector to exponent values of the corresponding elements of the second vector to form exponent sum values that respectively correspond to the integer product values;
determining a maximum exponent sum value of the exponent sum values;
subtracting, by the exponent logic, the maximum exponent sum value from each of the exponent sum values to form relative exponent values that respectively correspond to the integer product values;
right bit-shifting, by first local shifters of the multiplier array, first integer product values by the corresponding relative exponent values to form integer product values that are aligned with the integer product value corresponding to the maximum exponent sum value, each of the first local shifters being configured to have a first preset maximum number of bit shifts that is less than the entire bit range of the exponent sum values of the first vector and the second vector, and the first integer product values corresponding to relative exponent values that are less than or equal to the first preset maximum number of bit shifts; and
adding the first integer product values that are aligned with the integer product value corresponding to the maximum exponent sum value to form a dot product of the first vector and the second vector.

17. The method of claim 16,
further comprising right bit-shifting, by an auxiliary shifter, second integer product values by a second preset number of bit shifts to form second integer product values that are aligned with the integer product value corresponding to the maximum exponent sum value, wherein the second preset number of bit shifts includes the first preset maximum number of bit shifts.

18. The method of claim 17,
wherein the entire bit range of the exponent values of the first vector and the second vector comprises 58 bits, and
wherein a sum of the first preset maximum number of bit shifts and the second preset number of bit shifts is less than or equal to 58 bits.

19. The method of claim 17,
wherein the elements of the first vector and the elements of the second vector comprise 16-bit floating-point (FP16) values.

20. The method of claim 17,
wherein the elements of the first vector and the elements of the second vector comprise 32-bit floating-point (FP32) values.