KR20210059623A

KR20210059623A - Electronic device and method for inference Binary and Ternary Neural Networks

Info

Publication number: KR20210059623A
Application number: KR1020200148127A
Authority: KR
Inventors: 아르납 로이; 사프타라시 다스; 안쿠르 데쉬왈; 키란 콜라르 찬드라세카란; 이세환
Original assignee: 삼성전자주식회사
Priority date: 2019-11-15
Filing date: 2020-11-06
Publication date: 2021-05-25

Abstract

An embodiment provides a method of calculating a dot product for binary data, ternary data, non-binary data, and non-ternary data using an electronic device. The method includes designing the computation of the dot product on ternary data. The method includes designing a fused bitwise data path to support dot product calculations for binary data and ternary data. The method includes designing a full precision layer (FPL) data path to compute a dot product between one of the non-binary data and the non-ternary data and one of the binary data and the ternary data. The method includes distributing the dot product of the binary data and the ternary data, and the dot product between one of the non-binary data and the non-ternary data and one of the binary data and the ternary data, to the fused bitwise data path and the FPL data path.

Description

Electronic device and method for inference Binary and Ternary Neural Networks

The present disclosure relates to an electronic device and, more particularly, to an electronic device and a method for inference with a binary neural network (BNN) and a ternary neural network (TNN).

A convolutional neural network (CNN) is the type of deep neural network most commonly applied in computer vision. However, in conventional central processing unit/graphics processing unit (CPU/GPU) based systems that use CNNs, high computing accuracy is achieved at the cost of significant energy consumption and overhead. Model quantization techniques are used to alleviate the problems caused by CNN-based computing. Binary neural networks (BNN) and ternary neural networks (TNN) are quantization techniques that represent model data (i.e., parameters, activations, etc.) in 1 bit and 2 bits, respectively. The various BNN and TNN models (i.e., data types) and the operands used in each model may be as shown in Table 1.

Table 1

Data type    Operand 1     Operand 2
BNN_A        {0,1}         {0,1}
BNN_B        {-1,+1}       {-1,+1}
TNN          {-1,0,+1}     {-1,0,+1}
TNN_BNN_A    {-1,0,+1}     {0,1}
TNN_BNN_B    {-1,0,+1}     {-1,+1}

In general, a hardware accelerator in an electronic device can process a specific data type, but it cannot process all data types. Therefore, multiple hardware accelerators are required in an electronic device to process multiple data types. Implementing multiple hardware accelerators on the electronic chip of the electronic device requires more area, which undesirably increases the overall size of the electronic device. In addition, a large number of components are required to implement multiple hardware accelerators in an electronic device, which undesirably increases the manufacturing cost of the electronic device. Furthermore, the power consumed in operating multiple hardware accelerators increases the overall power consumption of the electronic device. Accordingly, it is desirable to address the aforementioned drawbacks or at least to provide a useful alternative.

The main object of the embodiments of the present specification is to provide a method and an electronic device for calculating a dot product for binary data, ternary data, non-binary data, and non-ternary data.

Another object of the embodiments of the present specification is to provide a fused bitwise data path for a binary neural network (BNN) model and a ternary neural network (TNN) model.

Another object of the embodiments of the present specification is to calculate the dot product for ternary data.

Another object of the embodiments of the present specification is to provide a fused bitwise data path to support dot product calculation for binary data and ternary data.

Another object of the embodiments of the present specification is to provide a full precision layer (FPL) data path for calculating a dot product between non-binary data or non-ternary data and binary data or ternary data.

Another object of the embodiments of the present specification is to distribute the dot product calculation for binary data and ternary data, and the dot product between one of the non-binary data and the non-ternary data and one of the binary data and the ternary data, to the fused data path and the FPL data path.

Another object of the embodiments of the present specification is to configure an electronic device to process multiple data types while achieving greater power efficiency and area efficiency.

Another object of the embodiments of the present specification is to provide an electronic device that supports full precision for the first layer as well as bitwise dot product operations for hidden layers on binary data or ternary data.

Another object of the embodiments of the present specification is to process ternary data in a bitwise manner.

Another object of the embodiments of the present specification is to combine the individual data paths required for BNNs operating on two different data types (e.g., {0,1} and {-1,+1}) into a single fused data path.

According to one aspect, a method of calculating a dot product for binary data, ternary data, non-binary data, and non-ternary data includes: designing, by an electronic device, the calculation of a dot product for the ternary data; designing, by the electronic device, a fused bitwise data path for supporting dot product calculation for the binary data and the ternary data; designing, by the electronic device, a full precision layer (FPL) data path for calculating a dot product between one of the non-binary data and the non-ternary data and one of the binary data and the ternary data; and distributing, by the electronic device, the dot product calculation for the binary data and the ternary data, and the dot product between one of the non-binary data and the non-ternary data and one of the binary data and the ternary data, to the fused data path and the FPL data path.

According to another aspect, an electronic device for calculating a dot product for binary data, ternary data, non-binary data, and non-ternary data includes: a static random access memory (SRAM); at least one controller configured to transmit an address to the SRAM; a processing engine (PE) array; and a dispatcher configured to receive at least one SRAM data and deliver the at least one SRAM data to the PE array. The PE array includes a PE array controller, an output feature map (OFM) combiner, and a fused data path engine, and the fused data path engine is configured to provide a fused data path for a binary neural network (BNN) model and a ternary neural network (TNN) model.

FIG. 1 shows an example scenario for determining the dot product for the BNN_A data type.
FIG. 2A shows an example scenario for determining the dot product for the BNN_B data type based on an arithmetic representation.
FIG. 2B shows an example scenario for determining the dot product for the BNN_B data type based on a bitwise implementation.
FIG. 3A is a flowchart illustrating a method of performing a MAC operation on binary data belonging to the BNN_A data type.
FIG. 3B is a flowchart illustrating a method of performing a MAC operation on binary data belonging to the BNN_B data type.
FIG. 4 shows an example scenario in which signed ternary data is converted into a 2-bit representation.
FIG. 5A shows a scenario for determining the dot product for the TNN data type based on an arithmetic representation, according to an embodiment.
FIG. 5B shows a scenario for determining the dot product for the TNN data type based on a bitwise implementation, according to an embodiment.
FIG. 6 is a flowchart illustrating a method of performing a MAC operation on ternary data belonging to the TNN data type, according to an embodiment.
FIG. 7A shows an overview of an electronic device for calculating a dot product for binary data, ternary data, non-binary data, and non-ternary data, according to an embodiment.
FIG. 7B shows an overview of a PE array of the electronic device, according to an embodiment.
FIG. 8 is a block diagram of an electronic device for calculating a dot product for binary data, ternary data, non-binary data, and non-ternary data, according to an embodiment.
FIG. 9A is a block diagram of a fused data path engine for performing MAC operations, according to an embodiment.
FIG. 9B shows a TNN model used by the fused data path engine to perform a MAC operation on ternary data in a ternary mode, according to an embodiment.
FIG. 9C shows a BNN model used by the fused data path engine to perform a MAC operation on binary data in a binary mode, according to an embodiment.
FIG. 10 is a schematic diagram of a traversal in a loop format, according to an embodiment.
FIG. 11A is a schematic diagram of a data path of a PE, according to an embodiment.
FIG. 11B is a flowchart illustrating a method of calculating a dot product for binary data, ternary data, non-binary data, and non-ternary data using an electronic device, according to an embodiment.
FIG. 12A is a schematic diagram illustrating the steps performed by a dispatcher for data delivery using the full precision layer data path, according to an embodiment.
FIG. 12B shows a data distribution scheme for the fused bitwise data path.

The embodiments of the present specification and their various features and details are described more fully with reference to the non-limiting embodiments illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as not to unnecessarily obscure the embodiments of the present specification. Further, since some embodiments may be combined with one or more other embodiments to form new embodiments, the various embodiments described herein are not necessarily mutually exclusive. As used herein, the term "or" is used non-exclusively unless otherwise specified. The examples used herein are intended merely to facilitate an understanding of the ways in which the embodiments may be practiced and to enable those skilled in the art to practice them. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.

Embodiments may be described and illustrated in terms of blocks that perform the described function or functions. These blocks, which may be referred to herein as managers, units, modules, hardware components, and the like, are physically implemented by analog and/or digital circuits such as logic gates, integrated circuits, microprocessors, microcontrollers, memory circuits, passive electronic components, active electronic components, optical components, and hardwired circuits, and may optionally be driven by firmware. For example, the circuits may be implemented on one or more semiconductor chips or on a substrate support such as a printed circuit board. The circuits constituting a block may be implemented by dedicated hardware, by a processor (e.g., one or more programmed microprocessors and associated circuitry), or by a combination of dedicated hardware that performs some functions of the block and a processor that performs other functions of the block. Each block of the embodiments may be physically separated into two or more interacting and discrete blocks without departing from the scope of the present disclosure. Likewise, the blocks of the embodiments may be physically combined into more complex blocks without departing from the scope of the present disclosure.

The dot product is the computation underlying multiply-and-accumulate (MAC) operations in neural networks. Performing MAC operations on higher-precision data requires a large amount of expensive hardware multiplication logic. For extremely low-precision data such as binary data and ternary data, however, inexpensive hardware logic that performs bitwise operations is sufficient. The dot product of a pair of vectors is calculated by multiplying the vectors element by element and then summing all of the products, i.e., sum(Ai x Bi), where A and B are a pair of data vectors of length L, and Ai and Bi are the i-th elements of vector A and vector B, respectively. In one example, the elements of A and B may be real or complex numbers. Binary data can be of two types: BNN_A ({0,1}) and BNN_B ({-1,+1}). Both binary data types can be represented using 1 bit. In the case of BNN_B, a bit value of 0 may represent -1 and a bit value of 1 may represent +1. For clarity in the description that follows, the methodology is described in the context of neural networks. In the present disclosure, two input vectors are used for the dot product operation: an input feature map (IFM) vector and a weight or kernel (W) vector. "Data", "data vector", "IFM vector", "bit vector", and "IFM data" are used interchangeably in this disclosure and have the same meaning.

FIG. 1 shows an example scenario for determining the dot product for the BNN_A data type.

By the definition of the dot product, the final result is obtained from the element-wise multiplication of the W vector and the IFM vector. Multiplication hardware occupies a large area and has a high power overhead. For {0,1} data, the "W AND IFM" vector is identical to the element-wise product of the W vector and the IFM vector. Therefore, instead of multiplication hardware, an element-wise AND (bitwise AND) operation on the W vector and the IFM vector is used. The final result is obtained by counting the number of 1s in the "W AND IFM" vector. In one example, the dot product between the W vector and the IFM vector is 3, as follows.

Dot product between W and IFM

= number of 1s in the "W AND IFM" bit vector

= popcount(W AND IFM)

= 3
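The AND-and-popcount computation for BNN_A data can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `bnn_a_dot`, the packing of each vector into a Python integer (one bit per element), and the two 8-element vectors are assumptions chosen so that the result is 3; they are not the vectors drawn in FIG. 1.

```python
def bnn_a_dot(w_bits: int, ifm_bits: int) -> int:
    """Dot product of two {0,1} vectors packed one element per bit."""
    anded = w_bits & ifm_bits        # element-wise product for {0,1} data
    return bin(anded).count("1")     # popcount = number of 1s


# Hypothetical 8-element vectors (assumed for illustration only).
w = 0b10110110
ifm = 0b11010011

print(bnn_a_dot(w, ifm))  # 3
```

Because each product of two {0,1} values is 1 only when both operands are 1, no multiplier is needed.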

FIG. 2A shows an example scenario for determining the dot product for the BNN_B data type based on an arithmetic representation.

The W vector and the IFM vector are expressed in the BNN_B data type. The "W X IFM" vector represents the element-wise multiplication of the W vector and the IFM vector. The final result of the dot product is obtained by subtracting B from A, where A is the number of "1" elements in the "W X IFM" vector and B is the number of "-1" elements in the "W X IFM" vector. In one example, the final result is -2. The number of 1s in "W X IFM" is denoted #1, and the number of -1s in "W X IFM" is denoted #-1.

Dot product between W and IFM

= #1 + (#-1) * -1

= 3 - 5

= -2

FIG. 2B shows an example scenario for determining the dot product for the BNN_B data type based on a bitwise implementation.

FIG. 2B yields the same final result (i.e., -2) as the multiplier-based computation. Each element of the W vector and the IFM vector is represented by 1 bit. The bitwise dot product operation is performed in steps a) to c).

a) Perform a bitwise XNOR operation (W XNOR IFM).

b) Count the number of 1s in the "W XNOR IFM" vector. Counting the 1s in a bit vector is called POPCOUNT, and the resulting value is denoted v.

c) If the length of the W and IFM bit vectors is L, the dot product is (2 X v) - L, which can be expressed as (v << 1) - L, where "<<" denotes a bitwise left shift.

In one example, step c) gives a result of -2, since v is 3 and L is 8. The dot product calculated in the bitwise manner is therefore identical to the one obtained from the definition of the dot product. Steps a) to c) above can be drawn as the flowchart of FIG. 3B. In FIG. 3B, the output of step B304 is added in step B305; this addition corresponds to the accumulation in the MAC operation.

Dot product between W and IFM

= #1 + (#0) * -1

= #1 - #0

= #1 - (BIT_VECTOR_LENGTH - #1)

= 2 * (#1) - BIT_VECTOR_LENGTH

= (popcount << 1) - BIT_VECTOR_LENGTH

= (3 << 1) - 8

= -2
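A minimal Python sketch of the XNOR-based computation follows, checked against the arithmetic definition. The function names, the bit packing (element i stored in bit i, with bit 0 meaning -1 and bit 1 meaning +1), and the example vectors are assumptions made for illustration; the vectors were chosen so that the popcount is 3 and the length is 8, matching the numbers in the derivation above, but they are not taken from FIG. 2B.

```python
def bnn_b_dot(w_bits: int, ifm_bits: int, length: int) -> int:
    """Bitwise dot product of two {-1,+1} vectors packed one element per bit."""
    full = (1 << length) - 1
    xnor = ~(w_bits ^ ifm_bits) & full   # 1 where the two signs agree
    v = bin(xnor).count("1")             # popcount
    return (v << 1) - length             # 2*v - L


def decode_bnn_b(bits: int, length: int) -> list[int]:
    """Unpack bits into {-1,+1} values (bit 0 -> -1, bit 1 -> +1)."""
    return [1 if (bits >> i) & 1 else -1 for i in range(length)]


def arithmetic_dot(a: list[int], b: list[int]) -> int:
    """Reference dot product using multiplications."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 8-element vectors (assumed for illustration only).
w, ifm, L = 0b10110110, 0b01001110, 8

print(bnn_b_dot(w, ifm, L))                                        # -2
print(arithmetic_dot(decode_bnn_b(w, L), decode_bnn_b(ifm, L)))    # -2
```

Masking with `full` is needed in Python because `~` on an unbounded integer would otherwise set bits beyond the vector length.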

FIG. 3A is a flowchart illustrating a method of performing a MAC operation on binary data belonging to the BNN_A data type.

In step A301, a bitwise AND operation is performed between the W vector (bit vector W) and the IFM vector (bit vector IFM). In step A302, the popcount of the bitwise AND result is determined, which gives the dot product value between the IFM vector and the W vector. In step A303, the popcount is added to the accumulated value; the MAC operation is complete after step A303.

FIG. 3B is a flowchart illustrating a method of performing a MAC operation on binary data belonging to the BNN_B data type.

In step B301, a bitwise XNOR operation is performed between the W vector (bit vector W) and the IFM vector (bit vector IFM). In step B302, the popcount of the bitwise XNOR result is determined. In step B303, the popcount is shifted left by one bit. In step B304, the length of the IFM vector or the W vector (BIT_VECTOR_LENGTH) is subtracted from the left-shifted popcount to determine the dot product value between the IFM vector and the W vector. In step B305, the output of step B304 is added to the accumulated value; the MAC operation is complete after step B305.
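The two flowcharts can be sketched as accumulation loops over several vector pairs. This is a hedged illustration, not the circuit described in the figures: the names `mac_bnn_a` and `mac_bnn_b`, the list-of-pairs input, and the example data are all assumptions, and the comments map each statement to the step it is meant to mirror.

```python
def popcount(x: int) -> int:
    return bin(x).count("1")


def mac_bnn_a(pairs: list[tuple[int, int]]) -> int:
    """Accumulate BNN_A dot products over (W, IFM) bit-vector pairs."""
    acc = 0
    for w, ifm in pairs:
        anded = w & ifm           # step A301: bitwise AND
        count = popcount(anded)   # step A302: popcount = dot product
        acc += count              # step A303: accumulate
    return acc


def mac_bnn_b(pairs: list[tuple[int, int]], length: int) -> int:
    """Accumulate BNN_B dot products over (W, IFM) bit-vector pairs."""
    acc = 0
    full = (1 << length) - 1
    for w, ifm in pairs:
        xnor = ~(w ^ ifm) & full  # step B301: bitwise XNOR
        v = popcount(xnor)        # step B302: popcount
        shifted = v << 1          # step B303: left shift (multiply by 2)
        dot = shifted - length    # step B304: subtract BIT_VECTOR_LENGTH
        acc += dot                # step B305: accumulate
    return acc


# Hypothetical data: two 8-bit (W, IFM) pairs, assumed for illustration.
pairs = [(0b10110110, 0b11010011), (0b10110110, 0b01001110)]

print(mac_bnn_a(pairs))       # 3 + 2 = 5
print(mac_bnn_b(pairs, 8))    # 0 + (-2) = -2
```

The only difference between the two loops is the bitwise operator and the shift-and-subtract correction, which is what makes a single fused data path plausible.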

FIG. 4 shows an example scenario in which signed ternary data is converted into a 2-bit representation.

In one example, the vector of ternary data is (a) of FIG. 4. Using Table 2, each ternary value is represented in 2 bits: 1 bit for the mask and 1 bit for the value. The mask bit indicates whether the ternary value is non-zero (the mask bit is 0 when the value is 0). The value bit is 1 when the ternary value is greater than 0, and 0 otherwise (i.e., when the value is 0 or -1).

Table 2

Data    Mask (m)    Value (v)    2-bit representation (msb: m, lsb: v)
0       0           0            00
-1      1           0            10
+1      1           1            11

In this way, the ternary vector can be represented as the bit vector shown in (b) of FIG. 4. Parts (c) and (d) of FIG. 4 show the mask vector and the value vector, respectively, extracted from the ternary bit vector (b). In this representation, "01" (i.e., mask bit 0 and value bit 1) is not allowed.

Preferred embodiments are shown in FIGS. 5A to 12B and are described below with reference to FIGS. 5A to 12B together.
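The 2-bit mask/value encoding of Table 2 can be sketched as follows. The function names and the bit ordering (element i stored in bit i of each packed integer) are assumptions made for illustration; only the mapping 0 -> 00, -1 -> 10, +1 -> 11 and the exclusion of "01" come from the text.

```python
# (mask, value) bits for each ternary value, as listed in Table 2.
ENCODE = {0: (0, 0), -1: (1, 0), +1: (1, 1)}


def encode_ternary(values: list[int]) -> tuple[int, int]:
    """Pack ternary values into a mask bit vector and a value bit vector."""
    mask = 0
    value = 0
    for i, t in enumerate(values):
        m, v = ENCODE[t]
        mask |= m << i
        value |= v << i
    return mask, value


def decode_ternary(mask: int, value: int, length: int) -> list[int]:
    """Unpack mask/value bit vectors back into ternary values."""
    out = []
    for i in range(length):
        m = (mask >> i) & 1
        v = (value >> i) & 1
        if m == 0 and v == 1:
            raise ValueError("code '01' is not allowed in this representation")
        out.append(0 if m == 0 else (1 if v else -1))
    return out


# Hypothetical ternary vector, assumed for illustration.
data = [0, -1, 1, 1, 0, -1]
mask, value = encode_ternary(data)
print(bin(mask), bin(value))            # 0b101110 0b1100
print(decode_ternary(mask, value, 6))   # [0, -1, 1, 1, 0, -1]
```

Keeping the mask and value bits in two separate vectors lets each be fed to a single bitwise operation.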

FIG. 5A shows a scenario for determining the dot product for the TNN data type based on an arithmetic representation, according to an embodiment.

The operands of the ternary neural network (TNN) data type take values in {-1, 0, 1}. Consider a W vector (W) and an IFM vector (IFM) expressed in the TNN data type, as shown in FIG. 5A. The dot product of the W vector and the IFM vector is determined by performing an element-wise multiplication of the two vectors and then summing the elements of the result. The "W X IFM" vector represents the element-wise multiplication of W and IFM. The final result of the dot product for ternary data is A minus B, where A is the number of "1" elements in the "W X IFM" vector and B is the number of "-1" elements in the "W X IFM" vector. In one example, the final result is -2. The number of 1s in "W X IFM" is denoted #1, and the number of -1s in "W X IFM" is denoted #-1.

Dot product between W and IFM

= #1 + (#-1) * -1

= 2 - 4

= -2

FIG. 5B shows a scenario for determining the dot product for the TNN data type based on a bitwise implementation, according to an embodiment. FIG. 5B also gives an example of the dot product operation on ternary data.

Following Table 2, the W vector (W) and the IFM vector (IFM) are converted into the 2-bit representation. The mask vector (m1) and the value vector (v1) for W are shown in (a) of FIG. 5B. The mask vector (m2) and the value vector (v2) for IFM are shown in (b) of FIG. 5B. A bitwise AND operation between the m1 vector and the m2 vector produces the m3 vector. A bitwise XNOR operation between the v1 vector and the v2 vector produces the v3 vector. A bitwise AND operation between the m3 vector and the v3 vector is then performed, and the dot product is determined from the popcounts, where popcount is the number of 1s in a vector. The final result obtained with the bitwise implementation is the same as that of the multiplication-based approach described for FIG. 5A.

Dot product between W and IFM

= (2 * popcount(m3 AND v3)) - popcount(m3)

= (popcount(m3 AND v3) << 1) - popcount(m3)

= 4 - 6

= -2
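The reason the formula above works can be written out as a short derivation. The symbols n and p are introduced here only for illustration and do not appear in the figures: n counts the positions where both operands are non-zero, and p counts those positions where the product is +1.

```latex
\begin{aligned}
n &= \operatorname{popcount}(m_3), \qquad p = \operatorname{popcount}(m_3 \wedge v_3),\\
W \cdot \mathrm{IFM} &= (+1)\,p + (-1)\,(n - p) = 2p - n = (p \ll 1) - n,\\
\text{example: } & p = 2,\; n = 6 \;\Rightarrow\; (2 \ll 1) - 6 = -2.
\end{aligned}
```

Positions where either operand is 0 contribute nothing, which is why n rather than the full vector length is subtracted.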

도 6은 일 실시 예에 따른 TNN 데이터 유형에 속하는 삼진 데이터에 대해 MAC 연산을 수행하는 방법을 나타내는 흐름도이다.6 is a flowchart illustrating a method of performing a MAC operation on ternary data belonging to a TNN data type according to an embodiment.

In step 601, a bitwise AND operation is performed between m1 and m2. In step 602, the output of step 601 is used to determine which ternary data are non-zero after the element-wise operation. In step 603, a bitwise XNOR operation is performed between v1 and v2. The output of step 603 is the result of the element-wise operation between the value bit vectors. Performing only steps a) to c), as for the BNN_B data type, can yield an incorrect dot product, because a value bit of 0 in the ternary representation denotes either zero or a negative value. Consider a scenario in which two ternary data, d_a = 00 and d_b = 00, are used in the dot product. In this case, the bitwise XNOR of the value bits of d_a and d_b yields 1. If this value were counted, the final dot product would be 1 instead of 0, which is incorrect. Therefore, an additional AND operation with the mask must be performed so that only non-zero values are considered. Accordingly, in step 604, a bitwise AND operation is performed between the outputs of steps 601 and 603. The output of step 604 retains only the non-zero values after the bitwise operation between v1 and v2. In step 605, the number of 1s in the vector generated as the output of step 604 is counted. Left-shifting the output of step 605 by one bit (step 606) is equivalent to multiplying it by 2. Subtracting the length of the bit vector of the ternary data from the output of step 606 would give an incorrect result, because the output of step 604 may include zero data. Therefore, in step 607, in order to keep the dot product result produced only by non-zero data, the number of non-zero values (the output of step 602) is subtracted from the output of step 606 instead of the vector length. The output of step 607 is the dot product of the two ternary vectors. To perform the accumulation, the dot product of the two ternary vectors is added in step 608.
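The steps of FIG. 6 can be sketched in software as follows. This is a minimal illustration, not the claimed hardware: the function names `popcount`, `ternary_dot`, and `pack` are hypothetical, and the encoding (mask bit, value bit) = 00 for 0, 10 for -1, and 11 for +1 is an assumption inferred from the statement that a value bit of 0 denotes zero or a negative value.

```python
def popcount(x: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def pack(values):
    """Pack a list of -1/0/+1 into (mask, value) bit vectors, element i at bit i.

    Assumed encoding: mask bit = 1 for non-zero elements, value bit = 1 for +1.
    """
    m = v = 0
    for i, x in enumerate(values):
        if x != 0:
            m |= 1 << i
        if x > 0:
            v |= 1 << i
    return m, v


def ternary_dot(m1: int, v1: int, m2: int, v2: int, n: int) -> int:
    """Dot product of two n-element ternary vectors given as mask/value bit vectors."""
    width = (1 << n) - 1
    nonzero = m1 & m2                   # step 601: both elements non-zero
    nonzero_count = popcount(nonzero)   # step 602: number of non-zero products
    same_sign = ~(v1 ^ v2) & width      # step 603: bitwise XNOR of the value bits
    plus_one = nonzero & same_sign      # step 604: drop XNOR hits on zero elements
    ones = popcount(plus_one)           # step 605: number of +1 products
    return (ones << 1) - nonzero_count  # steps 606-607: 2 * ones - non-zero count


a = [1, -1, 0, 1, -1, 0]
b = [1, 1, -1, 0, -1, 0]
m1, v1 = pack(a)
m2, v2 = pack(b)
acc = 0
acc += ternary_dot(m1, v1, m2, v2, len(a))  # step 608: accumulate
print(acc)  # 1
```

Among the non-zero products, `ones` of them equal +1 and `nonzero_count - ones` equal -1, so the sum is `2 * ones - nonzero_count`, which is why the non-zero count rather than the vector length is subtracted.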

The correctness of the dot product operation on ternary data is checked with an example. The final vector after m3 AND v3 contains two 1s. The number of 1s in m3 is 6. Therefore, (2 << 1) - 6 = -2, which equals the dot product obtained by the element-wise calculation shown in FIG. 5A.

FIG. 7A shows an overview of an electronic device for calculating a dot product for binary data, ternary data, non-binary data, and non-ternary data, according to an embodiment.

Examples of the electronic device 1000 include, but are not limited to, a smartphone, a tablet computer, a personal computer, a desktop computer, a personal digital assistant (PDA), a multimedia device, an Internet of Things (IoT) device, and the like. The electronic device 1000 is a hardware accelerator that performs inference on extremely low-precision quantized neural networks, namely BNNs and TNNs. In an embodiment, the electronic device 1000 is connected to a host central processing unit (CPU) 2000 to receive commands for performing layer operations such as convolution, pooling, and fully connected operations. In an embodiment, the electronic device 1000 is connected to a dynamic random-access memory (DRAM) 3000 to receive and store data vectors.

In an embodiment, the electronic device 1000 includes a controller 100, a dispatcher 200, a static random-access memory (SRAM) 300 (e.g., an on-chip SRAM), and a processing engine (PE) array 400. The controller 100 controls the PE array 400 and the dispatcher 200. The dispatcher 200 is coupled to the controller 100 and the on-chip SRAM 300. The controller 100 is configured to send addresses to the SRAM 300. The dispatcher 200 is configured to receive SRAM data and deliver the SRAM data to the PE array 400. In an embodiment, the dispatcher 200 is configured to deliver data to one of two data paths of each PE of the PE array 400, namely a full precision layer (FPL) data path and a fused bitwise (BW) data path. In an embodiment, the FPL data path performs a dot product between a full precision (non-ternary and non-binary) IFM data vector and either a binary or a ternary kernel data vector. In an embodiment, the output of one of the FPL data path and the BW data path is selected for accumulation based on the layer type, i.e., a full precision layer or a hidden layer. In an embodiment, the fused data path performs a dot product between one pair of vectors (i.e., IFM and kernel) for ternary data, and between two pairs of vectors (i.e., IFM and kernel) for binary data.

In an embodiment, the PE array 400 includes a PE array controller 401, an output feature map (OFM) combiner 402, and a plurality of fused data path engines 403. The fused data path engine 403 is configured to provide a fused data path for the BNN model and the TNN model in a full precision mode or a bitwise mode. In an embodiment, the dot product performs the calculation for the MAC operation in the BNN model and the TNN model. In an embodiment, the fused data path engine 403 is configured to support at least one of the binary data used by the BNN model and the ternary data used by the TNN model. In an embodiment, the fused data path engine 403 is configured to combine the FPL data path and the fused bitwise data path into a single fused data path. In an embodiment, the fused data path engine 403 is configured to process ternary data in a bitwise manner. In an embodiment, the fused data path engine 403 is configured to combine the data paths for the BNN model and the TNN model into a single fused data path. In an embodiment, the BNN model is used to calculate the dot product for a pair of binary data, and the TNN model is used to calculate the dot product for ternary data.

In another embodiment, the PE array 400 includes a plurality of PEs (i.e., PE 0 to PE N-1), a PE array controller 401, and an OFM combiner 402, where each PE includes a fused data path engine 403, an adder (denoted "+"), an accumulator (denoted ACC), and a comparator (denoted ">=0"). The fused data path engine 403 provides a fused data path by combining the individual data paths for the BNN model and the TNN model. In an embodiment, the PE array 400 performs MAC operations on the data provided by the dispatcher 200. Each PE performs the MAC operation, which is the core of neural network inference. The MAC operation consists of two sub-operations: multiplication (or dot product calculation) and accumulation. The PE supports dot products on several data types, namely the 1-bit binary data used in BNNs and the 2-bit ternary data used in TNNs. In BNNs and TNNs, the input data of the first layer tends to be kept at higher precision (e.g., 8 bits), while the weights (kernels) are kept at low precision (e.g., 1 bit or 2 bits). To reflect this tendency, the PE also supports dot products in which one operand has high precision and the other operand has low (i.e., binary or ternary) precision.

In an embodiment, the PE array 400 is designed as a two-dimensional array of single instruction multiple data (SIMD) based PEs in order to share input data and increase the throughput of the electronic device 1000. The PE array 400 can perform multiple dot product calculations in parallel. In the context of a neural network, two inputs, the IFM and the kernel, are used to generate an output (e.g., the OFM). The IFM is shared across the columns of the PE array 400, while the kernels are shared across the rows of the PE array 400. All input data are stored in staging registers of the PE array 400. The dispatcher 200 delivers the input data from the staging registers to each PE of the PE array 400 according to the mode selection (i.e., the data path selection, fused bitwise or FPL), without additional data storage overhead.

In an embodiment, the host CPU 2000 provides the required commands to the controller 100, and the controller 100 establishes communication between the host CPU 2000 and the electronic device 1000. In an embodiment, the controller 100 handles the loop traversal for generating the OFM tensor. During the loop traversal, the controller 100 generates SRAM requests using two controllers (i.e., an IFM controller and a kernel controller) and sends the addresses of the data vectors to the SRAM 300. The controller 100 also sends a command to the dispatcher 200 to deliver the data vectors from the SRAM 300 to the plurality of PEs. The dispatcher 200 delivers the data vectors received from the SRAM 300 to the PE array 400 based on the data path selection (i.e., FPL or fused bitwise) so that each PE can perform the MAC operation. The fused data path engine 403 of the PE receives the data vectors from the dispatcher 200. The controller 100 provides a command to the PE array controller 401 to send a mode signal to the plurality of PEs. In response to receiving the command from the controller 100, the PE array controller 401 sends the mode signal to the plurality of PEs to select an operation mode (i.e., mode selection). In an embodiment, the operation mode is one of a binary operation mode and a ternary operation mode. The fused data path engine 403 of the PE performs the MAC operation on the data vectors based on the operation mode for the dot product calculation.

In an embodiment, the fused data path engine 403 receives one of ternary data and binary data in order to calculate the dot product for the binary data or the ternary data. The fused data path engine 403 determines the operation mode for the ternary data and the binary data. Based on the determined operation mode, the fused data path engine 403 processes at least one of the ternary data and the binary data in the XNOR gates and the AND gates using at least one popcount logic. The accumulator then receives at least one of the processed ternary data and the processed binary data. The final result of the MAC operation is stored in the accumulator. The value of the accumulator is further compared with a threshold, and the output value is converted to a final ternary value or a final binary value. The PE provides its data output in the form 00, 10, or 11 to the OFM combiner 402. The OFM combiner 402 writes the data to the on-chip SRAM 300. The PE array controller 401 sends a done_pe signal to the controller 100 in response to completing the MAC operations required by a layer of the BNN or TNN. The controller 100 sends a completion signal to the host CPU 2000 when the operations of all layers are complete.

Although FIG. 7A shows the hardware components of the electronic device 1000, it should be understood that other embodiments are not limited thereto. In other embodiments, the electronic device 1000 may include fewer or more components. In addition, the labels or names of the components are used for illustrative purposes only and do not limit the scope of the invention. One or more components may be combined to perform the same or a substantially similar function for calculating the dot product.

FIG. 7B shows an overview of a PE array of an electronic device, according to an embodiment.

A PE array 400 of dimension M × N is shown in FIG. 7B. Here, M PEs are present in each column of the PE array 400 and N PEs are present in each row of the PE array 400. In one example, PE(0, 5) denotes the PE in the 0th row and 5th column of the PE array 400. In another example, PE(3, 2) denotes the PE in the 3rd row and 2nd column of the PE array 400. At the left edge of the PE array 400, M registers are used to store different IFM data. At the top edge of the PE array 400, N registers are used to store different kernel data. To maximize data sharing among the PEs, an IFM vector is shared among the PEs in one row of the PE array 400, and a kernel vector is shared among the PEs in one column of the PE array 400. Each PE produces one OFM pixel. In an embodiment, the PE array 400 performs a plurality of MAC operations in parallel on the data provided by the dispatcher 200.
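The sharing pattern of FIG. 7B can be illustrated with a short sketch in which PE(i, j) combines the i-th IFM register (left edge) with the j-th kernel register (top edge). The names `dot` and `pe_array_pass` are hypothetical, vectors are plain Python lists rather than packed bit vectors, and the hardware computes all entries in parallel whereas this sketch loops over them.

```python
def dot(a, b):
    """Element-wise dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def pe_array_pass(ifm_regs, kernel_regs):
    """Model one pass of an M x N PE array.

    ifm_regs:    M vectors held at the left edge, each shared along one row.
    kernel_regs: N vectors held at the top edge, each shared along one column.
    Returns an M x N grid where entry [i][j] is the value produced by PE(i, j).
    """
    return [[dot(ifm, kernel) for kernel in kernel_regs] for ifm in ifm_regs]


ifm_regs = [[1, -1, 0, 1], [0, 1, 1, -1]]                      # M = 2
kernel_regs = [[1, 1, 1, 1], [1, -1, 0, 0], [0, 0, -1, 1]]     # N = 3
print(pe_array_pass(ifm_regs, kernel_regs))  # [[1, 2, 1], [1, -1, -2]]
```

With this arrangement only M + N vectors are fetched to feed M × N dot products, which is the data reuse the paragraph above describes.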

FIG. 8 is a block diagram of an electronic device for calculating a dot product for binary data, ternary data, non-binary data, and non-ternary data, according to an embodiment.

In an embodiment, the electronic device 1000 includes a controller 100, a dispatcher 200, an IFM memory space 301, a kernel memory space 302, an OFM memory space 303, a PE array 400, a pooling controller (PC) 500, and a transform engine (TE) 600. In an embodiment, the on-chip SRAM 300 includes the IFM memory space 301, the kernel memory space 302, and the OFM memory space 303. The PE array 400 includes a plurality of PEs and the dispatcher 200, where each PE includes an FPL data path and a fused bitwise data path. The host CPU 2000 is connected to the controller 100 of the electronic device 1000. The DRAM 3000 is connected to the IFM memory space 301, the kernel memory space 302, and the OFM memory space 303 of the electronic device 1000.

The host CPU 2000 provides the controller 100 with the commands required to perform layer operations (e.g., convolution, pooling, fully connected). The controller 100 establishes communication among the other components. During the loop traversal, the controller 100 generates SRAM requests using two controllers (i.e., an IFM controller and a kernel controller). The controller 100 uses a buffer (Buff) to deliver the data received from the kernel memory of the kernel memory space 302 and the IFM memory of the IFM memory space 301 to the dispatcher 200. The dispatcher 200 distributes the IFM data and the kernel data to the plurality of PEs. In an embodiment, the dispatcher 200 distributes the data to a number of staging registers at the left edge and the top edge of the PE array 400, which store the IFM data and the kernel data, respectively. The dispatcher 200 distributes IFM data that is shared among the PEs in different columns of a given row. The dispatcher 200 distributes kernel data that is shared among the PEs in different rows of a given column. Each PE performs the MAC operation on the data received from the dispatcher 200. The PE array 400 sends its output (i.e., the ACC vector) to the TE 600, which performs tasks such as batch normalization and comparison with a threshold. The TE 600 writes its output back to OFM memory 1 of the OFM memory space 303. The PC 500 reads data (i.e., the OFM vector) from OFM memory 1 and performs the pooling operation. The PC 500 writes the final data (i.e., the OFM vector after pooling) to OFM memory 2 of the OFM memory space 303. The data of OFM memory 1 and OFM memory 2 are written back to the DRAM 3000. The controller 100 sends a completion signal to the host CPU 2000 when the operations of all layers are complete.

FIG. 9A is a block diagram of a fused data path engine for performing a MAC operation, according to an embodiment.

In an embodiment, the fused data path engine includes three AND gates (AND1-AND3), two exclusive NOR (XNOR) gates (XNOR1-XNOR2), six multiplexers (MUX1-MUX6), two subtractors (SUB1-SUB2), two pop counters (POPCOUNT1-POPCOUNT2), two adders (ADD1-ADD2), two accumulators (ACC and ACC_D), and a comparator (COMPARATOR).

The first XNOR gate (XNOR1) receives one bit of the first vector (BIT VECTOR 1) and one bit of the second vector (BIT VECTOR 2) and generates a product vector of length N bits as its output. The first AND gate (AND1) receives one bit of the first vector and one bit of the second vector and generates a product vector of length N bits as its output. The second XNOR gate (XNOR2) receives one bit of the third vector (BIT VECTOR 3) and one bit of the fourth vector (BIT VECTOR 4) and generates a product vector of length N bits as its output. The second AND gate (AND2) receives one bit of the third vector and one bit of the fourth vector and generates a product vector of length N bits as its output.

The output of the first XNOR gate (XNOR1) and the output of the first AND gate (AND1) are provided as inputs to the first multiplexer (MUX1). The output of the first AND gate (AND1) and the output of the second XNOR gate (XNOR2) are provided as inputs to the third AND gate (AND3). The second multiplexer (MUX2) receives as inputs the output of the second XNOR gate (XNOR2), the output of the second AND gate (AND2), and the output of the third AND gate (AND3). The output of the second multiplexer (MUX2) contains either only the non-zero elements of the dot product operation on two ternary vectors, or all elements of the dot product operation on two binary vectors (BIT VECTOR3 and BIT VECTOR4). The output of the first multiplexer (MUX1) is fed to the first pop counter (POPCOUNT1), which counts the number of 1s in the output of the first XNOR gate (XNOR1) or the first AND gate (AND1); this count is passed to the third multiplexer (MUX3) and the fourth multiplexer (MUX4). The third multiplexer (MUX3) receives the first bit length (BIT LENGTH 1) and the input from the first multiplexer (MUX1) via the first pop counter (POPCOUNT1). The second pop counter (POPCOUNT2) counts the number of 1s in the output of the second multiplexer (MUX2), and this count is passed to the fourth multiplexer (MUX4). The output of the second pop counter (POPCOUNT2) is also shifted left by one bit.

The fourth multiplexer (MUX4) receives the second bit length (BIT LENGTH 2) and the input from the second multiplexer (MUX2) via the second pop counter (POPCOUNT2). In the first subtractor (SUB1), the output of the third multiplexer (MUX3) is subtracted from the left-shifted output of the first pop counter (POPCOUNT1). In the second subtractor (SUB2), the output of the fourth multiplexer (MUX4) is subtracted from the left-shifted output of the second pop counter (POPCOUNT2). The output of the second subtractor (SUB2) represents the dot product between two ternary vectors in the case of ternary data, and the dot product between two binary vectors (BIT VECTOR3 and BIT VECTOR4) in the case of binary data. The output of the first subtractor (SUB1) represents the dot product of BIT VECTOR1 and BIT VECTOR2 in the case of binary data; in the case of ternary data the output of the first subtractor (SUB1) is not used. The outputs of the second subtractor (SUB2) and the first subtractor (SUB1) are added in the first adder (ADD1); in the case of binary data, the output of the first adder (ADD1) represents the combined dot product result of two pairs of vectors (i.e., pair 1 is {BIT VECTOR1, BIT VECTOR2} and pair 2 is {BIT VECTOR3, BIT VECTOR4}). The fifth multiplexer (MUX5) selects the output of the second subtractor (SUB2) in the case of ternary data and the output of the first adder (ADD1) in the case of binary data. To perform the accumulation, the output of the fifth multiplexer (MUX5) is added to the first accumulator (ACC) in the second adder (ADD2). The output of the second adder is stored in the first accumulator (ACC). The output of the second adder is also compared with a threshold in the comparator (COMPARATOR) to generate an output value. In an embodiment, the second accumulator (ACC_D) stores the output of the second adder (ADD2), and the output value of the second adder (ADD2) is compared with the threshold of the comparator (COMPARATOR). In an embodiment, the sixth multiplexer (MUX6) receives inputs from the second adder and the second accumulator (ACC_D), and the comparator (COMPARATOR) receives its input from the second accumulator (ACC_D).
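The data path described above can be modelled in software as follows. This is an illustrative sketch under stated assumptions, not the circuit itself: the function name `fused_dot`, the mode strings, and the exact multiplexer selections in the binary modes are inferred from the description rather than quoted from it. In particular, for BNN_A the sketch assumes MUX3 and MUX4 pass the popcount itself to the subtractor, so that (p << 1) - p = p. XNOR outputs are masked to the given bit length, which is also an assumption.

```python
def popcount(x: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def fused_dot(mode, vec1, vec2, vec3, vec4, len1, len2):
    """Model of the fused bitwise data path up to MUX5.

    mode is "TNN", "BNN_A" ({0, 1} data) or "BNN_B" ({-1, +1} data).
    In "TNN" mode vec1/vec2 are mask vectors and vec3/vec4 are value vectors.
    """
    xnor1 = ~(vec1 ^ vec2) & ((1 << len1) - 1)
    and1 = vec1 & vec2
    xnor2 = ~(vec3 ^ vec4) & ((1 << len2) - 1)
    and2 = vec3 & vec4
    and3 = and1 & xnor2

    mux1 = xnor1 if mode == "BNN_B" else and1
    mux2 = {"TNN": and3, "BNN_A": and2, "BNN_B": xnor2}[mode]
    pop1 = popcount(mux1)
    pop2 = popcount(mux2)

    # Assumed selections: bit length for BNN_B, popcount for BNN_A,
    # non-zero count (POPCOUNT1) for ternary data on MUX4.
    mux3 = len1 if mode == "BNN_B" else pop1
    mux4 = {"TNN": pop1, "BNN_A": pop2, "BNN_B": len2}[mode]

    sub1 = (pop1 << 1) - mux3
    sub2 = (pop2 << 1) - mux4
    add1 = sub1 + sub2
    return sub2 if mode == "TNN" else add1  # MUX5


acc = 0
acc += fused_dot("TNN", 0b011011, 0b010111, 0b001001, 0b000011, 6, 6)  # ADD2 / ACC
print(acc)  # 1
```

In the ternary call, the masks 0b011011 and 0b010111 and the values 0b001001 and 0b000011 encode the vectors [1, -1, 0, 1, -1, 0] and [1, 1, -1, 0, -1, 0] with element i at bit i, whose dot product is 1.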

The hardware resources required for the MAC operation on ternary data can also be used for binary data. Since binary data is represented with 1 bit (instead of the 2 bits of ternary data), the mask and value vector pairs of the TNN are used as two independent vector pairs to perform dot product operations on binary data. For binary data, processing the two independent vector pairs produces one output each, so the throughput for a BNN is twice that for a TNN. The PE also supports two data types for the MAC operation, BNN_A ({0, 1}) and BNN_B ({-1, +1}). With appropriate control signals, the fused data path of the PE can support MAC operations on both binary and ternary data.

To support dot product operations in which one of the two inputs has high precision, an additional data path, the FPL data path, is included in the PE. For the FPL data path, one of the input data vectors consists of high-precision data and the other input vector consists of extremely low-precision (i.e., binary or ternary) data. The PE therefore has two data paths, the bitwise data path and the FPL data path, each with a separate dot product output. Depending on the data path selection, one of the two outputs (dot product results) is added to the accumulator to complete the MAC operation.
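The arithmetic performed on the FPL data path can be sketched as follows. The internals of the FPL data path are not described in this passage, so the sketch only illustrates the mathematics of a mixed-precision dot product; the names `fpl_dot` and `mac_step` and the list-based representation are hypothetical.

```python
def fpl_dot(ifm, kernel):
    """Dot product of a high-precision vector with a low-precision kernel.

    ifm:    list of integers (e.g. 8-bit activations).
    kernel: list of weights in {-1, 0, +1} (ternary) or {-1, +1} / {0, 1} (binary).
    No multiplier is needed: each element is added, subtracted, or skipped.
    """
    acc = 0
    for x, w in zip(ifm, kernel):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
    return acc


def mac_step(acc, use_fpl, fpl_out, bitwise_out):
    """Add the output of the selected data path to the accumulator."""
    return acc + (fpl_out if use_fpl else bitwise_out)


out = fpl_dot([12, 200, 7, 45], [1, -1, 0, 1])
print(mac_step(0, True, out, 0))  # -143
```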

The fused bitwise data path is one of the two data paths inside the PE. The hardware components used to calculate the dot product of ternary data can also be used for binary data, for example the POPCOUNT logic (for both BNN_A and BNN_B), the bitwise XNOR logic (for BNN_B), and the bitwise AND logic (for BNN_A). The outputs of the two POPCOUNT logics shown in FIG. 6 (steps 602 and 605) can be used separately, so that two independent pairs of binary vectors can be processed for the dot product operation. For example, in FIG. 6 the dot product for ternary data is calculated using the vectors m1, m2, v1, and v2. In the proposed method, for binary data the dot products of the vector pairs {m1, m2} and {v1, v2} are calculated in parallel. The two dot product results can be added together, and the sum can then be added in the accumulator. In this way, the throughput of the dot product calculation for binary data is twice that for ternary data.
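The reuse of the AND, XNOR, and POPCOUNT logic for binary data rests on two identities, which the sketch below checks against an element-wise computation. The function names are hypothetical, and the bit encoding for BNN_B (bit 1 for +1, bit 0 for -1) is an assumption.

```python
def popcount(x: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def bnn_a_dot(a: int, b: int) -> int:
    """{0, 1} data: the product is 1 only where both bits are 1."""
    return popcount(a & b)


def bnn_b_dot(a: int, b: int, n: int) -> int:
    """{-1, +1} data: p matching bits give +1, the other n - p give -1."""
    xnor = ~(a ^ b) & ((1 << n) - 1)
    return (popcount(xnor) << 1) - n


def check(n: int) -> bool:
    """Compare both identities with element-wise dot products for all n-bit inputs."""
    for a in range(1 << n):
        for b in range(1 << n):
            bits_a = [(a >> i) & 1 for i in range(n)]
            bits_b = [(b >> i) & 1 for i in range(n)]
            ref_a = sum(x * y for x, y in zip(bits_a, bits_b))
            ref_b = sum((2 * x - 1) * (2 * y - 1) for x, y in zip(bits_a, bits_b))
            if bnn_a_dot(a, b) != ref_a or bnn_b_dot(a, b, n) != ref_b:
                return False
    return True


# Two independent binary pairs, {m1, m2} and {v1, v2}, handled in one pass.
m1, m2, v1, v2 = 0b1010, 0b0110, 0b1111, 0b0111
total = bnn_a_dot(m1, m2) + bnn_a_dot(v1, v2)
print(check(4), total)  # True 4
```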

The microarchitecture of the fused bitwise data path is shown in FIG. 9A. FIG. 9A shows a data path with a vector length of 16; however, the proposed method can be extended to any vector length. The input data to the data path are listed below, and the detailed control signals are shown in FIG. 11A.

a) BIT VECTOR1

b) BIT VECTOR2

c) BIT VECTOR3

d) BIT VECTOR4

e) BIT LENGTH1

f) BIT LENGTH2

As shown in FIG. 9A, the four inputs a) to d) are 16-bit input vectors, and e) and f) are two additional inputs giving the bit vector lengths (i.e., how many bits of the 16-bit vectors are to be considered) for the input pairs {a, b} and {c, d}, respectively. The components MUX, ADD, SUB, AND, XNOR, and POPCOUNT in FIG. 9A and the subsequent figures (i.e., FIGS. 9B and 9C) are a multiplexer, an adder, a subtractor, bitwise AND logic, bitwise XNOR logic, and POPCOUNT logic, respectively. The number appended to each component name is the instance number (e.g., AND3 is the third instance of the bitwise AND logic). ACC and ACC_D are two accumulators. ACC is updated every cycle, and ACC_D is updated when all MAC operations for an output datum (i.e., OFM) are complete. FIGS. 9B and 9C show the active components of the fused data path (highlighted in dark gray) when operating on ternary and binary data, respectively.

FIG. 9B illustrates the TNN model used by the fused data path engine to perform a MAC operation on ternary data in ternary mode, according to an embodiment.

The first AND gate (AND1) receives the mask vectors (BIT VECTOR1, BIT VECTOR2) of two ternary vectors, each of length N. The second XNOR gate (XNOR2) receives the value vectors (BIT VECTOR3, BIT VECTOR4) of the two ternary vectors. The output of AND1 is provided as an input to the first multiplexer (MUX1). The output of AND1 and the output of XNOR2 are provided as inputs to the third AND gate (AND3). The second multiplexer (MUX2) receives the output of XNOR2 and the output of AND3 as inputs and selects the output of AND3. The output of MUX2 therefore marks the non-zero elements of the dot product. The fourth multiplexer (MUX4) receives the second bit length (BIT LENGTH2), the output of MUX2 via the second pop counter (POPCOUNT2), and the output of MUX1 via the first pop counter (POPCOUNT1).

The second pop counter (POPCOUNT2) counts the number of 1s in the output of the second multiplexer (MUX2), and this count is passed to the fourth multiplexer (MUX4). The output of POPCOUNT2 is also shifted left by 1 bit. The first pop counter (POPCOUNT1) counts the non-zero elements in the dot product operation, and this count is also passed to MUX4. MUX4 selects the output of POPCOUNT1 as its output. The output of MUX4 is subtracted from the left-shifted output of POPCOUNT2 in the second subtractor (SUB2). The output of SUB2 represents the dot product of the two ternary vectors. The output of SUB2 is provided, through the fifth multiplexer (MUX5) and the second adder (ADD2), for accumulation with the first accumulator (ACC). The output of ADD2 is stored in ACC. The output of ADD2 is additionally stored in the second accumulator (ACC_D) and compared with a threshold value in the comparator (COMPARATOR) to generate an output value.

In one embodiment, the PE performs dot products on pairs of data vectors in parallel to improve throughput. The PE performs a dot product on a vector pair of ternary data (in the case of a TNN). Each ternary datum has two bits: a mask bit (indicating whether the datum is zero) and a value bit (indicating whether the datum is -1 or +1, represented by 0 and 1, respectively). In this way, the three possible values of a ternary datum (0, -1, and +1) are encoded in 2 bits (00, 10, and 11, respectively). The 2-bit ternary data vector is further split into two separate 1-bit vectors, one for the mask and one for the value. Therefore, to perform a dot product on ternary data, the PE receives four 1-bit vectors as inputs. The pair of input mask bit vectors and the pair of input value bit vectors are processed separately to determine the dot product of the two ternary vectors. Because the input data to the PE is delivered as bit vectors, area- and power-efficient bitwise logic (e.g., bitwise OR, bitwise AND, bitwise XNOR) is used instead of multipliers.
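The mask/value encoding described above can be sketched in software as follows. This is only an illustration under stated assumptions: bit vectors are modeled as Python integers with bit i holding element i, and the function names `encode_ternary` and `decode_ternary` are hypothetical, not taken from the source.

```python
def encode_ternary(values):
    """Split ternary values (-1, 0, +1) into (mask, value) bit vectors.

    Assumption: bit i of each returned integer corresponds to element i.
    mask bit : 1 if the element is non-zero, 0 if it is zero.
    value bit: 1 for +1, 0 for -1 (and 0 for a zero element),
               matching the 2-bit codes 00 (0), 10 (-1), 11 (+1).
    """
    mask = 0
    value = 0
    for i, v in enumerate(values):
        if v not in (-1, 0, 1):
            raise ValueError(f"not a ternary value: {v!r}")
        if v != 0:
            mask |= 1 << i
        if v == 1:
            value |= 1 << i
    return mask, value


def decode_ternary(mask, value, length):
    """Inverse of encode_ternary for the first `length` elements."""
    out = []
    for i in range(length):
        if not (mask >> i) & 1:
            out.append(0)
        else:
            out.append(1 if (value >> i) & 1 else -1)
    return out
```

For instance, `encode_ternary([1, -1, 0, 1])` yields the mask `0b1011` and the value vector `0b1001`; these two integers play the roles of one mask input and one value input of the PE.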

The mask vectors of the ternary data are supplied to BIT VECTOR1 and BIT VECTOR2, and the value vectors are supplied to BIT VECTOR3 and BIT VECTOR4. Any value may be supplied for the bit lengths (BIT LENGTH1 and BIT LENGTH2), because for ternary data the number of non-zero elements is computed internally. AND1 performs a bitwise AND between BIT VECTOR1 and BIT VECTOR2. XNOR1 performs a bitwise XNOR between BIT VECTOR3 and BIT VECTOR4. The outputs of AND1 and XNOR1 are bitwise ANDed in AND3. The output of AND3 holds the non-zero ternary data and is passed through MUX2. POPCOUNT2 counts the number of 1s in the output bit vector of MUX2. The output of POPCOUNT2 is shifted left by 1 bit, and the output of this left shift logic is one input of the subtractor SUB2. The other input to SUB2 comes from the output of MUX4. MUX4 passes the POPCOUNT1 output (the number of 1s at the output of MUX1, i.e., at the output of AND1). The output of MUX4 therefore represents the number of non-zero elements in the elementwise product of the two ternary vectors. The output of SUB2 is thus the final result of the dot product on ternary data. The output of SUB2 is then provided to MUX5. MUX5 receives inputs from SUB2 (the dot product result for ternary data) and from ADD1 (the dot product result for binary data). Depending on the mode (binary or ternary), one of these (SUB2 or ADD1) is passed through MUX5; in ternary mode, the output of SUB2 is passed. The output of MUX5 is the result of the multiplication step of the MAC operation. The accumulation step is performed in ADD2 by adding the output of MUX5 to ACC. When all MAC operations are complete, the output of ACC is stored in ACC_D. ACC_D holds the final OFM value, which is then compared with a threshold value to generate 2-bit ternary data (i.e., 00, 10, or 11).
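The ternary data flow above (AND1, XNOR, AND3, POPCOUNT2, left shift, POPCOUNT1, SUB2) can be mimicked with a few integer operations. The sketch below is an assumption-laden software model, not the patented circuit: bit vectors are Python integers, `n` is the vector length, and `popcount` and `ternary_dot` are hypothetical helper names.

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def ternary_dot(m1, m2, v1, v2, n=16):
    """Dot product of two ternary vectors given as mask/value bit vectors.

    m1, m2: mask vectors (1 = non-zero element).
    v1, v2: value vectors (1 = +1, 0 = -1).
    n     : vector length in bits (16 in the depicted data path).
    """
    lane = (1 << n) - 1
    nonzero = m1 & m2 & lane            # AND: positions where both are non-zero
    same_sign = ~(v1 ^ v2) & lane       # XNOR: positions where signs agree
    plus_one = nonzero & same_sign      # AND3: positions whose product is +1
    # SUB2: (POPCOUNT2 << 1) - POPCOUNT1
    return (popcount(plus_one) << 1) - popcount(nonzero)
```

The subtraction works because, among the non-zero positions, each product is either +1 or -1: if `p` positions give +1 out of `z` non-zero positions, the sum is `p - (z - p) = 2p - z`.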

FIG. 9C illustrates the BNN model used by the fused data path engine to perform a MAC operation on binary data in binary mode, according to an embodiment.

The first XNOR gate (XNOR1) receives the 1-bit elements of the first vector (BIT VECTOR1) and of the second vector (BIT VECTOR2) and generates a product vector of length N bits as its output. Likewise, the second XNOR gate (XNOR2) receives the 1-bit elements of the third vector (BIT VECTOR3) and of the fourth vector (BIT VECTOR4) and generates a product vector of length N bits as its output. The output of XNOR1 is supplied to the input of the first multiplexer (MUX1), and the output of XNOR2 is supplied to the input of the second multiplexer (MUX2).

In addition, the third multiplexer (MUX3) receives the first bit length (BIT LENGTH1) and, via the first pop counter (POPCOUNT1), the output of the first multiplexer (MUX1); that is, the output of MUX1 is supplied to the input of POPCOUNT1. POPCOUNT1 counts the number of 1s in the result vector obtained from the elementwise product of BIT VECTOR1 and BIT VECTOR2 for binary data. This count is passed to MUX3 and MUX4. The fourth multiplexer (MUX4) receives the second bit length (BIT LENGTH2) and, via the second pop counter (POPCOUNT2), the output of the second multiplexer (MUX2). POPCOUNT2 counts the number of 1s in the result vector obtained from the elementwise product of BIT VECTOR3 and BIT VECTOR4 (i.e., the output of MUX2). This count is passed to MUX4. The output of POPCOUNT2 is shifted left by 1 bit. The output of MUX3 is subtracted from the left-shifted output of POPCOUNT1 in the first subtractor (SUB1). The output of SUB1 represents the dot product between BIT VECTOR1 and BIT VECTOR2 for binary data.

In addition, the output of the fourth multiplexer (MUX4) is subtracted from the left-shifted output of the second pop counter (POPCOUNT2) in the second subtractor (SUB2). The output of SUB2 represents the dot product between BIT VECTOR3 and BIT VECTOR4 for binary data. The outputs of SUB2 and SUB1 are added in the first adder (ADD1), and the sum is accumulated with the first accumulator (ACC) through the fifth multiplexer (MUX5) and the second adder (ADD2). The output of ADD2 is stored in ACC. The output of ADD2 is compared with a threshold value in the comparator (COMPARATOR) to generate an output value.

In one embodiment, when the binary data type is {0,1} (i.e., BNN_A), the output of the first pop counter (POPCOUNT1) is passed both to the left shifter and through the third multiplexer (MUX3). Likewise, the output of the second pop counter (POPCOUNT2) is passed both to the left shifter and through the fourth multiplexer (MUX4). The first subtractor (SUB1) subtracts the output of MUX3 from the left-shifted output of POPCOUNT1, and the second subtractor (SUB2) subtracts the output of MUX4 from the left-shifted output of POPCOUNT2. In this way, the outputs of SUB1 and SUB2 are the values POPCOUNT1 and POPCOUNT2, which are the dot product results for the two pairs of binary vectors of the BNN_A data type.

In one embodiment, when the binary data type is {-1,1} (i.e., BNN_B), the output of the first pop counter (POPCOUNT1) passes through the left shifter while the first bit length (BIT LENGTH1) passes through the third multiplexer (MUX3); similarly, the output of the second pop counter (POPCOUNT2) passes through the left shifter while the second bit length (BIT LENGTH2) passes through the fourth multiplexer (MUX4). The first subtractor (SUB1) subtracts the output of MUX3 from the left-shifted output of POPCOUNT1 to compute the dot product between the first and second bit vectors. The second subtractor (SUB2) subtracts the output of MUX4 from the left-shifted output of POPCOUNT2 to compute the dot product between the third and fourth bit vectors. The outputs of SUB1 and SUB2 are thus ((POPCOUNT1 << 1) - BIT LENGTH1) and ((POPCOUNT2 << 1) - BIT LENGTH2), which are the dot product results for the two pairs of binary vectors of the BNN_B data type.
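The two binary cases share the same shifter and subtractor and differ only in the bitwise operation and in what the multiplexer feeds to the subtractor. A minimal software sketch of one stream, under assumptions that are not in the source: bit vectors are Python integers, for BNN_B a 1 bit stands for +1 and a 0 bit for -1, bits beyond `bit_length` are masked off, and the names `popcount` and `binary_dot` are hypothetical.

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def binary_dot(a, b, bit_length, data_type):
    """Dot product of one pair of binary bit vectors (one data flow).

    data_type "BNN_A": elements in {0, 1}, bitwise AND is used.
    data_type "BNN_B": elements in {-1, +1}, bitwise XNOR is used.
    """
    lane = (1 << bit_length) - 1      # assumption: ignore bits past bit_length
    if data_type == "BNN_A":
        count = popcount(a & b & lane)
        mux_out = count               # multiplexer passes the POPCOUNT output
    elif data_type == "BNN_B":
        count = popcount(~(a ^ b) & lane)
        mux_out = bit_length          # multiplexer passes the bit length
    else:
        raise ValueError(f"unknown data type: {data_type!r}")
    # Subtractor: (POPCOUNT << 1) - multiplexer output
    return (count << 1) - mux_out
```

For BNN_A the shift is cancelled, since `(p << 1) - p == p`. For BNN_B, `p` matching positions out of `n` give `p - (n - p) = 2p - n`.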

In an exemplary scenario, the electronic device (1000) includes a fused data path that is more area-efficient than reference hardware that uses 2X2 integer multipliers to perform binary and ternary dot products. In an exemplary scenario, the power efficiency of the electronic device (1000) relative to existing accelerators increases as the size of the PE array (400) of the electronic device (1000) increases.

Two independent vector pairs, {BIT VECTOR1, BIT VECTOR2} and {BIT VECTOR3, BIT VECTOR4}, are used to compute dot products for binary data. Dot product for the BNN_B data type: the active components (highlighted in dark gray) are used to compute the dot product of the BNN_B data type. In the fused data path, the two vector pairs are processed independently in two separate data flows. One data flow (to the left of BIT LENGTH1) follows the sequence XNOR1, MUX1, POPCOUNT1, left shift logic, SUB1. The other data flow (to the right of BIT LENGTH2) follows the sequence XNOR2, MUX2, POPCOUNT2, left shift logic, SUB2. The outputs of the two data flows are added in ADD1. For the BNN_B data type, the bit length is needed to determine the effective vector length. Therefore BIT LENGTH1 and BIT LENGTH2 are passed through MUX3 and MUX4 for the two sequences, respectively.

Dot product for the BNN_A data type: the BNN_A data type requires a bitwise AND operation instead of a bitwise XNOR operation (see FIG. 1). This slightly changes the two data flows described for BNN_B. The left data flow for BNN_A is AND1, MUX1, POPCOUNT1, left shift logic, SUB1. The right data flow for BNN_A is AND2, MUX2, POPCOUNT2, left shift logic, SUB2. For the BNN_A data type, neither the bit length nor the left shift is needed, because the output of the POPCOUNT logic is itself the dot product result. To cancel the effect of the left shift logic on the POPCOUNT output, the unshifted POPCOUNT value is subtracted from the left-shifted value. For example, in the left flow, the output of POPCOUNT1 is subtracted from the output of the left shift logic. Therefore, for the left and right data flows, the outputs of POPCOUNT1 and POPCOUNT2 are passed through MUX3 and MUX4, respectively. As with the BNN_B data type, the outputs of SUB1 and SUB2 are added in ADD1.

The output of ADD1 represents the result of the dot product for binary data (either BNN_A or BNN_B). In binary mode, the output of ADD1 is passed through MUX5. The output of MUX5 is the result of the multiplication step of the MAC operation. The accumulation step is performed in ADD2 by adding the output of MUX5 to ACC. When all MAC operations are complete, the output of ACC is stored in ACC_D. ACC_D holds the final OFM value, which is then compared with a threshold value to generate 1-bit binary data (i.e., 0 or 1).
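The binary-mode MAC described above, with two streams per cycle feeding ADD1 and then the accumulator, might be modeled as follows. This is a rough behavioral sketch and several details are assumptions rather than statements from the source: the class name `FusedBinaryMac`, the bit-to-element mapping, the comparison direction (`>=`) of the threshold, and the clearing of ACC after the result is latched.

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


class FusedBinaryMac:
    """Behavioral model of the fused bitwise data path in binary mode."""

    def __init__(self, data_type, threshold=0):
        if data_type not in ("BNN_A", "BNN_B"):
            raise ValueError(f"unknown data type: {data_type!r}")
        self.data_type = data_type
        self.threshold = threshold
        self.acc = 0      # ACC: updated every cycle
        self.acc_d = 0    # ACC_D: updated when all MACs for one output are done

    def _stream(self, a, b, length):
        lane = (1 << length) - 1
        if self.data_type == "BNN_A":
            count = popcount(a & b & lane)
            mux_out = count
        else:
            count = popcount(~(a ^ b) & lane)
            mux_out = length
        return (count << 1) - mux_out          # SUB1 or SUB2

    def step(self, bv1, bv2, bv3, bv4, len1, len2):
        """One cycle: two independent vector pairs are processed in parallel."""
        sub1 = self._stream(bv1, bv2, len1)
        sub2 = self._stream(bv3, bv4, len2)
        self.acc += sub1 + sub2                # ADD1, then ADD2 with ACC

    def finish(self):
        """Latch ACC into ACC_D and produce the 1-bit output."""
        self.acc_d = self.acc
        self.acc = 0                           # assumption: ACC is cleared
        return 1 if self.acc_d >= self.threshold else 0
```

A typical use would call `step` once per cycle for every pair of input vectors contributing to one output element and then call `finish` once.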

FIG. 10 is a schematic diagram of the traversal in loop format according to an embodiment.

OFM generation by the PE array is shown in FIG. 10. The controller (100) performs a loop traversal that generates the sequence of SRAM addresses used to fetch IFM and kernel data. The loop traversal is illustrated in FIG. 10. The steps of the traversal, in loop format, are as follows.

Step 1: for ofm_ch in range(0, C, OCH)

Step 2: for ofm_h in range(0, 1, OH)

Step 3: for ofm_w in range(0, R, OW)

Step 4: if BNN, ch_step = vector_length X 2

Step 5: else ch_step = vector_length

Step 6: for ifm_ch in range(0, ch_step, ICH)

Step 7: for k_h in range(0, 1, KH)

Step 8: for k_w in range(0, 1, KW)

Step 9: OFM[ofm_ch : (ofm_ch + C), ofm_h, ofm_w : (ofm_w + R)] += matrix_mul(IFM[ifm_ch : (ifm_ch + ch_step), ifm_h + k_h, (ifm_w + k_w) : (ifm_w + k_w) + R], kernel[ofm_ch : (ofm_ch + C), ifm_ch : (ifm_ch + ch_step), k_h, k_w])

In step 1, the OFM channel is updated (marked 3 in the OFM tensor of FIG. 10). In step 2, the OFM height is updated (marked 2 in the OFM tensor of FIG. 10). In step 3, the OFM width is updated (marked 1 in the OFM tensor of FIG. 10). In step 4, the channel step is set for two-stream processing of binary data (i.e., for a BNN). In step 5, the channel step is set for one-stream processing of ternary data (i.e., for a TNN). In step 6, the IFM channel is updated (marked 3 in the IFM tensor of FIG. 10). In step 7, the kernel height is updated (marked 2 in the IFM tensor of FIG. 10). In step 8, the kernel width is updated (marked 1 in the IFM tensor of FIG. 10). In step 9, an OFM tile of size 1 X R X C is generated.

The three-dimensional OFM tensor is generated in tiles with a maximum dimension of 1 X R X C, where R and C are the numbers of rows and columns of the PE array. To generate one OFM tile, an IFM tensor of size KW X KH X ICH is fetched from the IFM by performing steps 6-8, where KW, KH, and ICH denote the kernel width, the kernel height, and the number of IFM channels, respectively. The channel step used for a BNN is twice that used for a TNN, because each ternary datum is represented by 2 bits. The rest of the OFM tensor is generated in row-major order for C OFM channels by performing steps 2-3, after which the remaining OFM channels, up to OCH, are completed.
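The loop nest of steps 1-8 can be written as a small index generator to see how many tile updates (step 9) are issued. The sketch assumes that the pseudocode's `range(a, b, c)` means start `a`, step `b`, stop `c`, which differs from Python's argument order; the helper `srange`, the function `traverse`, and the example sizes are hypothetical.

```python
def srange(start, step, stop):
    """range() in the (start, step, stop) order assumed for the pseudocode."""
    return range(start, stop, step)


def traverse(OCH, OH, OW, ICH, KH, KW, R, C, vector_length, is_bnn):
    """Yield the loop indices at which step 9 updates one OFM tile."""
    # Steps 4-5: a BNN consumes twice as many channels per step as a TNN.
    ch_step = vector_length * 2 if is_bnn else vector_length
    for ofm_ch in srange(0, C, OCH):                     # step 1
        for ofm_h in srange(0, 1, OH):                   # step 2
            for ofm_w in srange(0, R, OW):               # step 3
                for ifm_ch in srange(0, ch_step, ICH):   # step 6
                    for k_h in srange(0, 1, KH):         # step 7
                        for k_w in srange(0, 1, KW):     # step 8
                            yield (ofm_ch, ofm_h, ofm_w, ifm_ch, k_h, k_w)


# Example sizes (made up for illustration only).
sizes = dict(OCH=8, OH=2, OW=4, ICH=64, KH=3, KW=3, R=4, C=4, vector_length=16)
tnn_updates = sum(1 for _ in traverse(is_bnn=False, **sizes))
bnn_updates = sum(1 for _ in traverse(is_bnn=True, **sizes))
```

With these example sizes the BNN traversal issues half as many tile updates as the TNN traversal, which reflects the doubled channel step.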

FIG. 11A is a schematic diagram of the data paths of a PE according to an embodiment.

To support the FPL and the bitwise layers, respectively, each PE contains an FPL data path and a fused bitwise data path (BW data path). The FPL data path performs a dot product between full-precision IFM data and either binary or ternary kernel data. The output of one of the two data paths (FPL or BW) is selected for accumulation based on the layer type, which is either a full precision layer or a hidden layer. The fused data path (BW data path) performs a dot product between kernel data and IFM data that are both represented as binary or ternary data.

In the FPL, the IFM vector is in full-precision (i.e., neither binary nor ternary) format, whereas the kernel vector is in binary or ternary format. Full-precision IFM vectors are typically used in the input layer, which has a small number of channels (e.g., 3 for image processing). In a bitwise layer, both the IFM and kernel vectors are in binary or ternary format. The inputs to the PE are the mode (Op. mode), layer type, data type, IFM stream, kernel stream, vector length 1 (Vector len. 1), and vector length 2 (Vector len. 2). The mode is either binary operation mode or ternary operation mode. The layer type is either FPL or bitwise layer. The data type is either BNN_A or BNN_B. The IFM stream supplies a vector of IFM data to the dispatcher (200), and the kernel stream supplies a vector of kernel data to the dispatcher (200). Vector length 1 is the vector length of the IFM stream, and vector length 2 is the vector length of the kernel stream. The dispatcher (200) supplies the required data to the two data paths. The operating mechanism of the dispatcher (200) is described further with reference to FIGS. 12A-12B. The electronic device packs the full-precision IFM data using the IFM and kernel staging registers already used for the fused bitwise data path. Therefore, no additional storage is needed to hold the full-precision IFM data.

For the FPL data path, the dot product operation is expressed by the following equation.

$$\text{Dot product}(A, W) = \sum_{i=1}^{n} A_i \times W_i$$

Here, $A_i \in \mathbb{R}$ and $W_i \in \{-1, 0, +1\}$ (for ternary data), $W_i \in \{-1, +1\}$ (for BNN_B), or $W_i \in \{0, 1\}$ (for BNN_A).

The multiplication can be replaced by selecting a modified value of $A_i$ and summing, since the three possible values of each product are 0, $-A_i$, and $+A_i$. Accordingly, as shown in FIG. 11A, three inputs (the IFM value, 0, and the IFM value with opposite sign) are provided to each of the multiplexers (MUX8, MUX9, MUX L-1) of the FPL data path. The multiplexers (MUX8, MUX9, MUX L-1) are controlled by the values of the kernel vector (i.e., $W_i$). The outputs of the multiplexers are summed by the adder tree (ADDER TREE) of the FPL data path, and the output of the adder tree is the dot product. For the accumulation operation, a multiplexer (MUX6), controlled by the layer type, selects between the output of the FPL data path (i.e., the output of the adder tree) and the output of the fused data path. The accumulator value is then added to the output of MUX6 to produce the new accumulator value.

The FPL data path of the PE supports the typical dot product operation in which one of the operands is represented with higher precision. A vector of high-precision IFM data and a vector of lower-precision (binary or ternary) kernel data are received as inputs to the FPL data path. When elementwise multiplication is performed between a high-precision IFM value ($A_i$) and a kernel value of binary ($b_i$) or ternary ($t_i$) precision, the product is the same value ($A_i$), the value with opposite sign ($-A_i$), or 0. Following this principle, each multiplexer (MUX8, MUX9, MUX L-1) passes one of these three values. The outputs of the multiplexers are added by the adder tree (ADDER TREE). The output of the adder tree is the result of the dot product for high-precision IFM data.
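The multiplier-free FPL dot product can be sketched as a per-element selection followed by a pairwise reduction. This is an illustrative model only: the function name `fpl_dot`, the use of Python lists, and the zero padding of odd-length levels in the adder tree are assumptions, not details from the source.

```python
def fpl_dot(ifm, kernel):
    """Dot product of full-precision IFM values with a binary/ternary kernel.

    ifm   : list of numbers (the high-precision operand A).
    kernel: list of weights W, each in {-1, 0, +1}.
    No multiplication is used: each weight selects +A, -A, or 0.
    """
    if len(ifm) != len(kernel):
        raise ValueError("ifm and kernel must have the same length")

    # Per-element multiplexer: choose among A, 0, and -A based on W.
    selected = []
    for a, w in zip(ifm, kernel):
        if w == 1:
            selected.append(a)
        elif w == -1:
            selected.append(-a)
        elif w == 0:
            selected.append(0)
        else:
            raise ValueError(f"kernel value must be -1, 0 or +1, got {w!r}")

    # Adder tree: sum neighbouring pairs until a single value remains.
    level = selected
    while len(level) > 1:
        if len(level) % 2:
            level = level + [0]   # pad odd levels (assumption)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0
```

For example, `fpl_dot([3, -2, 5, 7], [1, -1, 0, -1])` selects 3, 2, 0, and -7, and the adder tree sums them.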

FIG. 11A shows the two data paths inside the PE. The output of each data path is selected using the multiplexer MUX6. In FIGS. 9A, 9B, and 9C, the logic units below MUX5 (i.e., ADD2, ACC, MUX6, ACC_D) are shown for completeness of the fused bitwise data path of the PE. In the final design of the PE, these units are shared by both data paths. These logic units are therefore placed outside the two data paths (i.e., below MUX6 in FIG. 11A).

FIG. 11B is a flowchart illustrating a method of calculating a dot product for binary data, ternary data, non-binary data, and non-ternary data using an electronic device, according to an embodiment.

In step 1101, the method includes designing for calculation of a dot product on ternary data. In step 1102, the method includes designing a fused bitwise data path to support dot product calculation for binary data and ternary data. In step 1103, the method includes designing an FPL data path to calculate a dot product between one of the non-binary data and the non-ternary data and one of the binary data and the ternary data. In step 1104, the method includes distributing the dot product calculation for the binary data and the ternary data, and the dot product between one of the non-binary data and the non-ternary data and one of the binary data and the ternary data, to the fused data path and the FPL data path.

In an embodiment, the electronic device 1000 designs the fused data path to support dot product calculation for binary data and ternary data by: receiving one of the ternary data and the binary data; determining an operation mode for the ternary data and the binary data; processing at least one of the ternary data and the binary data in the XNOR gates (XNOR1-XNOR2) and the AND gates (AND1-AND3) using at least one pop count logic (POPCOUNT1-POPCOUNT2) based on the determined operation mode; receiving at least one of the processed ternary data and the processed binary data at an accumulator; and generating at least one of a final ternary value and a final binary value.

In an embodiment, when the fused data path engine 403 of the electronic device 1000 is configured to support both the binary data used in a BNN model and the ternary data used in a TNN model, the electronic device 1000 processes at least one of the ternary data and the binary data in the XNOR gates (XNOR1-XNOR2) and the AND gates (AND1-AND3) using at least one pop count logic (POPCOUNT1-POPCOUNT2) based on the determined operation mode. Using the first XNOR gate (XNOR1), the electronic device 1000 receives 1 bit of the first vector and 1 bit of the second vector and generates a product vector of length N bits as the output of the first XNOR gate (XNOR1).

Using the first AND gate (AND1), the electronic device 1000 receives 1 bit of the first vector and 1 bit of the second vector and generates a product vector of length N bits as the output of the first AND gate (AND1). Using the second XNOR gate (XNOR2), the electronic device 1000 receives 1 bit of the third vector and 1 bit of the fourth vector and generates a product vector of length N bits as the output of the second XNOR gate (XNOR2).

Using the second AND gate (AND2), the electronic device 1000 receives 1 bit of the third vector and 1 bit of the fourth vector and generates a product vector of length N bits as the output of the second AND gate (AND2). The electronic device 1000 supplies the output of the first XNOR gate (XNOR1) and the output of the first AND gate (AND1) as inputs to the first multiplexer (MUX1). The output of the first multiplexer (MUX1) includes, for ternary data, the mask vector of the dot product between the two ternary vectors, or, for binary data, the dot product vector of the first binary vector pair (BIT VECTOR 1 and BIT VECTOR 2).

The electronic device 1000 supplies the output of the first AND gate (AND1) and the output of the second XNOR gate (XNOR2) as inputs to the third AND gate (AND3). Using the second multiplexer (MUX2), the electronic device 1000 receives inputs from the output of the second XNOR gate (XNOR2), the output of the second AND gate (AND2), and the output of the third AND gate (AND3). For ternary data, the output of the second multiplexer (MUX2) includes the value vector of the dot product of the two ternary vectors, which is affected only by the non-zero element pairs in the value vectors of the input ternary vector pair. For binary data, the output of the second multiplexer (MUX2) includes the dot product vector of the second binary vector pair (BIT VECTOR 3 and BIT VECTOR 4).

The electronic device 1000 receives a first bit length and an input from the first multiplexer (MUX1) via the first pop counter (POPCOUNT1). The output of the first multiplexer (MUX1) is provided to the first pop counter (POPCOUNT1), which counts the number of 1s in the mask vector for ternary data and the number of 1s in the dot product vector for binary data. The count is passed to the third multiplexer (MUX3) for binary data and to the fourth multiplexer (MUX4) for ternary data.

The electronic device 1000 provides a second bit length, the output of the first multiplexer (MUX1), and the output of the second pop counter (POPCOUNT2) to the fourth multiplexer (MUX4). The second pop counter (POPCOUNT2) counts the number of 1s in the output of the second multiplexer (MUX2), and this count is passed to the fourth multiplexer (MUX4). The output of the second pop counter (POPCOUNT2) is shifted left by 1. The output of the fourth multiplexer (MUX4) is the output of the first pop counter (POPCOUNT1) for ternary data, the second bit length for binary data type B, or the output of the second pop counter (POPCOUNT2) for binary data type A.

The left-shifted output of the first pop counter (POPCOUNT1) and the output of the third multiplexer (MUX3) are subtracted in the first subtractor (SUB1), and the output of the first subtractor (SUB1) represents the dot product of the first binary vector pair (BIT VECTOR 1 and BIT VECTOR 2). The left-shifted output of the second pop counter (POPCOUNT2) and the output of the fourth multiplexer (MUX4) are subtracted in the second subtractor (SUB2), and the output of the second subtractor (SUB2) represents the dot product between the two ternary vectors for ternary data, or the dot product of the second binary vector pair (BIT VECTOR 3 and BIT VECTOR 4) for binary data.

For binary data, the output of the second subtractor (SUB2) and the output of the first subtractor (SUB1) are added in the first adder (ADD1). The fifth multiplexer (MUX5) selects the output of the first adder (ADD1) for binary data and the output of the second subtractor (SUB2) for ternary data. The output of the fifth multiplexer (MUX5) is added to the value of the first accumulator (ACC) using the second adder (ADD2). The output of the second adder (ADD2) is stored in the first accumulator (ACC) and is compared with a threshold value in the comparator (COMPARATOR) to generate an output value.

In an embodiment, when the fused data path engine 403 of the electronic device 1000 is configured to support the ternary data used in a TNN model (900B), the electronic device 1000 processes at least one of the ternary data and the binary data in the XNOR gates (XNOR1-XNOR2) and the AND gates (AND1-AND3) using at least one pop count logic (POPCOUNT1-POPCOUNT2) according to the determined operation mode. Using the first AND gate (AND1), the electronic device 1000 receives 1 bit of the first vector and 1 bit of the second vector and generates a product vector of length N bits as the output of the first AND gate (AND1).

Using the second XNOR gate (XNOR2), the electronic device 1000 receives 1 bit of the third vector and 1 bit of the fourth vector and generates a product vector of length N bits as the output of the second XNOR gate (XNOR2). The electronic device 1000 provides the output of the first AND gate (AND1) as the input of the first multiplexer (MUX1). The electronic device 1000 provides the output of the first AND gate (AND1) and the output of the second XNOR gate (XNOR2) as inputs to the third AND gate (AND3). Using the second multiplexer (MUX2), the electronic device 1000 receives the output of the second XNOR gate (XNOR2) and the output of the third AND gate (AND3), and the output of the second multiplexer (MUX2) includes the elements of the bit vector obtained from the bitwise operation between the two ternary vectors.

Using the fourth multiplexer (MUX4), the electronic device 1000 receives the second bit length, the input from the second multiplexer (MUX2) via the second pop counter (POPCOUNT2), and the input from the first multiplexer (MUX1) via the first pop counter (POPCOUNT1). The second pop counter (POPCOUNT2) counts the number of 1s in the output of the second multiplexer (MUX2), which is provided to the fourth multiplexer (MUX4), and the output of the second pop counter (POPCOUNT2) is shifted left by 1. The first pop counter (POPCOUNT1) counts the number of 1s in the bit vector obtained after the bitwise AND operation between the mask vectors of the two ternary data, and this count is passed to the fourth multiplexer (MUX4). Here the first bit vector and the second bit vector are the mask vectors, and the number of 1s represents the number of non-zero values obtained after the dot product operation between the two ternary vectors.

The second subtractor (SUB2) subtracts the output of the fourth multiplexer (MUX4) from the left-shifted output of the second pop counter (POPCOUNT2), which removes the influence of the zero elements from the output of the second subtractor (SUB2). The output of the second subtractor (SUB2) is the result of the dot product operation performed on the two ternary vectors. The output of the second subtractor (SUB2) is provided together with the value of the first accumulator (ACC), and the accumulation operation is performed using the fifth multiplexer (MUX5) and the second adder (ADD2). The output of the second adder (ADD2) is stored in the first accumulator (ACC) and is compared with a threshold value in the comparator (COMPARATOR) to generate an output value.
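The ternary flow through AND1, XNOR2, AND3, the two pop counters, and SUB2 can be checked with a small software model. The sketch below is not taken from the specification: it assumes that each ternary element is stored as a mask bit (1 for a non-zero element) and a value bit (1 for +1, 0 for -1), and the helper names `encode_ternary`, `popcount`, and `ternary_dot` are hypothetical.

```python
def encode_ternary(values):
    """Pack -1/0/+1 values into (mask, value) bit vectors, element i at bit i.

    Assumed encoding: mask bit = 1 for a non-zero element, value bit = 1 for +1.
    """
    mask = 0
    value = 0
    for i, v in enumerate(values):
        if v != 0:
            mask |= 1 << i
        if v == 1:
            value |= 1 << i
    return mask, value


def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def ternary_dot(a, b, n):
    """Dot product of two encoded ternary vectors of n elements."""
    mask_a, value_a = a
    mask_b, value_b = b
    all_ones = (1 << n) - 1
    and1 = mask_a & mask_b                    # AND1: positions where both elements are non-zero
    xnor2 = ~(value_a ^ value_b) & all_ones   # XNOR2: positions where the signs agree
    and3 = and1 & xnor2                       # AND3: non-zero positions with a +1 product
    # SUB2: (POPCOUNT2 << 1) - POPCOUNT1 removes the effect of zero elements.
    return (popcount(and3) << 1) - popcount(and1)


a = encode_ternary([1, 0, -1, 1])
b = encode_ternary([-1, 1, -1, 1])
result = ternary_dot(a, b, 4)  # element products are -1, 0, +1, +1
```

Under this encoding, every non-zero product contributes either +1 (counted twice, then corrected by one) or -1 (corrected by one only), which is why the subtraction yields the signed sum.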

In an embodiment, when the fused data path engine 403 of the electronic device 1000 is configured to support the binary data used in a BNN model (900C), the electronic device 1000 processes at least one of the ternary data and the binary data in the XNOR gates (XNOR1-XNOR2) and the AND gates (AND1-AND3) using at least one pop count logic (POPCOUNT1-POPCOUNT2) according to the determined operation mode. Using the first XNOR gate (XNOR1), the electronic device 1000 receives 1 bit of the first vector and 1 bit of the second vector and generates a product vector of length N bits as the output of the first XNOR gate (XNOR1). Using the second XNOR gate (XNOR2), the electronic device 1000 receives 1 bit of the third vector and 1 bit of the fourth vector and generates a product vector of length N bits as the output of the second XNOR gate (XNOR2).

The electronic device 1000 provides the output of the first XNOR gate (XNOR1) as the input of the first multiplexer (MUX1). The electronic device 1000 provides the output of the second XNOR gate (XNOR2) as the input of the second multiplexer (MUX2), and the output of the second multiplexer (MUX2) includes the bit vector obtained after the bitwise XNOR operation between the third bit vector and the fourth bit vector. The electronic device 1000 provides the first bit length, and the output of the first multiplexer (MUX1) via the first pop counter (POPCOUNT1), to the third multiplexer (MUX3). The output of the first multiplexer (MUX1) is provided as the input of the first pop counter (POPCOUNT1), which counts the number of 1s in the bit vector at the output of the first multiplexer (MUX1), and the count is provided to the third multiplexer (MUX3) and the fourth multiplexer (MUX4).

The electronic device 1000 receives the second bit length, the input from the first multiplexer (MUX1) via the first pop counter (POPCOUNT1), and the input from the second multiplexer (MUX2) via the second pop counter (POPCOUNT2). The second pop counter (POPCOUNT2) counts the number of 1s in the output of the second multiplexer (MUX2), which is passed to the fourth multiplexer (MUX4), and the output of the second pop counter (POPCOUNT2) is shifted left by 1. The left-shifted output of the first pop counter (POPCOUNT1) and the output of the third multiplexer (MUX3) are subtracted in the first subtractor (SUB1), and the output of the first subtractor (SUB1) represents the dot product of the first bit vector and the second bit vector. The left-shifted output of the second pop counter (POPCOUNT2) and the output of the fourth multiplexer (MUX4) are subtracted in the second subtractor (SUB2), and the output of the second subtractor (SUB2) represents the dot product of the third bit vector and the fourth bit vector. The output of the second subtractor (SUB2) and the output of the first subtractor (SUB1) are added in the first adder (ADD1), and the accumulation operation is performed with the first accumulator (ACC) using the fifth multiplexer (MUX5) and the second adder (ADD2). The output of the second adder (ADD2) is stored in the first accumulator (ACC) and is compared with a threshold value in the comparator (COMPARATOR) to generate an output value.
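The binary flow described above (XNOR, pop count, left shift, subtraction of the bit length, then ADD1) can be modeled in software as follows. The sketch rests on assumptions that are not stated in the specification: elements take the values -1 and +1, with +1 stored as bit 1 and -1 as bit 0, and the subtractor input selected by MUX3 or MUX4 is the bit length N. The names `encode_binary`, `binary_dot`, and `fused_binary` are hypothetical.

```python
def encode_binary(values):
    """Pack -1/+1 values into a bit vector, element i at bit i (assumed: +1 -> 1, -1 -> 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def binary_dot(x_bits, w_bits, n):
    """Dot product of two n-element -1/+1 vectors from their bit encodings."""
    all_ones = (1 << n) - 1
    xnor = ~(x_bits ^ w_bits) & all_ones      # 1 wherever the two elements agree
    # Subtractor: (popcount << 1) - bit length.
    return (bin(xnor).count("1") << 1) - n


def fused_binary(pair1, pair2, n):
    """Two independent binary dot products summed, as ADD1 does for binary data."""
    return binary_dot(*pair1, n) + binary_dot(*pair2, n)


pair1 = (encode_binary([1, -1, 1, 1]), encode_binary([1, 1, -1, 1]))   # dot product 0
pair2 = (encode_binary([1, 1, 1, 1]), encode_binary([1, 1, 1, -1]))    # dot product 2
result = fused_binary(pair1, pair2, 4)
```

Each agreeing position contributes +1 and each disagreeing position contributes -1, so the dot product equals twice the number of agreements minus the vector length.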

In an embodiment, the electronic device 1000 distributes the dot product calculation for the binary data and the ternary data, and the dot product between one of the non-binary data and the non-ternary data and one of the binary data and the ternary data, to the fused data path and the FPL data path by designing a PE that combines the bitwise data path and the FPL data path, and by using the dispatcher 200 to distribute the required data to the multiple PEs of a two-dimensional PE array so that both the bitwise data path and the FPL data path are supported without additional storage overhead. The PE calculates either the dot product between a pair of binary or ternary data in the bitwise data path, or the dot product between one full-precision datum and one binary or ternary datum in the FPL data path.

The various operations, blocks, and steps in the flowchart 1100 may be performed in the order presented, in a different order, or simultaneously. Further, in some embodiments, some of the operations, blocks, and steps may be omitted, added, modified, or skipped without departing from the scope of the present invention.

FIG. 12A is a schematic diagram illustrating the steps performed by the dispatcher for data delivery using the full precision layer data path, according to an embodiment.

The PE array 400 has two register sets, Set 0 and Set 1, for storing the IFM stream and the kernel stream. The number of registers in a set for storing IFM vectors and kernel vectors equals the number of rows and columns of the PE array 400, respectively. Consider an electronic device 1000 in which the PE array has a dimension of 8 X 8, the vector length is 16 bits, and the full-precision IFM bit width is 8. The PE array 400 has 2 X 8 (i.e., Set 0 and Set 1) IFM registers and kernel registers. Each register is 16 bits in size. The IFM registers and the kernel registers are shown vertically and horizontally, respectively, in FIG. 12A. IFM data is broadcast along the rows of the PE array 400, and kernel data is broadcast along the columns of the PE array 400.

An IFM vector broadcast to a row is formed from two registers, and a kernel vector broadcast to a column is formed from one register. For example, Row 0 receives its IFM vector from R0 and R1 of Set 0, and Row 1 receives its IFM vector from R2 and R3 of Set 0. Likewise, Row 7 receives its IFM vector from R6 and R7 of Set 1. Because the IFM bit width is 8, each IFM register holds two IFM data, and the length of the IFM vector is 4 (i.e., two from each register). To perform the dot product, the kernel vector must have the same length as the IFM vector, i.e., 4. Accordingly, 4 or 8 bits are stored in each kernel register for the binary or ternary format of the kernel data, respectively. In one example, either one of the sets is used for the kernel registers. Column 0 receives kernel data from C0, Column 1 receives kernel data from C1, and likewise Column 7 receives kernel data from the C7 register. In this example, each PE can therefore perform 4 dot product operations every cycle.

To improve performance and data reuse, multiple PEs are organized as a two-dimensional array. The PE array 400 follows a Single Instruction Multiple Data (SIMD) based architecture; because adjacent PEs of the PE array 400 do not communicate with each other, it does not follow a dataflow-based architecture (e.g., a systolic array). To improve input data reuse, an IFM data vector is shared across multiple columns of the PE array 400, while a kernel vector is shared across multiple rows of the PE array 400. All input data is stored in staging registers, which are placed at the edges (left and top) of the PE array 400. A PE array 400 of dimension R X C (where R and C are the numbers of rows and columns of the PE array 400) has R and C staging registers for storing IFM and kernel data, respectively. In an embodiment, there are two sets (Set 0 and Set 1) of staging registers. The dispatcher 200 distributes the input data from the staging registers to each PE of the PE array 400 based on the data path selection (i.e., FPL or fused bitwise) without additional storage overhead. FIGS. 12A and 12B show the data distribution schemes for the FPL data path and the fused bitwise data path, respectively. As an example, the bit width of each staging register is taken to be 16, the precision of the IFM data is taken to be 8 bits, and the dimension of the PE array is taken to be 8 X 8. However, the proposed method works for any register bit width, IFM data precision, and PE array dimension.
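The sharing pattern described above, where each IFM vector is reused across a row and each kernel vector is reused across a column, can be illustrated with a short software model. This is only a sketch of the data reuse, not of the hardware: the function name `pe_array_step` is hypothetical, and a plain multiply-and-sum stands in for whichever data path the PE uses.

```python
def pe_array_step(ifm_rows, kernel_cols):
    """Model one step of an R x C PE array without inter-PE communication.

    ifm_rows[r] is broadcast to every PE in row r; kernel_cols[c] is broadcast
    to every PE in column c. PE (r, c) computes the dot product of the two.
    """
    return [
        [sum(a * w for a, w in zip(ifm, kernel)) for kernel in kernel_cols]
        for ifm in ifm_rows
    ]


# A 2 x 2 array for brevity; the example in the text uses 8 x 8.
outputs = pe_array_step(
    ifm_rows=[[1, 2], [3, 4]],
    kernel_cols=[[1, -1], [0, 1]],
)
```

With R rows and C columns, only R IFM vectors and C kernel vectors are fetched to produce R x C dot products, which is the reuse the staging registers are meant to exploit.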

FIG. 12A shows the data distribution scheme for the FPL data path.

In one example, two 8-bit IFM data are stored together in each 16-bit IFM staging register. Four IFM data are combined to form an IFM vector, which is shared by all columns of a particular row of the PE array 400. For example, R0 and R1 of Set 0 form the IFM vector for the 0th row of the PE array 400. Likewise, R2 and R3 of Set 0 form the IFM vector for the 1st row of the PE array 400, and R6 and R7 of Set 1 form the IFM vector for the 7th row of the PE array 400. For the kernel, each staging register uses 4 (binary) or 8 (ternary) of its 16 bits. In this example, therefore, only one set of staging registers is used to supply kernel data. In one example, C0 of the Set 0 registers supplies kernel data to all rows of the 0th column of the PE array 400. Likewise, C1 of Set 0 supplies the 1st column, and C7 of Set 0 supplies the 7th column of the PE array 400. In this example, each PE in the FPL data path therefore performs four element-wise operations (i.e., on IFM and kernel data pairs) in parallel.
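The row-to-register mapping in this example (Row 0 from R0 and R1 of Set 0, Row 1 from R2 and R3 of Set 0, Row 7 from R6 and R7 of Set 1) follows a simple index pattern, sketched below. The function names are hypothetical, and the byte order and unsigned interpretation used when unpacking a 16-bit register are assumptions, since the text does not specify them.

```python
REGS_PER_SET = 8  # the 8 X 8 example has eight IFM staging registers per set


def ifm_registers_for_row(row):
    """Return (set index, (register, register)) feeding a PE row in the FPL data path."""
    first = 2 * row                                  # each row consumes two registers
    set_index, reg = divmod(first, REGS_PER_SET)     # wrap into Set 1 after eight registers
    return set_index, (reg, reg + 1)


def unpack_ifm(register_value):
    """Split a 16-bit staging register into two 8-bit IFM values.

    Assumed layout: first value in the low byte, second in the high byte, both unsigned.
    """
    low = register_value & 0xFF
    high = (register_value >> 8) & 0xFF
    return low, high


row0 = ifm_registers_for_row(0)
row1 = ifm_registers_for_row(1)
row7 = ifm_registers_for_row(7)
```

Two registers of two values each give the IFM vector length of 4 stated in the text.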

FIG. 12B is a schematic diagram illustrating the steps performed by the dispatcher for data delivery using the fused bitwise data path, according to an embodiment.

The PE array 400 has two register sets, Set 0 and Set 1, for storing the IFM stream and the kernel stream. The number of registers in a set for storing IFM vectors and kernel vectors equals the number of rows and columns of the PE array 400, respectively. Consider an electronic device 1000 in which the PE array 400 has a dimension of 8 X 8, the vector length is 16 bits, and the full-precision IFM bit width is 8. The PE array 400 includes 2 X 8 (i.e., Set 0 and Set 1) IFM registers and kernel registers. Each register is 16 bits in size. The IFM registers and the kernel registers are shown vertically and horizontally, respectively, in FIG. 12B. The IFM data broadcast along a row of the PE array 400 and the kernel data broadcast along a column of the PE array 400 are each stored in two registers. For example, Row 0 receives its IFM vector from the R0 registers of Set 0 and Set 1. Likewise, Row 7 receives its IFM vector from the R7 registers of Set 0 and Set 1. Column 0 obtains its kernel vector from C0 of Set 0 and Set 1, and Column 7 obtains its kernel vector from C7 of Set 0 and Set 1. Because each register is 16 bits in size, each PE can perform 32 (2 X 16) dot product operations per cycle for binary data and 16 for ternary data.

FIG. 12B shows the data distribution scheme for the fused bitwise data path.

Unlike the FPL data path, the fused bitwise data path uses both sets of staging registers to supply data to the PE array 400. Because each PE must receive four input data (two for the IFM and two for the kernel), two registers each for the IFM and the kernel are used to provide the required data. In one example, two registers are used to supply two independent IFM vectors (for binary data) or one IFM vector (for ternary data) to all columns of a particular row of the PE array 400. Similarly, two registers are used to supply two independent kernel vectors (for binary data) or one kernel vector (for ternary data) to all rows of a particular column of the PE array 400. For example, {R0, R1} of Set 0 provides IFM data to the 0th row of the PE array 400, and {R6, R7} of Set 1 provides IFM data to the 7th row of the PE array 400. For kernel data, {C0, C0} of Set 0 and Set 1 is supplied to the 0th column, and {C1, C1} of Set 0 is supplied to the 1st column. In one example, each PE in the fused bitwise data path performs 32 (for binary data) or 16 (for ternary data) element-wise operations in parallel.
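The per-PE parallelism figures quoted for the two data paths (32 for binary, 16 for ternary, and 4 for the FPL data path) follow from the register widths in the running example. The short calculation below reproduces them; the function name `elements_per_cycle` is hypothetical, and the bits-per-element values (1 for binary, 2 for ternary, 8 for full-precision IFM) are those given in the description.

```python
def elements_per_cycle(registers, register_bits, bits_per_element):
    """Number of elements a PE receives per cycle from its staging registers."""
    return registers * register_bits // bits_per_element


# Two 16-bit registers per operand in the running example.
binary_ops = elements_per_cycle(registers=2, register_bits=16, bits_per_element=1)
ternary_ops = elements_per_cycle(registers=2, register_bits=16, bits_per_element=2)
fpl_ops = elements_per_cycle(registers=2, register_bits=16, bits_per_element=8)
```

The same formula applies to other register widths and precisions, consistent with the statement that the scheme is not tied to the 8 X 8, 16-bit example.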

The foregoing description is for illustrative purposes only, and those of ordinary skill in the art to which this disclosure belongs will understand that it can readily be modified into other specific forms without changing the technical spirit or essential features of the present invention. It is to be understood that the phraseology and terminology used herein are for description and not limitation. Accordingly, although the embodiments herein have been described in terms of preferred embodiments, those of ordinary skill in the art will recognize that the embodiments may be practiced with modification within the scope of the embodiments described herein.

Claims

A method of calculating a dot product for binary data, ternary data, non-binary data, and non-ternary data, the method comprising:
Designing, by an electronic device, to calculate a dot product for the ternary data;
Designing, by the electronic device, a fused bitwise data path for supporting dot product calculation of the binary data and the ternary data;
Designing, by the electronic device, a full precision layer (FPL) data path for calculating a dot product between one of the non-binary data and the non-ternary data, and one of the binary data and the ternary data; and
Distributing, by the electronic device, the dot product between the binary data and the ternary data, and the dot product between one of the non-binary data and the non-ternary data and one of the binary data and the ternary data, to the fused bitwise data path and the FPL data path.

The method of claim 1,
Wherein designing the fused bitwise data path to support dot product calculation for the binary data and the ternary data comprises:
Receiving, by the electronic device, one of the ternary data and the binary data;
Determining, by the electronic device, an operation mode for the ternary data and the binary data;
Processing, by the electronic device, at least one of the ternary data and the binary data in an XNOR gate and an AND gate using at least one pop count logic based on the determined operation mode;
Receiving, by the electronic device, at least one of the processed ternary data and the processed binary data at an accumulator; and
Generating, by the electronic device, at least one of a final ternary value and a final binary value.

The method of claim 2,
Wherein, when the fused data path engine of the electronic device is configured to support a binary data model used in a Binary Neural Network (BNN) model and a ternary data model used in a Ternary Neural Network (TNN) model, processing, by the electronic device, at least one of the ternary data and the binary data in the XNOR gate and the AND gate using the at least one pop count logic based on the determined operation mode comprises:
Receiving 1 bit of a first vector and 1 bit of a second vector using a first XNOR gate, and generating a product vector of length N bits as an output of the first XNOR gate;
Receiving the 1 bit of the first vector and the 1 bit of the second vector using a first AND gate, and generating a product vector of length N bits as an output of the first AND gate;
Receiving 1 bit of a third vector and 1 bit of a fourth vector using a second XNOR gate, and generating a product vector of length N bits as an output of the second XNOR gate;
Receiving the 1 bit of the third vector and the 1 bit of the fourth vector using a second AND gate, and generating a product vector of length N bits as an output of the second AND gate;
Supplying an output of the first XNOR gate and an output of the first AND gate as inputs of a first multiplexer, wherein the output of the first multiplexer includes the mask vector of the dot product between two ternary vectors in the case of the ternary data, or the dot product vector of a first binary vector pair in the case of the binary data;
Providing an output of the first AND gate and an output of the second XNOR gate as inputs of a third AND gate;
Receiving, using a second multiplexer, an input from an output of the second XNOR gate, an output of the second AND gate, and an output of the third AND gate, wherein the output of the second multiplexer, in the case of the ternary data, includes the value vector of the dot product of the ternary vector pair, affected only by non-zero element pairs of the value vectors of the input ternary vector pair, and, in the case of the binary data, includes the dot product vector of a second binary vector pair;
Receiving, using a third multiplexer, a first bit length and an input from the first multiplexer through a first pop counter, wherein the output of the first multiplexer is provided as an input of the first pop counter, wherein the first pop counter calculates the number of 1s in the mask vector in the case of the ternary data and the number of 1s in the dot product vector in the case of the binary data, and wherein the calculated result is transferred to the third multiplexer in the case of the binary data and to a fourth multiplexer in the case of the ternary data; and
Receiving, using the fourth multiplexer, a second bit length, the output of the first pop counter, and an output of a second pop counter, wherein the second pop counter calculates the number of 1s in the output of the second multiplexer, wherein the output of the second pop counter is shifted left by 1, and wherein the output of the fourth multiplexer includes the output of the first pop counter in the case of the ternary data, the second bit length for binary data type B, or the output of the second pop counter for binary data type A;
The left-shifted output of the first pop counter and the output of the third multiplexer are subtracted by a first subtractor, where the output of the first subtractor represents the dot product of two bit-wise vector pairs,
The left-shifted output of the second pop counter and the output of the fourth multiplexer are subtracted by a second subtractor, where the output of the second subtractor represents the dot product of a ternary vector pair in the case of the ternary data, or the dot product of a binary vector pair in the case of the binary data,
In the case of the binary data, the output of the second subtractor and the output of the first subtractor are added by a first adder, where a fifth multiplexer selects the output of the first adder in the case of the binary data or the output of the second subtractor in the case of the ternary data, and where the output of the fifth multiplexer is added to the first accumulator using a second adder,
The output of the second adder is stored in the first accumulator, wherein the output of the second adder is compared to a threshold value of a comparator to produce an output value.

The method of claim 2,
Wherein, when the fused data path engine of the electronic device is configured to support the ternary data used in the TNN (Ternary Neural Network) model, processing, by the electronic device, at least one of the ternary data and the binary data in the XNOR gate and the AND gate using the at least one pop count logic based on the determined operation mode comprises:
Receiving 1 bit of a first vector and 1 bit of a second vector using a first AND gate, and generating a product vector of length N bits as an output of the first AND gate;
Receiving 1 bit of a third vector and 1 bit of a fourth vector using a second XNOR gate, and generating a product vector of length N bits as an output of the second XNOR gate;
Supplying an output of the first AND gate as an input of a first multiplexer;
Supplying an output of the first AND gate and an output of the second XNOR gate as inputs of a third AND gate;
Receiving, using a second multiplexer, an input from an output of the second XNOR gate and an output of the third AND gate, wherein the output of the second multiplexer contains the elements of a bit vector obtained in a bitwise operation between two ternary vectors; and
Receiving, using a fourth multiplexer, a second bit length, an input from the second multiplexer through a second pop counter, and an input from the first multiplexer through a first pop counter, wherein the second pop counter calculates the number of 1s in the output of the second multiplexer and provides it to the fourth multiplexer, wherein the output of the second pop counter is shifted left by 1, wherein the first pop counter calculates the number of 1s in the bit vector obtained after a bitwise AND operation between the mask vectors of the two ternary data and provides it to the fourth multiplexer, wherein the first vector and the second vector are mask vectors, and wherein the number of 1s represents the number of non-zero values obtained after the dot product of the two ternary vectors;
The left-shifted output of the second pop counter and the output of the fourth multiplexer are subtracted by a second subtractor to remove the influence of the zero elements from the output of the second subtractor, where the output of the second subtractor represents the result of the dot product operation of the two ternary vectors,
The output of the second subtractor is provided, together with a first accumulator, to perform an accumulation operation using a fifth multiplexer and a second adder,
Wherein the output of the second adder is stored in the first accumulator, and the output of the second adder is compared to a threshold value of a comparator to generate an output value.
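The ternary steps recited above reduce to a short bitwise computation, sketched below in Python. The sketch assumes a two-bit-plane encoding in which a mask bit of 1 marks a non-zero element and a value bit of 1 marks +1 (0 marks -1). That encoding, the `encode_ternary` helper, and all names are illustrative assumptions, not the claimed circuit.

```python
def encode_ternary(xs):
    """Pack a list over {-1, 0, +1} into (mask, value) integers; bit i is element i."""
    mask = value = 0
    for i, x in enumerate(xs):
        if x != 0:
            mask |= 1 << i   # mask bit set for non-zero elements
        if x > 0:
            value |= 1 << i  # value bit set for +1 (assumed encoding)
    return mask, value


def ternary_dot(a, b, n):
    """Dot product of two n-element ternary vectors given as (mask, value) pairs."""
    (mask_a, val_a), (mask_b, val_b) = a, b
    full = (1 << n) - 1
    mask = mask_a & mask_b           # first AND gate: positions where both are non-zero
    agree = ~(val_a ^ val_b) & full  # second XNOR gate: positions where signs agree
    plus = agree & mask              # third AND gate: drop positions with a zero operand
    pop1 = bin(mask).count("1")      # first pop counter: number of non-zero products
    pop2 = bin(plus).count("1")      # second pop counter: number of +1 products
    return (pop2 << 1) - pop1        # second subtractor: 2 * pop2 - pop1


a = [1, -1, 0, 1, -1]
b = [-1, -1, 1, 1, 0]
print(ternary_dot(encode_ternary(a), encode_ternary(b), len(a)))
```

The subtraction works because, among the pop1 non-zero products, pop2 are +1 and the remaining pop1 - pop2 are -1, so their sum is 2 * pop2 - pop1.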

The method of claim 2,
Wherein, when the fused data path engine of the electronic device is configured to support binary data used in a binary neural network (BNN) model, processing, by the electronic device, at least one of the ternary data and the binary data in the XNOR gate and the AND gate using the at least one pop count logic based on the determined operation mode comprises:
Receiving 1 bit of a first vector and 1 bit of a second vector using a first XNOR gate, and generating a product vector of length N bits as an output of the first XNOR gate;
Receiving 1 bit of a third vector and 1 bit of a fourth vector using a second XNOR gate, and generating a product vector of length N bits as an output of the second XNOR gate;
Providing an output of the first XNOR gate as an input of a first multiplexer;
Providing an output of the second XNOR gate as an input of a second multiplexer, wherein the output of the second multiplexer includes a bit vector obtained after a bitwise XNOR operation between the third vector and the fourth vector;
Receiving, using a third multiplexer, a first bit length and an input from the first multiplexer through a first pop counter, wherein the output of the first multiplexer is supplied as an input of the first pop counter, and wherein the first pop counter calculates the number of 1s in the bit vector output by the first multiplexer and provides the result to the third multiplexer and a fourth multiplexer; and
Receiving, using the fourth multiplexer, a second bit length, an input from the first multiplexer through the first pop counter, and an input from the second multiplexer through a second pop counter, wherein the second pop counter calculates the number of 1s in the output of the second multiplexer and provides it to the fourth multiplexer, and wherein the output of the second pop counter is shifted left by 1;
The left-shifted output of the first pop counter and the output of the third multiplexer are subtracted by a first subtractor, where the output of the first subtractor represents the dot product between the first vector and the second vector,
The left-shifted output of the second pop counter and the output of the fourth multiplexer are subtracted by a second subtractor, where the output of the second subtractor represents the dot product of the third vector and the fourth vector,
The output of the second subtractor and the output of the first subtractor are added by a first adder, and the output of the first adder is provided, together with a first accumulator, to perform an accumulation operation using a fifth multiplexer and a second adder,
Wherein the output of the second adder is stored in the first accumulator, and the output of the second adder is compared to a threshold value of a comparator to generate an output value.
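For the binary mode, the following Python sketch traces one step of the recited flow for two vector pairs: XNOR, pop count, left shift, subtraction of the bit length, addition of the two results, accumulation, and comparison. It assumes a bit of 1 encodes +1 and a bit of 0 encodes -1, and that the comparator outputs 1 when the accumulated value is at least the threshold. Both assumptions and all names are hypothetical.

```python
def pack_binary(xs):
    """Pack a list over {-1, +1} into an integer; bit i is 1 when element i is +1."""
    return sum(1 << i for i, x in enumerate(xs) if x > 0)


def xnor_popcount_dot(a, b, n):
    """Dot product of two n-element {-1, +1} vectors stored as bit vectors."""
    agree = ~(a ^ b) & ((1 << n) - 1)          # XNOR gate
    return (bin(agree).count("1") << 1) - n    # (pop count << 1) - bit length


def binary_pe_step(acc, pair1, pair2, n1, n2, threshold):
    """Process two independent binary vector pairs and update the accumulator."""
    d1 = xnor_popcount_dot(pair1[0], pair1[1], n1)  # first subtractor output
    d2 = xnor_popcount_dot(pair2[0], pair2[1], n2)  # second subtractor output
    acc = acc + (d1 + d2)                           # first adder, then second adder
    out = 1 if acc >= threshold else 0              # comparator (direction assumed)
    return acc, out


p1 = (pack_binary([1, -1, 1, 1]), pack_binary([1, 1, -1, 1]))
p2 = (pack_binary([-1, -1, 1]), pack_binary([-1, -1, 1]))
print(binary_pe_step(0, p1, p2, 4, 3, threshold=2))
```

Each XNOR result has pop matching positions and n - pop mismatching ones, so the dot product is pop - (n - pop), which is the shifted pop count minus the bit length.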

The method of claim 3,
The dot product performs a calculation for a multiply and accumulate (MAC) operation in a BNN model and a TNN model.

The method of claim 1,
The FPL data path performs a dot product between full precision input feature map (IFM) data and one of binary kernel data and ternary kernel data.
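The claim above states only that the FPL data path forms a dot product between full precision IFM data and binary or ternary kernel data. Because such kernel values are restricted to a small set, each term can in principle be handled by adding, subtracting, or skipping the IFM value. The Python sketch below assumes kernel values in {-1, 0, +1} and this multiplier-free realization; neither the realization nor the function name is taken from the source.

```python
def fpl_dot(ifm, kernel):
    """Dot product of full-precision `ifm` with a kernel over {-1, 0, +1}."""
    acc = 0.0
    for x, w in zip(ifm, kernel):
        if w > 0:
            acc += x   # kernel +1: add the IFM value
        elif w < 0:
            acc -= x   # kernel -1: subtract the IFM value
        # kernel 0 (ternary only): contributes nothing
    return acc


print(fpl_dot([0.5, -2.0, 3.0], [1, -1, 0]))
```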

The method of claim 1,
The output of one of the FPL data path and the fused data path is selected for accumulation based on a layer type, and the layer type is a full precision layer or a hidden layer.

The method of claim 1,
The fused data path performs a dot product between kernel data and IFM data, and the kernel data and the IFM data are represented by the binary data or the ternary data.

The method of claim 1,
Wherein distributing, by the electronic device, the dot product between the binary data and the ternary data, and the dot product between one of the non-binary data and the non-ternary data and one of the binary data and the ternary data, to the fused data path and the FPL data path comprises:
Designing, by the electronic device, to form a processing element (PE) by combining the bitwise data path and the FPL data path, wherein the PE calculates the dot product between binary or ternary data pairs in the bitwise data path, or calculates the dot product between one full precision data and one binary or ternary data in the FPL data path; and
Distributing, by the electronic device, necessary data to multiple PEs of a two-dimensional PE array using a dispatcher to support both the bitwise data path and the FPL data path without additional storage overhead.

An electronic device for calculating the dot product for binary data, ternary data, non-binary data, and non-ternary data, comprising:
Static random access memory (SRAM);
At least one controller configured to transmit an address to the SRAM;
Processing engine (PE) array; And
A dispatcher configured to receive at least one SRAM data and deliver the at least one SRAM data to the PE array,
Wherein the PE array includes a PE array controller, an output feature map (OFM) combiner, and a fused data path engine, and
Wherein the fused data path engine is configured to provide a fused data path for a binary neural network (BNN) model and a ternary neural network (TNN) model.

The electronic device of claim 11,
The BNN model is used to calculate a dot product for a pair of binary data, and the TNN model is used to calculate a dot product of ternary data.

The electronic device of claim 11,
Wherein the fused data path engine includes:
Three AND gates;
Two XNOR gates;
Six multiplexers;
Two subtractors;
Two pop counters;
Two adders;
Two accumulators; and
A comparator,
In the two XNOR gates, a first XNOR gate receives 1 bit of a first vector and 1 bit of a second vector,
In the two XNOR gates, a second XNOR gate receives 1 bit of a third vector and 1 bit of a fourth vector,
In the three AND gates, a first AND gate receives the 1 bit of the first vector and the 1 bit of the second vector,
In the three AND gates, a second AND gate receives the 1 bit of the third vector and the 1 bit of the fourth vector,
In the three AND gates, a third AND gate receives an input from the first AND gate and the second XNOR gate,
In the six multiplexers, a first multiplexer receives an input from the first XNOR gate and the first AND gate,
In the six multiplexers, a second multiplexer receives an input from the second XNOR gate, the second AND gate, and the third AND gate,
In the six multiplexers, a third multiplexer receives a first bit length and an input from the first multiplexer through a first pop counter in the two pop counters,
In the six multiplexers, a fourth multiplexer receives a second bit length, an input from the first multiplexer through the first pop counter, and an input from the second multiplexer through a second pop counter in the two pop counters,
In the two subtractors, a first subtractor receives a left-shifted output of the first pop counter and an output of the third multiplexer,
In the two subtractors, a second subtractor receives a left-shifted output of the second pop counter and an output of the fourth multiplexer,
In the two adders, a first adder receives an input from the first subtractor and the second subtractor,
In the six multiplexers, a fifth multiplexer receives inputs from the first adder and the second subtractor,
In the two adders, a second adder receives inputs from the fifth multiplexer and a first accumulator in the two accumulators,
In the six multiplexers, a sixth multiplexer receives an input from the second adder and a second accumulator in the two accumulators,
The comparator receives an input from the second accumulator,
Wherein the fused data path engine is configured to perform at least one of:
Supporting at least one of binary data used in the BNN model and ternary data used in the TNN model,
Combining the data paths for the BNN model and the TNN model into a single fused data path,
Processing the ternary data in a bitwise manner, and
Combining the data paths for the BNN model and the TNN model operating on a data type into the single fused data path.

The electronic device of claim 13,
If the fused data path engine is configured to support the binary data used by the BNN model and the ternary data used by the TNN model,
The first XNOR gate receives the 1 bit of the first vector and the 1 bit of the second vector to generate a product vector of length N bits as an output of the first XNOR gate,
The first AND gate receives the 1 bit of the first vector and the 1 bit of the second vector to generate a product vector of length N bits as an output of the first AND gate,
The second XNOR gate receives the 1 bit of the third vector and the 1 bit of the fourth vector to generate a product vector of length N bits as an output of the second XNOR gate,
The second AND gate receives the 1 bit of the third vector and the 1 bit of the fourth vector to generate a product vector of length N bits as an output of the second AND gate,
The output of the first XNOR gate and the output of the first AND gate are provided as inputs of the first multiplexer, where the output of the first multiplexer includes the output of the first XNOR gate in the case of binary data type B, or the output of the first AND gate in the case of binary data type A and the ternary data,
The output of the first AND gate and the output of the second XNOR gate are supplied to the input of the third AND gate,
The second multiplexer receives an input from an output of the second XNOR gate, an output of the second AND gate, and an output of the third AND gate, wherein the output of the second multiplexer includes the output of the third AND gate in the case of the ternary data, the output of the second XNOR gate in the case of the binary data type B, or the output of the second AND gate in the case of the binary data type A,
The third multiplexer receives a first bit length and an input from the first multiplexer through the first pop counter, and the output of the first multiplexer is provided to the first pop counter, where the first pop counter counts the non-zero elements of the bit vector obtained by the bitwise operation of the first XNOR gate or the first AND gate, and the result is provided to the third multiplexer and the fourth multiplexer,
The fourth multiplexer receives a second bit length, an input from the first multiplexer through the first pop counter, and an input from the second multiplexer through the second pop counter, wherein the second pop counter calculates the number of 1s in the output of the second multiplexer and provides it to the fourth multiplexer, where the output of the second pop counter is shifted left by 1,
The left-shifted output of the first pop counter and the output of the third multiplexer are subtracted by the first subtractor,
The left-shifted output of the second pop counter and the output of the fourth multiplexer are subtracted by the second subtractor, where the output of the second subtractor represents the dot product of a ternary vector pair in the case of the ternary data, or the dot product of a binary vector pair in the case of the binary data,
The output of the second subtractor and the output of the first subtractor are added by the first adder, and an accumulation operation is performed using the fifth multiplexer and the second adder together with the first accumulator, wherein the output of the first adder is the input of the fifth multiplexer for supporting the binary data used in the BNN model, and the output of the second subtractor is the input of the fifth multiplexer for supporting the ternary data used in the TNN model,
Wherein the output of the second adder is stored in the first accumulator, and the output of the second adder is compared with a threshold value of the comparator to generate an output value.
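The multiplexer selections recited for the combined engine can be modeled as a single function with a mode argument, as in the Python sketch below. The mode names, the assumption that the third multiplexer passes the first pop count for binary data type A, and the convention that the first and second vectors carry masks while the third and fourth carry values in ternary mode are illustrative assumptions. Accumulation and comparison are omitted.

```python
def popcount(x):
    return bin(x).count("1")


def fused_dot(mode, v1, v2, v3, v4, n1, n2):
    """Model of the fused path; mode is 'ternary', 'binary_b', or 'binary_a'."""
    xnor1 = ~(v1 ^ v2) & ((1 << n1) - 1)  # first XNOR gate
    and1 = v1 & v2                        # first AND gate
    xnor2 = ~(v3 ^ v4) & ((1 << n2) - 1)  # second XNOR gate
    and2 = v3 & v4                        # second AND gate
    and3 = and1 & xnor2                   # third AND gate
    mux1 = xnor1 if mode == "binary_b" else and1
    mux2 = {"ternary": and3, "binary_b": xnor2, "binary_a": and2}[mode]
    pop1, pop2 = popcount(mux1), popcount(mux2)
    mux3 = n1 if mode == "binary_b" else pop1  # binary_a choice is an assumption
    mux4 = {"ternary": pop1, "binary_b": n2, "binary_a": pop2}[mode]
    sub1 = (pop1 << 1) - mux3             # first subtractor
    sub2 = (pop2 << 1) - mux4             # second subtractor
    add1 = sub1 + sub2                    # first adder
    return sub2 if mode == "ternary" else add1  # fifth multiplexer


# Ternary: masks 0b0111 and 0b1110, values 0b0101 and 0b0100 (value bit 1 = +1).
print(fused_dot("ternary", 0b0111, 0b1110, 0b0101, 0b0100, 4, 4))
# Binary type B ({-1, +1}) and type A ({0, 1}): two independent vector pairs each.
print(fused_dot("binary_b", 0b1010, 0b1001, 0b111, 0b111, 4, 3))
print(fused_dot("binary_a", 0b1011, 0b0110, 0b111, 0b101, 4, 3))
```

With these selections the same two subtractors yield 2 * pop - n for type B, 2 * pop - pop = pop for type A, and 2 * pop2 - pop1 for ternary data, which is why one data path can serve all three cases.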

The electronic device of claim 13,
If the fused data path engine is configured to support the ternary data used by the TNN model,
The first AND gate receives the 1 bit of the first vector and the 1 bit of the second vector to generate a product vector of length N bits as an output of the first AND gate,
The second XNOR gate receives the 1 bit of the third vector and the 1 bit of the fourth vector to generate a product vector of length N bits as an output of the second XNOR gate,
An output of the first AND gate is provided as an input of the first multiplexer,
The output of the first AND gate and the output of the second XNOR gate are provided as inputs of the third AND gate,
The second multiplexer receives an input from an output of the second XNOR gate and an output of the third AND gate, wherein the output of the second multiplexer contains the non-zero elements of a bit vector obtained in a bitwise operation between two ternary vectors,
The fourth multiplexer receives a second bit length, an input from the second multiplexer through the second pop counter, and an input from the first multiplexer through the first pop counter, wherein the second pop counter calculates the number of 1s in the output of the second multiplexer and provides it to the fourth multiplexer, the output of the second pop counter is shifted left by 1, and the first pop counter counts the non-zero elements of a bit vector obtained from a bitwise operation between the two ternary vectors and provides the result to the fourth multiplexer,
The left-shifted output of the second pop counter and the output of the fourth multiplexer are subtracted by the second subtractor to remove the influence of zero elements from the output of the second subtractor, where the output of the second subtractor is the result of the dot product performed on the two ternary vectors, and the output of the fourth multiplexer is the output of the first pop counter,
The output of the second subtractor is provided, together with the first accumulator, to perform an accumulation operation using the fifth multiplexer and the second adder,
Wherein the output of the second adder is stored in the first accumulator, and the output of the second adder is compared with a threshold value of the comparator to generate an output value.

The electronic device of claim 13,
If the fused data path engine is configured to support binary data used by the BNN model,
The first XNOR gate receives the 1 bit of the first vector and the 1 bit of the second vector to generate a product vector of length N bits as an output of the first XNOR gate,
The second XNOR gate receives the 1 bit of the third vector and the 1 bit of the fourth vector to generate a product vector of length N bits as an output of the second XNOR gate,
The output of the first XNOR gate is provided as an input of the first multiplexer, and the output of the first multiplexer includes a bit vector after a bit-wise XNOR operation on the first vector and the second vector,
The output of the second XNOR gate is provided as an input of the second multiplexer, and the output of the second multiplexer includes a bit vector obtained after a bit-wise XNOR operation on the third vector and the fourth vector,
The third multiplexer receives a first bit length and an input from the first multiplexer through the first pop counter, and the output of the first multiplexer is provided to the first pop counter, where the first pop counter calculates the number of 1s in the bit vector obtained after the bitwise XNOR operation between the first vector and the second vector in the dot product operation, and provides the result to the third multiplexer and the fourth multiplexer,
The fourth multiplexer receives a second bit length, an input from the first multiplexer through the first pop counter, and an input from the second multiplexer through the second pop counter, wherein the second pop counter calculates, from the output of the second multiplexer, the number of 1s in the bit vector obtained after the bitwise XNOR operation between the third vector and the fourth vector and provides it to the fourth multiplexer, wherein the output of the second pop counter is shifted left by 1,
The left-shifted output of the first pop counter and the output of the third multiplexer are subtracted by the first subtractor, where the output of the first subtractor represents the dot product between the first vector and the second vector, and the output of the third multiplexer is the first bit length,
The left-shifted output of the second pop counter and the output of the fourth multiplexer are subtracted by the second subtractor, where the output of the second subtractor represents the dot product between the third vector and the fourth vector, and the output of the fourth multiplexer is the second bit length,
The output of the second subtractor and the output of the first subtractor are added by the first adder, and the output of the first adder is provided, together with the first accumulator, to perform an accumulation operation using the fifth multiplexer and the second adder,
Wherein the output of the second adder is stored in the first accumulator, and the output of the second adder is compared with a threshold value of the comparator to generate an output value.

The electronic device of claim 14,
The dot product calculates a multiply and accumulate (MAC) operation on the binary data and the ternary data.

The electronic device of claim 11,
Wherein the dispatcher is configured to deliver data to the FPL (Full Precision Layer) data path and the fused data path, and the FPL data path and the fused data path are combined together to form a PE in the PE array.

The electronic device of claim 18,
Wherein the FPL data path performs a dot product between a full precision IFM layer and one of a binary kernel stream and a ternary kernel stream.

The electronic device of claim 18,
An output of one of the FPL data path and the fused data path is selected for accumulation based on a layer type, wherein the layer type is a full precision layer or a hidden layer.

The electronic device of claim 11,
Wherein the PE array performs a plurality of MAC operations in parallel on data provided by the dispatcher.
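The parallel MAC behavior of the PE array can be pictured with a small software model in Python. The sketch assumes that the PE at row r and column c receives the IFM vector dispatched to row r and the kernel vector dispatched to column c, and adds their dot product to its own accumulator. The function names and the sequential loops standing in for parallel hardware are illustrative assumptions.

```python
def plain_dot(a, b):
    """Reference dot product standing in for a PE's data path."""
    return sum(x * y for x, y in zip(a, b))


def pe_array_step(acc, ifm_rows, kernel_cols, dot=plain_dot):
    """One dispatch step: PE (r, c) accumulates dot(ifm_rows[r], kernel_cols[c])."""
    for r, ifm in enumerate(ifm_rows):          # IFM vector shared by a whole row
        for c, ker in enumerate(kernel_cols):   # kernel vector shared by a whole column
            acc[r][c] += dot(ifm, ker)          # every PE does this in parallel in hardware
    return acc


acc = [[0, 0], [0, 0]]
print(pe_array_step(acc, [[1, -1], [1, 1]], [[1, 1], [-1, 1]]))
```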

The electronic device of claim 18,
The fused data path performs a dot product between kernel data and IFM data, wherein the kernel data and the IFM data are represented by the binary data or the ternary data.