KR102459985B1 - Error Reduction Techniques for Content Addressable Memory based Binary Neural Network Accelerator

Info

Publication number: KR102459985B1
Application number: KR1020210161147A
Authority: KR
Inventors: 서영교; 최수름; 전영준
Original assignee: 인하대학교 산학협력단
Priority date: 2021-11-22
Filing date: 2021-11-22
Publication date: 2022-10-27

Abstract

Proposed is an error correction technique for a content addressable memory (CAM)-based binarized neural network (BNN) accelerator. The proposed CAM-based BNN accelerator comprises: a first reference match line (ML) that compares a plurality of mismatched additional CAM cells with a single reference ML; and a second reference ML that compares a plurality of matched additional CAM cells with the single reference ML, wherein an XNOR-popcount operation is selectively performed through the plurality of additional CAM cells, the first reference ML, and the second reference ML to reduce operation errors. The invention can thereby reduce the probability of hardware errors.

Description

Error Reduction Techniques for Content Addressable Memory based Binary Neural Network Accelerator

The present invention relates to a method and apparatus for error correction in a content addressable memory (CAM)-based BNN accelerator.

Binarized neural networks (BNNs) have received considerable attention as lightweight neural networks because their weights and activations are restricted to two values, which reduces the amount of data and is advantageous for hardware [1], [2]. However, conventional computing systems that use separate processing units and data storage elements still incur large power and performance overheads from data transfer between the processing elements and memory.

Therefore, a new BNN accelerator is needed that performs high-performance, robust multiply-accumulate operations directly within the memory circuit.

[1] Y. H. Chen et al. 2017. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE JSSC, 52, 1, (2017), 127-138.
[2] S. Han et al. 2016. EIE: Efficient Inference Engine on Compressed Deep Neural Network. (2016). arXiv:1602.01528.
[3] M. Courbariaux et al. 2016. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. (2016). arXiv:1602.02830.
[4] M. Rastegari et al. 2016. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In 2016 Computer Vision, 525-542.
[5] W. Tang et al. 2017. How to Train a Compact Binary Neural Network with High Accuracy? In 2017 AAAI, 2625-2631.
[6] D. Miyashita et al. 2017. A Neuromorphic Chip Optimized for Deep Learning and CMOS Technology With Time-Domain Analog and Digital Mixed-Signal Processing. IEEE JSSC, 52, 10, (2017), 2679-2689.
[7] K. Pagiamtzis et al. 2006. Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey. IEEE JSSC, 41, 3, (2006), 712-727.
[8] W. Choi, K. Jeong, K. Choi, K. Lee, and J. Park. 2018. Content addressable memory based binarized neural network accelerator using time-domain signal processing. In ACM/IEEE Design Automation Conference (DAC), 2018.
[9] H. Yonekawa et al. 2017. On-Chip Memory Based Binarized Convolutional Deep Neural Network Applying Batch Normalization Free Technique on an FPGA. In 2017 IPDPSW, 98-105.

The technical problem addressed by the present invention is to provide a content addressable memory (CAM)-based BNN accelerator, and an operating method thereof, that performs convolution efficiently by exploiting fully parallel search operations while reducing the hardware error probability, and that reduces the memory size and the complexity of the convolution operation by using binarized data. The proposed CAM-based BNN accelerator aims to reduce the degradation of operational reliability and the hardware error probability caused by process variation during match line (ML) sensing.

In one aspect, the CAM-based BNN accelerator proposed in the present invention includes a first reference match line (ML) that compares a plurality of mismatched additional CAM cells with a single reference ML, and a second reference ML that compares a plurality of matched additional CAM cells with the single reference ML, and selectively performs an XNOR-popcount operation through the plurality of additional CAM cells, the first reference ML, and the second reference ML in order to reduce operation errors.

The first reference ML and the second reference ML are connected by an ML switch to make input feature map (iFMAP) connections, and in the case of downward filter sliding, additional read and write operations are performed to rearrange the iFMAP pairs.

By performing the XNOR-popcount operation in a CAM array that includes the plurality of additional CAM cells, the convolution operation of the BNN can be replaced by the CAM array without additional digital logic circuits. To reduce the hardware error rate caused by process variation, an XNOR-popcount operation accompanied by an SRAM read operation and an additional clock cycle is performed.

The plurality of additional CAM cells enable data processing by exploiting the parallel ML operation of the CAM array that includes them.

In another aspect, the error correction method of the CAM-based BNN accelerator proposed in the present invention includes: comparing, through a first reference match line (ML), a plurality of mismatched additional CAM cells with a single reference ML; comparing, through a second reference ML, a plurality of matched additional CAM cells with the single reference ML; and selectively performing an XNOR-popcount operation through the plurality of additional CAM cells, the first reference ML, and the second reference ML in order to reduce operation errors.

With the CAM-based BNN accelerator and its operating method according to embodiments of the present invention, convolution can be performed efficiently by exploiting fully parallel search operations while reducing the hardware error probability, and the memory size and the complexity of the convolution operation can be reduced by using binarized data. The proposed CAM-based BNN accelerator can reduce the degradation of operational reliability and the hardware error probability caused by process variation during match line (ML) sensing.

FIG. 1 is a diagram illustrating a BNN operation truth table according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining BNN activation and XNOR-popcount according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining a CAM-based XNOR operation according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a circuit of a CAM-based BNN accelerator according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining ML connections using an ML switch according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining a multi-reference sensing method according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating an error correction method of a CAM-based BNN accelerator according to an embodiment of the present invention.
FIG. 8 is a diagram comparing the error correction results of a CAM-based BNN accelerator according to an embodiment of the present invention with those of the prior art.

The binarized neural network (BNN) is one of the most efficient neural networks for low-cost convolution. A BNN uses binarized data to reduce the memory size and the complexity of the convolution operation.

A content addressable memory (CAM)-based BNN accelerator can perform convolution efficiently by exploiting the fully parallel search operation of the CAM. However, one of the major problems of CAM-based BNN hardware is that its operational reliability is severely degraded by process variation during match line (ML) sensing.

Therefore, the present invention proposes a new CAM array design that can reduce the hardware error probability. The proposed CAM-based accelerator achieves a 62% reduction in XNOR-popcount operation errors, and the classification accuracy drop on the Fashion-MNIST dataset falls from 2.33% to 1.26%. Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating a BNN operation truth table according to an embodiment of the present invention.

FIG. 1(a) shows the truth table of the BNN multiplication operation based on binarized values of +1 and -1 [3]. For a simplified hardware implementation, however, it is preferable to process binarized bits of 1 and 0 rather than using -1 [4], [5]. When -1 is mapped to 0, the truth table of the multiplication changes as shown in FIG. 1(b), and an XNOR gate performs the same operation.
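The equivalence between bipolar multiplication and XNOR described for FIG. 1 can be checked exhaustively. The following is a minimal sketch, not part of the claimed circuit; the helper names `to_bit`, `xnor`, and `check_equivalence` are hypothetical and introduced only for illustration.

```python
from itertools import product


def to_bit(value: int) -> int:
    """Map a bipolar value (+1 / -1) to a bit (1 / 0), i.e. -1 becomes 0."""
    return 1 if value == 1 else 0


def xnor(a: int, b: int) -> int:
    """XNOR of two bits: 1 when the bits are equal, 0 otherwise."""
    return 1 - (a ^ b)


def check_equivalence() -> bool:
    """Check all four input pairs: bit(x * w) must equal XNOR(bit(x), bit(w))."""
    for x, w in product((-1, 1), repeat=2):
        if to_bit(x * w) != xnor(to_bit(x), to_bit(w)):
            return False
    return True


print(check_equivalence())
```

Because only four input combinations exist, the exhaustive check is a complete verification of the truth-table claim.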

FIG. 2 is a diagram for explaining BNN activation and XNOR-popcount according to an embodiment of the present invention.

In the BNN of FIG. 2(a), the post-accumulation signum function can be mapped to a bit-count (popcount) operation using a comparator with a threshold of n/2 (half the accumulation size), as shown in FIG. 2(b).
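The mapping from the signum function to a popcount threshold follows from the identity sum = 2 * popcount - n, where popcount is the number of matching bit positions. The sketch below verifies this for small n. It assumes that a zero sum maps to +1 (the tie convention is not stated in the description), and all function names are hypothetical.

```python
from itertools import product


def signum_bipolar(xs, ws) -> int:
    """Sign of the bipolar dot product. Assumption: a zero sum maps to +1."""
    total = sum(x * w for x, w in zip(xs, ws))
    return 1 if total >= 0 else -1


def popcount_activation(x_bits, w_bits) -> int:
    """XNOR-popcount followed by an n/2 threshold comparator."""
    n = len(x_bits)
    popcount = sum(1 - (x ^ w) for x, w in zip(x_bits, w_bits))
    return 1 if 2 * popcount >= n else 0


def verify(n: int) -> bool:
    """Exhaustively compare both formulations for all bipolar vectors of length n."""
    for xs in product((-1, 1), repeat=n):
        for ws in product((-1, 1), repeat=n):
            x_bits = [1 if v == 1 else 0 for v in xs]
            w_bits = [1 if v == 1 else 0 for v in ws]
            expected = 1 if signum_bipolar(xs, ws) == 1 else 0
            if popcount_activation(x_bits, w_bits) != expected:
                return False
    return True


print(verify(4))
```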

FIG. 3 is a diagram for explaining a CAM-based XNOR operation according to an embodiment of the present invention.

A CAM compares input search data with stored data and returns the address of the matching location. FIG. 3 shows a CAM cell composed of a 6T SRAM cell with two-stack NMOS transistors on both sides [6], [7].

ML sensing in a CAM array operates as follows. Before the ML precharge phase, the search-line (SL) pairs are held at ground to prevent unintended ML discharge. In the ML precharge phase, every ML in the array is precharged to the supply voltage. Once the SL pairs are set to the search data, the two-stack NMOS transistors compare the search data with the stored data.

As shown in the truth table of FIG. 3, when the search data does not match the stored data, a pull-down path is formed and discharges the ML node. Conversely, when the search data matches the stored data, the two stacked NMOS transistors are turned off and there is no current path from the ML to ground [8].
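The match and mismatch behavior of FIG. 3 can be summarized with a purely logical model. This is a behavioral sketch only, under the assumption that each cell contributes one pull-down path when its stored bit differs from the search bit; it does not model the SL pair encoding or any analog effects, and the function names are hypothetical.

```python
def cell_pulls_down(stored_bit: int, search_bit: int) -> bool:
    """A cell forms a pull-down path only when stored and search bits differ."""
    return stored_bit != search_bit


def mismatch_count(stored_word, search_word) -> int:
    """Number of cells on one match line that form a pull-down path."""
    if len(stored_word) != len(search_word):
        raise ValueError("stored and search words must have the same length")
    return sum(cell_pulls_down(s, q) for s, q in zip(stored_word, search_word))


def ml_stays_precharged(stored_word, search_word) -> bool:
    """The ML keeps its precharged level only when every cell matches."""
    return mismatch_count(stored_word, search_word) == 0


print(mismatch_count([1, 0, 1, 1], [1, 1, 1, 0]))
```

The number of matching cells, n minus the mismatch count, is the XNOR-popcount value for that match line.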

FIG. 4 is a diagram illustrating a circuit of a CAM-based BNN accelerator according to an embodiment of the present invention.

The proposed CAM-based BNN accelerator includes a first reference match line (ML_REF0) and a second reference match line (ML_REF1).

The first reference ML compares a plurality of mismatched additional CAM cells with a single reference ML. The second reference ML compares a plurality of matched additional CAM cells with the single reference ML. The XNOR-popcount operation can then be selectively performed through the plurality of additional CAM cells, the first reference ML, and the second reference ML in order to reduce operation errors.

In the BNN multiply-accumulate (MAC) operation according to an embodiment of the present invention, the input X is mapped to the storage nodes of the CAM cells, and the search-line nodes are determined by the weight inputs.

Following the majority rule, the XNOR-popcount operation can be performed by comparing the discharge delay of the evaluated ML with that of the reference ML [4], [8]. As shown in FIG. 4, the filter weights are connected to the SLs, and the input feature map (iFMAP) is stored in the CAM array for use in the XNOR-popcount operation.

For example, the XNOR operation is performed by comparing the iFMAP with the weights in the CAM cells. Because the ML discharge delay depends on the number of mismatched CAM cells, comparing the voltage of the ML with that of the reference ML through a sense amplifier reveals whether the result is mostly '1' or mostly '0'. The ML switch (ML_Switch) located between Bank 1 and Bank 2 is used for filter sliding.
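The time-domain comparison against a single reference ML can be illustrated with a toy delay model. The sketch assumes that the discharge delay is inversely proportional to the number of pull-down paths, which is a first-order simplification and not the circuit behavior reported in the description; `unit_delay` and both function names are hypothetical.

```python
import math


def discharge_delay(mismatches: int, unit_delay: float = 1.0) -> float:
    """Toy model: more mismatched cells give more pull-down current and less delay."""
    if mismatches == 0:
        return math.inf  # no pull-down path, so the ML never discharges
    return unit_delay / mismatches


def sense_majority(eval_mismatches: int, ref_mismatches: int) -> int:
    """Return 1 (mostly matching) when the evaluated ML discharges no faster
    than the reference ML, and 0 (mostly mismatching) otherwise."""
    slower_or_equal = discharge_delay(eval_mismatches) >= discharge_delay(ref_mismatches)
    return 1 if slower_or_equal else 0


# Eight cells per ML with the reference set at four mismatches (n/2).
print(sense_majority(2, 4), sense_majority(6, 4))
```

In this ideal model the comparison is exact. In silicon, process variation perturbs both delays, which is why cases with a small mismatch difference are the unreliable ones.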

FIG. 5 is a diagram for explaining ML connections using an ML switch according to an embodiment of the present invention.

The ML switch located between Bank 1 and Bank 2 is used for filter sliding. First, as shown in FIG. 5(a), the XNOR-popcount operation is performed on four MLs simultaneously. For the next rightward filter slide, the ML connections must be reconfigured by the ML switch to make the required iFMAP connections. For a downward filter slide, by contrast, additional read and write operations are needed to rearrange the appropriate iFMAP pairs. By performing the XNOR-popcount operation in the modified CAM array, the heavy convolution operation of the BNN can be replaced by the CAM array without additional digital logic circuits. In addition, the parallel ML operation of the CAM array enables high data throughput in the CAM-based BNN accelerator.

However, the main drawback of the CAM-based BNN accelerator [8] is that the reliability of the XNOR-popcount operation is severely affected by process variation, which causes large sensing errors. The computational reliability of the XNOR-popcount operation therefore needs to be improved for cases in which the mismatch difference is small.

The present invention proposes a new design technique that reduces operation errors by selectively performing an additional logic-based XNOR-popcount operation. The technique is based on the fact that XNOR-popcount operation errors occur mainly when the mismatch difference between the evaluated ML and the reference ML is small. When the mismatch difference is small, the two ML discharge rates are close to each other, so the ML sensing reliability becomes vulnerable to process variation.

According to an embodiment of the present invention, a new sensing scheme using multiple references, as shown in FIG. 4, is proposed to solve the ML sensing reliability problem. The two separate references are determined as follows. In one reference ML (REF0), three additional CAM cells are mismatched compared with the conventional single-reference scheme, so it discharges faster. In the other reference ML (REF1), three additional CAM cells are matched, so it discharges more slowly than the conventional reference.

FIG. 6 is a diagram for explaining a multi-reference sensing method according to an embodiment of the present invention.

The reference ML (REF0) and the reference ML (REF1) correspond to the 5-bit mismatch and 3-bit mismatch cases, respectively. When there are no mismatched inputs, the ML does not discharge, so the evaluated ML voltage remains higher than the voltages of both REF0 and REF1. The error probability of the XNOR-popcount operation is therefore almost zero.

Likewise, an evaluated ML whose inputs all mismatch discharges much faster than both REF0 and REF1, so the XNOR-popcount operation is also highly reliable in this case. In the 4-bit mismatch case, however, the mismatch difference is small, so the reliability of the XNOR-popcount operation is low. When the ML discharge rate falls between the two references, it is preferable to perform the digital-logic-based XNOR-popcount operation separately in order to achieve high classification accuracy.

Accordingly, the proposed technique performs a digital-logic-based XNOR-popcount operation, accompanied by an SRAM read operation and an additional clock cycle, to mitigate the hardware error rate caused by process variation.
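The selective fallback described for FIG. 6 can be sketched as a three-way decision. The sketch uses mismatch counts as a stand-in for ML discharge rates and takes the reference values 5 and 3 from the FIG. 6 example. How an evaluated ML that exactly equals a reference is treated is an assumption here (it is resolved by the reference), and all function and parameter names are hypothetical.

```python
def digital_xnor_popcount(stored, search) -> int:
    """Exact digital XNOR-popcount with an n/2 threshold (the fallback path)."""
    n = len(stored)
    matches = sum(1 - (s ^ q) for s, q in zip(stored, search))
    return 1 if 2 * matches >= n else 0


def dual_reference_sense(stored, search, ref0_mismatches=5, ref1_mismatches=3):
    """Return (activation, used_fallback).

    ref0 is the faster-discharging reference and ref1 the slower one.
    Only results that fall strictly between them use the digital fallback.
    """
    if len(stored) != len(search):
        raise ValueError("stored and search words must have the same length")
    mismatches = sum(s ^ q for s, q in zip(stored, search))
    if mismatches >= ref0_mismatches:
        return 0, False  # discharges at least as fast as REF0: mostly mismatching
    if mismatches <= ref1_mismatches:
        return 1, False  # discharges no faster than REF1: mostly matching
    return digital_xnor_popcount(stored, search), True


stored = [1] * 8
print(dual_reference_sense(stored, [1, 1, 1, 1, 0, 0, 0, 0]))
```

Only the ambiguous band pays the cost of the SRAM read and the additional clock cycle, so the common, clearly separated cases keep the speed of analog sensing.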

FIG. 7 is a flowchart illustrating an error correction method of a CAM-based BNN accelerator according to an embodiment of the present invention.

The proposed error correction method of the CAM-based BNN accelerator includes: comparing, through a first reference match line (ML), a plurality of mismatched additional CAM cells with a single reference ML (710); comparing, through a second reference ML, a plurality of matched additional CAM cells with the single reference ML (720); and selectively performing an XNOR-popcount operation through the plurality of additional CAM cells, the first reference ML, and the second reference ML in order to reduce operation errors (730).

In step 710, a plurality of mismatched additional CAM cells are compared with a single reference ML through the first reference match line (ML).

In step 720, a plurality of matched additional CAM cells are compared with the single reference ML through the second reference ML.

In step 730, the XNOR-popcount operation is selectively performed through the plurality of additional CAM cells, the first reference ML, and the second reference ML in order to reduce operation errors.

According to an embodiment of the present invention, the first reference ML and the second reference ML are connected by an ML switch to make input feature map (iFMAP) connections, and in the case of downward filter sliding, additional read and write operations can be performed to rearrange the iFMAP pairs.

By performing the XNOR-popcount operation in a CAM array that includes the plurality of additional CAM cells according to an embodiment of the present invention, the convolution operation of the BNN can be replaced by the CAM array without additional digital logic circuits. To reduce the hardware error rate caused by process variation, an XNOR-popcount operation accompanied by an SRAM read operation and an additional clock cycle can be performed. In addition, the parallel ML operation of the CAM array that includes the plurality of additional CAM cells can be exploited to enable data processing.

FIG. 8 is a diagram comparing the error correction results of a CAM-based BNN accelerator according to an embodiment of the present invention with those of the prior art.

To verify the effectiveness of the CAM-based BNN accelerator according to an embodiment of the present invention, the second convolutional layer of the LeNet-5 model was implemented in a commercial 28 nm CMOS technology. Five banks (150 cells per ML) were implemented in the CAM array architecture for the HSPICE simulations.

FIG. 8(a) shows the relationship between the mismatch difference (the difference in the number of mismatches between the reference ML and the evaluated ML) and the error probability of the XNOR-popcount operation in a single-reference CAM array.

As described above, the operation failure rate decreases as the absolute value of the mismatch difference increases. Because the proposed CAM array (Prop.) performs the digital-logic-based XNOR-popcount operation in the cases where operational reliability is weak, it shows a much lower operation failure rate than the single-reference scheme (1Ref.), as shown in FIG. 8(b).

Table 1 compares the error probability and classification accuracy of the BNN accelerators.

<Table 1>

As shown in Table 1, when the ML sensing error probability is applied to the original XNOR-popcount results, 7.51% of all output activations for the Fashion-MNIST test images are flipped, which degrades the top-1 accuracy of LeNet-5 by 2.33%. In contrast, the proposed memory architecture limits XNOR-popcount operation errors to 2.88%, a 62% reduction in hardware errors. This corresponds to a classification accuracy drop of 1.26% relative to the original top-1 accuracy. In addition, the proposed BNN accelerator achieves a 51.72% reduction in operation cycles per input image compared with the conventional XNOR-popcount operation shown in [9].
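The 62% figure can be reproduced from the two error rates quoted above as a relative reduction. The short check below uses only the numbers stated in this paragraph; the function name `relative_reduction` is hypothetical.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction of a quantity, as a fraction of its original value."""
    return (before - after) / before


# Flipped output activations: 7.51% (single reference) versus 2.88% (proposed).
error_reduction = relative_reduction(7.51, 2.88)

# Top-1 accuracy drop: 2.33 points (single reference) versus 1.26 points (proposed).
accuracy_points_recovered = 2.33 - 1.26

print(f"hardware error reduction: {error_reduction:.1%}")
print(f"accuracy recovered: {accuracy_points_recovered:.2f} percentage points")
```

The relative reduction evaluates to roughly 0.617, which rounds to the stated 62%, and the recovered accuracy of about 1.07 percentage points is consistent with the improvement of more than 1% claimed for Fashion-MNIST.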

As described above, the present invention proposes a new reliability enhancement technique for a fast and robust CAM-based BNN accelerator. The heavy convolution operation of a BNN can be replaced by the XNOR-popcount operation in the modified CAM array. Parallel convolution can be carried out by the fully parallel search of the CAM array, which improves computational performance by more than 50%. In addition, the proposed multi-reference sensing scheme mitigates the XNOR-popcount error probability problem and improves the Fashion-MNIST classification accuracy by more than 1%.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that execute on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored in one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of the computer-readable recording medium include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine language code such as that generated by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, those skilled in the art can make various modifications and variations from the above description. For example, an appropriate result may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of the system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims are also within the scope of the following claims.

Claims

1. A BNN accelerator comprising:
a first reference match line (ML) that compares a plurality of additional mismatched CAM cells to a single reference ML; and
a second reference ML that compares a plurality of additional matched CAM cells to the single reference ML,
wherein an XNOR-popcount operation is selectively performed through the plurality of additional CAM cells, the first reference ML, and the second reference ML to reduce operation errors,
wherein the first reference ML and the second reference ML are connected by ML switches to form an input feature map (iFMAP) connection, and
wherein, in the case of downward filter sliding, additional read and write operations are performed to rearrange the iFMAP pairs.

2. (Deleted)

3. The BNN accelerator of claim 1,
wherein the XNOR-popcount operation is performed on a CAM array including the plurality of additional CAM cells, so that the convolution operation of the BNN can be replaced by the CAM array without additional digital logic circuits, and
wherein, to reduce the hardware error rate due to process variations, the XNOR-popcount operation is performed and is followed by an SRAM read operation with an additional clock cycle.

4. The BNN accelerator of claim 3,
wherein the plurality of additional CAM cells utilize parallel ML operations of the CAM array including the plurality of additional CAM cells to enable data processing.

5. An error correction method of a BNN accelerator, the method comprising:
comparing a plurality of additional mismatched CAM cells to a single reference ML via a first reference match line (ML);
comparing a plurality of additional matched CAM cells to the single reference ML via a second reference ML; and
selectively performing an XNOR-popcount operation through the plurality of additional CAM cells, the first reference ML, and the second reference ML to reduce operation errors,
wherein the selectively performing of the XNOR-popcount operation comprises:
connecting the first reference ML and the second reference ML by an ML switch to form an input feature map (iFMAP) connection; and
performing additional read and write operations to rearrange the iFMAP pairs in the case of downward filter sliding.

6. (Deleted)

7. The method of claim 5,
wherein the selectively performing of the XNOR-popcount operation through the plurality of additional CAM cells, the first reference ML, and the second reference ML to reduce operation errors comprises:
performing the XNOR-popcount operation on a CAM array including the plurality of additional CAM cells, so that the convolution operation of the BNN can be replaced by the CAM array without additional digital logic circuits; and
to reduce the hardware error rate due to process variations, performing the XNOR-popcount operation followed by an SRAM read operation with an additional clock cycle.

8. The method of claim 7,
wherein parallel ML operations of the CAM array including the plurality of additional CAM cells are utilized to enable data processing.
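
For illustration only, the XNOR-popcount operation recited in the claims can be sketched in software. The sketch below is not the claimed circuit and does not model CAM cells, match lines, or the reference MLs; it only shows the standard arithmetic identity by which a dot product of two vectors with values in {-1, +1} equals 2 × popcount(XNOR) − n when +1 is encoded as bit 1 and -1 as bit 0. The function names (`to_bits`, `xnor_popcount`, `binary_dot`) and the sample vectors are assumptions introduced for this example.

```python
def to_bits(values):
    """Pack a list of +1/-1 values into an integer (+1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def xnor_popcount(a_bits, w_bits, n):
    """Count the bit positions (out of n) where the two words agree."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ w_bits) & mask  # 1 wherever the bits match
    return bin(xnor).count("1")


def binary_dot(a_bits, w_bits, n):
    """Dot product of two {-1, +1} vectors from their bit encodings.

    Each match contributes +1 and each mismatch -1, so the result is
    matches - (n - matches) = 2 * matches - n.
    """
    return 2 * xnor_popcount(a_bits, w_bits, n) - n


# Hypothetical sample data: one 8-element activation/weight pair.
activations = [1, -1, 1, 1, -1, -1, 1, -1]
weights = [1, 1, -1, 1, -1, 1, 1, -1]
n = len(activations)

reference = sum(a * w for a, w in zip(activations, weights))
result = binary_dot(to_bits(activations), to_bits(weights), n)
assert result == reference
```

In this software analogy, counting matching bit positions plays the role that the match-line evaluation plays in the CAM array; any correspondence to the hardware timing or error behavior is outside what the sketch shows.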