KR102600086B1

KR102600086B1 - A Resource Efficient Integer-Arithmetic-Only FPGA-Based CNN Accelerator for Real-Time Facial Emotion Recognition

Info

Publication number: KR102600086B1
Application number: KR1020210178711A
Authority: KR
Inventors: 강진구; 김재명; 김용우
Original assignee: 인하대학교 산학협력단; 상명대학교 천안산학협력단
Priority date: 2021-12-14
Filing date: 2021-12-14
Publication date: 2023-11-08
Also published as: KR20230089927A

Abstract

A resource-efficient integer-arithmetic-only FPGA-based CNN accelerator for real-time facial emotion recognition and a method of operating the same are presented. The proposed method of operating an FPGA-based CNN accelerator, implemented on a heterogeneous SoC (System on Chip) platform that includes a CPU (Processing System; PS) and an FPGA (Programmable Logic; PL), comprises: generating, on the CPU, a training dataset that combines a plurality of image processing algorithms to improve CNN facial expression recognition performance, and converting parameters through floating-point training on the generated training dataset; extracting, on the CPU, quantized parameters through quantization-aware training of the parameters converted through floating-point training, and performing integer-arithmetic-only CNN reconstruction; and performing real-time facial emotion recognition using the integer-arithmetic-only CNN in the FPGA (PL) and CPU (PS) regions of the heterogeneous SoC platform.

Description

{A Resource Efficient Integer-Arithmetic-Only FPGA-Based CNN Accelerator for Real-Time Facial Emotion Recognition}

The present invention relates to a resource-efficient integer-arithmetic-only FPGA-based CNN accelerator for real-time facial emotion recognition and a method of operating the same.

Computers play a central role in industry and society and are quickly becoming a part of everyday life. Accordingly, the need for research on human-computer interaction is increasing. For smooth interaction between humans and computers, a computer must be able to analyze human intentions and respond accordingly. Emotions shown in facial expressions are a universal and effective way of expressing human intentions. Analyzing a person's intentions through the expressions on their face is called facial expression recognition, and it is used in various fields, including the automotive and robotics industries. To accurately understand the expressions on a person's face, a computer must detect the face and automatically classify the expression into one of a set of expression classes.

The prior art defines seven basic emotions (happiness, anger, fear, surprise, disgust, sadness, and neutral) and shows that these basic emotions are recognized in the same way regardless of culture. In computer vision, the Facial Action Coding System (FACS), a facial expression analysis method proposed in the prior art, has been used as a representative facial emotion recognition model. FACS encodes the basic movements of facial muscles as Action Units (AUs) and then uses combinations of AUs to recognize facial expressions. However, the development of real-time automated systems was hampered by insufficient computing power and inefficient preprocessing algorithms.

With advances in image processing algorithms and improvements in computing power, facial expression recognition evolved into methods that extract hand-crafted features through a three-stage pipeline. This approach consists of image preprocessing, feature extraction, and expression classification. The image preprocessing stage uses filtering and histogram equalization to remove irrelevant information and enhance information related to facial expressions. The feature extraction stage extracts facial features from the image using feature extractors such as the Gabor wavelet kernel, Local Binary Pattern (LBP), Active Shape Model (ASM), and Haar-like feature templates. The expression classification stage uses a classifier such as a Support Vector Machine (SVM), K-Nearest Neighbor (KNN), or AdaBoost to classify the extracted features into expression classes. Methods that use hand-crafted features require a suitable feature extractor and expression classifier to be designed separately, so the two stages cannot be optimized jointly. In addition, external factors (for example, pose variation, occlusion, and illumination) can cause severe performance degradation.

Owing to recent breakthroughs in hardware and big data technology, deep learning-based methods show excellent performance on complex problems and are being applied to a wide range of applications. In computer vision tasks such as image classification and object detection in particular, methods based on Convolutional Neural Networks (CNNs) have shown excellent performance. Unlike conventional methods, CNN-based approaches are more robust in noisy environments, which greatly reduces the dependence on image preprocessing and feature extraction. These methods can also optimize all parameters at once through end-to-end training. In facial expression recognition research, CNN-based methods have outperformed conventional methods, but they use many convolutional layers when building the network. Such networks require a large number of operations and a large memory footprint. Their use is therefore limited in applications that require real-time processing, such as Advanced Driver Assistance Systems (ADAS). They are also unsuitable for environments with limited hardware resources, such as Virtual Reality (VR).

[1] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, "Learned step size quantization," 2019, arXiv:1902.08153. [Online]. Available: https://arxiv.org/abs/1902.08153
[2] S. R. Jain, A. Gural, M. Wu, and C. H. Dick, "Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks," in Proc. Mach. Learn. Syst., Mar. 2020, pp. 112-128.
[3] H. Phan-Xuan, T. Le-Tien, and S. Nguyen-Tan, "FPGA platform applied for facial expression recognition system using convolutional neural networks," Procedia Comput. Sci., vol. 151, pp. 651-658, Jan. 2019.
[4] T. N. D. Phuc, N. N. Nhut, N. Trinh, and L. Tran, "Facial expression recognition system using FPGA-based convolutional neural network," in Research in Intelligent and Computing in Engineering, vol. 1254. Singapore: Springer, Jan. 2021, pp. 341-351.
[5] P. T. Vinh and T. Q. Vinh, "Facial expression recognition system on SoC FPGA," in Proc. Int. Symp. Electr. Electron. Eng. (ISEE), Ho Chi Minh City, Vietnam, Oct. 2019, pp. 1-4.
[6] R. Ding, G. Su, G. Bai, W. Xu, N. Su, and X. Wu, "A FPGA-based accelerator of convolutional neural network for face feature extraction," in Proc. IEEE Int. Conf. Electron Devices Solid-State Circuits (EDSSC), Xi'an, China, Jun. 2019, pp. 1-3.
[7] R. Ding, X. Tian, G. Bai, G. Su, and X. Wu, "Hardware implementation of convolutional neural network for face feature extraction," in Proc. IEEE 13th Int. Conf. ASIC (ASICON), Chongqing, China, Oct. 2019, pp. 1-4.
[8] Y. Bengio, N. Leonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," 2013, arXiv:1308.3432. [Online]. Available: https://arxiv.org/abs/1308.3432
[9] K. Abdelouahab, M. Pelcat, J. Serot, and F. Berry, "Accelerating CNN inference on FPGAs: A survey," 2018, arXiv:1806.01683. [Online]. Available: https://arxiv.org/abs/1806.01683
[10] L. Zhu. THOP: PyTorch-OpCounter. Accessed: Apr. 26, 2021. [Online]. Available: https://github.com/Lyken17/pytorch-OpCounter
[11] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, "PACT: Parameterized clipping activation for quantized neural networks," 2018, arXiv:1805.06085. [Online]. Available: https://arxiv.org/abs/1805.06085
[12] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2704-2713.
[13] S. Jung, C. Son, S. Lee, J. Son, J. J. Han, Y. Kwak, and C. Choi, "Learning to quantize deep networks by optimizing quantization intervals with task loss," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 4350-4359.
[14] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, "Learned step size quantization," 2019, arXiv:1902.08153. [Online]. Available: https://arxiv.org/abs/1902.08153
[15] S. R. Jain, A. Gural, M. Wu, and C. H. Dick, "Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks," in Proc. Mach. Learn. Syst., Mar. 2020, pp. 112-128.
[16] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and

The technical problem to be solved by the present invention is to provide a resource-efficient integer-arithmetic-only FPGA-based CNN accelerator for real-time facial emotion recognition, and a method of operating the same, that reduce memory footprint and computational complexity. The invention further provides a method and apparatus that propose a new hardware-friendly quantization method using only integer arithmetic through the CNN accelerator according to an embodiment of the present invention, and that generate the FERPlus-A dataset, a new training dataset created using various image processing algorithms to improve the generalization and classification performance of the CNN, and perform quantization after training.

In one aspect, the method proposed in the present invention for operating an FPGA-based CNN accelerator implemented on a heterogeneous SoC (System on Chip) platform that includes a CPU (Processing System; PS) and an FPGA (Programmable Logic; PL) comprises: generating, on the CPU, a training dataset that combines a plurality of image processing algorithms to improve CNN facial expression recognition performance, and converting parameters through floating-point training on the generated training dataset; extracting, on the CPU, quantized parameters through quantization-aware training of the parameters converted through floating-point training, and performing integer-arithmetic-only CNN reconstruction; and performing real-time facial emotion recognition using the integer-arithmetic-only CNN in the FPGA and CPU regions of the heterogeneous SoC platform.

The step of generating, on the CPU, a training dataset that combines a plurality of image processing algorithms to improve CNN facial expression recognition performance, and converting parameters through floating-point training on the generated training dataset, performs its operations through a basic computation block that includes a first computation module (FireA) and a second computation module (FireB). Each of FireA and FireB includes a squeeze layer, which reduces the number of channels of the feature map and thereby the number of operations in the subsequent expand layer, and an expand layer, which expands the channels again. To preserve the receptive field and minimize accuracy degradation, the squeeze layer of the first computation module and the squeeze layer of the second computation module have different kernel sizes.
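The effect of the squeeze layer on the weight count can be illustrated with a short calculation. The following is a minimal sketch only: the channel counts and the specific kernel sizes assigned to FireA and FireB are hypothetical values chosen for illustration, since this passage states only that the two squeeze layers use different kernel sizes.

```python
def conv_params(in_ch, out_ch, kernel):
    """Number of weights in a kernel x kernel convolution (bias omitted)."""
    return in_ch * out_ch * kernel * kernel


def fire_params(in_ch, squeeze_ch, expand_ch, squeeze_kernel, expand_kernel):
    """Weights in a squeeze layer followed by an expand layer."""
    squeeze = conv_params(in_ch, squeeze_ch, squeeze_kernel)
    expand = conv_params(squeeze_ch, expand_ch, expand_kernel)
    return squeeze + expand


# Hypothetical sizes, not taken from the patent.
IN_CH, SQUEEZE_CH, OUT_CH = 64, 16, 64

plain = conv_params(IN_CH, OUT_CH, 3)                   # single 3x3 convolution
fire_a = fire_params(IN_CH, SQUEEZE_CH, OUT_CH, 1, 3)   # assumed: 1x1 squeeze, 3x3 expand
fire_b = fire_params(IN_CH, SQUEEZE_CH, OUT_CH, 3, 1)   # assumed: 3x3 squeeze, 1x1 expand

print(plain, fire_a, fire_b)  # 36864 10240 10240
```

Under these assumed sizes, both module variants keep one 3x3 kernel, so the receptive field of a plain 3x3 convolution is preserved while the weight count drops to roughly 28% of the plain layer.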

To halve the resolution of the feature map and replace the max pooling layer, the stride of the basic computation block is set to 2. The fully connected layer at the end of the CNN, which classifies the extracted features and contains a plurality of parameters, is replaced with global average pooling and a convolution layer with a kernel size of 1, which reduces the number of parameters. The classifier of the CNN reduces the number of channels through the convolution layer to match the number of classes and then compresses the resolution of the feature map to a single pixel. A batch normalization layer is inserted between the convolution layer and the activation function to improve the convergence stability and performance of the CNN.
The step of extracting, on the CPU, quantized parameters through quantization-aware training of the parameters converted through floating-point training, and performing integer-arithmetic-only CNN reconstruction, uses a Log Level Threshold Quantization (LLTQ) method. Using the Learned Step size Quantization (LSQ) method, a trainable parameter is set as the scaling factor and is trained together with the other parameters, so that the quantization process requires no prior calibration of the scaling factor. Using the Trained Quantization Threshold (TQT) method, the trainable parameter is set to a logarithmic threshold (log₂t) and the scaling factor is mapped to a power-of-two term, so that multiplication and division by the scaling factor are replaced with shift operations. Symmetric quantization is used while performing Quantization-Aware Training (QAT) with the pre-trained floating-point parameters, which removes the computational overhead of the zero point that arises in uniform quantization. Layer-wise scaling is introduced so that all elements of a given parameter or input tensor are quantized with a single scaling factor.
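The combination of a logarithmic threshold, a power-of-two scaling factor, symmetric quantization, and one scaling factor per tensor can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented LLTQ procedure: the function names are hypothetical, the exponent rule (ceiling of log₂t minus the number of magnitude bits) follows the general TQT formulation, and the threshold is fixed by hand here rather than learned.

```python
import math


def power_of_two_exponent(log2_t, bits=8):
    """Exponent e such that the scaling factor is 2**e (assumed TQT-style rule)."""
    return math.ceil(log2_t) - (bits - 1)


def quantize_symmetric(values, log2_t, bits=8):
    """Quantize a whole tensor with one power-of-two scale and no zero point."""
    exponent = power_of_two_exponent(log2_t, bits)
    scale = 2.0 ** exponent
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = [max(q_min, min(q_max, round(v / scale))) for v in values]
    return q, exponent


def dequantize(q, exponent):
    """Map integers back to real values; used only to inspect the error."""
    return [v * 2.0 ** exponent for v in q]


# Hypothetical tensor and threshold t = 1.0 (log2_t = 0.0).
weights = [0.75, -0.5, 0.1, 1.2, -2.0]
q, exponent = quantize_symmetric(weights, log2_t=0.0)

print(q, exponent)  # [96, -64, 13, 127, -128] -7
```

Because the scale is exactly 2^-7, recovering a real value from an integer is a shift rather than a multiplication, and values outside the threshold (1.2 and -2.0 here) saturate at the ends of the integer range.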


In another aspect, in the FPGA-based CNN accelerator proposed in the present invention and implemented on a heterogeneous SoC (System on Chip) platform that includes a CPU (Processing System; PS) and an FPGA (Programmable Logic; PL), the CPU generates a training dataset that combines a plurality of image processing algorithms to improve CNN facial expression recognition performance, converts parameters through floating-point training on the generated training dataset, extracts quantized parameters through quantization-aware training of the parameters converted through floating-point training, and performs integer-arithmetic-only CNN reconstruction; real-time facial emotion recognition is then performed using the integer-arithmetic-only CNN in the FPGA and CPU regions of the heterogeneous SoC platform.

The resource-efficient integer-arithmetic-only FPGA-based CNN accelerator for real-time facial emotion recognition and the method of operating the same according to embodiments of the present invention can reduce memory footprint and computational complexity. In addition, the generalization and classification performance of the CNN can be improved by proposing a new hardware-friendly quantization method that uses only integer arithmetic through the CNN accelerator according to an embodiment of the present invention, generating the FERPlus-A dataset, a new training dataset created using various image processing algorithms, and performing quantization after training.

Figure 1 is a diagram illustrating a method of operating a resource-efficient integer-arithmetic-only FPGA-based CNN accelerator for real-time facial emotion recognition according to an embodiment of the present invention.
Figure 2 is a diagram illustrating a process of optimizing the memory usage and computational complexity of a lightweight CNN architecture according to an embodiment of the present invention.
Figure 3 is a diagram illustrating the uniform quantization process using the LLTQ quantization method according to an embodiment of the present invention, together with the LSQ and TQT quantization processes of the prior art.
Figure 4 is a diagram illustrating an integer scale conversion method according to an embodiment of the present invention.
Figure 5 shows the architecture of a resource-efficient integer-arithmetic-only FPGA-based CNN accelerator for real-time facial emotion recognition according to an embodiment of the present invention.
Figure 6 is a diagram illustrating the overall task scheduling process for an integer-arithmetic-only FPGA-based CNN accelerator according to an embodiment of the present invention.

Much recent research has addressed facial expression recognition using Convolutional Neural Networks (CNNs), which show excellent performance in computer vision. High classification accuracy has required CNN architectures with many parameters and high computational complexity. Such architectures, however, are not suitable for embedded systems with limited hardware resources.

The present invention proposes a lightweight CNN architecture optimized for embedded systems. The proposed CNN architecture has a small memory footprint and low computational complexity. It also proposes a new hardware-friendly quantization method that uses only integer arithmetic. The proposed hardware-friendly quantization method maps the scaling factor to a power-of-two term and replaces multiplication and division by the scaling factor with shift operations.
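How a power-of-two scaling factor turns rescaling into a shift can be shown with an integer-only dot product. The sketch below is an assumption-laden illustration rather than the accelerator's datapath: the exponents, the 8-bit output width, and the round-half-up rule are all hypothetical choices.

```python
def int_dot(q_x, q_w):
    """Integer multiply-accumulate over two quantized vectors."""
    return sum(a * b for a, b in zip(q_x, q_w))


def shift_requantize(acc, shift, bits=8):
    """Rescale an integer accumulator by 2**(-shift) using shifts only."""
    if shift > 0:
        acc = (acc + (1 << (shift - 1))) >> shift  # add half, then arithmetic right shift
    else:
        acc = acc << -shift
    q_min, q_max = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(q_min, min(q_max, acc))


# Hypothetical scale exponents: input 2**-7, weight 2**-7, output 2**-5.
E_X, E_W, E_Y = -7, -7, -5

q_x = [64, 32, -16]   # represents 0.5, 0.25, -0.125
q_w = [96, -64, 13]   # represents 0.75, -0.5, 0.1015625

acc = int_dot(q_x, q_w)       # accumulator carries scale 2**(E_X + E_W)
shift = E_Y - (E_X + E_W)     # number of bits to shift right
q_y = shift_requantize(acc, shift)

print(acc, shift, q_y)  # 3888 9 8
```

The result 8 at scale 2^-5 represents 0.25, close to the real-valued dot product 0.2373, and no multiplier or divider was needed for the rescaling step.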

In addition, the FERPlus-A dataset is generated to improve the generalization and classification performance of the CNN. FERPlus-A is a new training dataset created using various image processing algorithms. After the CNN is trained on the FERPlus-A dataset, quantization is performed.

According to an embodiment of the present invention, the size of the quantized CNN parameters is about 0.39 MB, and the operation count is about 28M integer operations (IOPs). The performance of the quantized CNN was evaluated on the FERPlus test dataset using only integer arithmetic; the classification accuracy was about 86.58%, which is higher than that of other lightweight CNNs in the prior art. The proposed CNN architecture, which uses only integer arithmetic, is implemented for real-time facial expression recognition on the Xilinx ZC706 SoC platform by applying a parallelization strategy and an efficient data caching strategy. The FPGA-based CNN accelerator implemented for real-time facial expression recognition achieves approximately 10 frames per second (FPS) at 250 MHz and consumes 2.3 W.
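The reported figures can be combined into a few derived quantities. The following back-of-the-envelope calculation uses only the approximate numbers stated above; the derived values are illustrative estimates and are not results reported in the source.

```python
IOPS_PER_FRAME = 28e6     # about 28M integer operations per inference (from the text)
FRAMES_PER_SECOND = 10    # approximate throughput at 250 MHz (from the text)
POWER_WATTS = 2.3         # reported power consumption (from the text)

# Sustained integer operations per second at the reported frame rate.
ops_per_second = IOPS_PER_FRAME * FRAMES_PER_SECOND

# Energy spent on one frame, in joules.
energy_per_frame_j = POWER_WATTS / FRAMES_PER_SECOND

# Giga integer operations per second delivered per watt.
giops_per_watt = ops_per_second / 1e9 / POWER_WATTS

print(f"{ops_per_second:.2e} IOP/s")
print(f"{energy_per_frame_j:.2f} J per frame")
print(f"{giops_per_watt:.3f} GIOPS/W")
```

These estimates work out to roughly 0.28 GIOP/s sustained, about 0.23 J per frame, and about 0.12 GIOPS/W.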

Among the facial expression recognition techniques that use lightweight CNNs, one prior art work constructed the FERPlus dataset, which addresses the problems of the FER2013 dataset through crowdsourcing and label distribution, and proposed four different training techniques using a lightweight VGG. Another prior art work extracted identity features and expression features through DeepID2 and ResNet, respectively, and classified expressions by combining the features output by each CNN. Another proposed transfer learning and ensemble methods that use parameters pre-trained on datasets commonly used for image classification. Another proposed training MobileNetv2 on a dataset that combines the original images with their Sobel gradient and Laplacian. Another used a training method that adds a center loss to the global loss to improve the discriminative power of the CNN.

Another prior art work constructed two custom CNNs using Gradient-weighted Class Activation Mapping (Grad-CAM). The same work also proposed a dual-integrated CNN that combines the features extracted by the two CNNs for facial expression recognition. Another proposed SHCNN, which mitigates overfitting on relatively small datasets and can recognize static expressions and micro-expressions simultaneously. Another introduced a method that uses Class Activation Mapping (CAM) on features extracted through DenseNet to quantify the contributions of different facial regions. Another created a new dataset that merges two similar expression classes into one, and proposed training DenseNet after applying face detection and face alignment.

However, these prior art approaches occupy a large amount of memory and suffer from high computational complexity. In some of them the CNNs have fewer parameters, but because two CNNs are used for inference they still occupy a large amount of memory and remain unsuitable for real-time processing. In others the CNN has few parameters and low computational complexity, but its facial expression recognition accuracy is very low. Even the prior art that uses a small number of parameters is unsuitable for real-time processing, because the dense connections characteristic of DenseNet require too much memory. Solving these problems requires a technique for designing a lightweight CNN architecture that optimizes memory footprint and computational complexity while maintaining facial expression recognition accuracy.

Quantization converts parameters trained in 32-bit floating-point (FP) into low-bit fixed-point (FX) or integer (INT) values. Applying quantization optimizes memory usage and reduces the amount of computation while allowing simpler hardware. Quantization methods fall into two groups. Post-Training Quantization (PTQ) methods take pre-trained parameters and minimize quantization error by applying calibration at inference time. Quantization-Aware Training (QAT) methods retrain the parameters while taking the effect of quantization into account.

One prior art proposed bias correction and cross-layer equalization to resolve biased weight errors and imbalances that arise during quantization. Another prior art introduced a method that minimizes quantization error using the minimum mean squared error between the floating-point value and the quantization error. Another prior art minimized quantization error by computing the optimal clipping value with Analytical Clipping for Integer Quantization (ACIQ). Another prior art proposed a distilled dataset built from the mean and variance of the batch normalization (BN) layers to obtain the optimal quantization range.

Because PTQ methods need no retraining, they save computing resources and optimization time and enable rapid deployment. However, the bias correction and cross-layer equalization method and the minimum mean squared error method require fine-tuning to minimize accuracy loss. The ACIQ method and the BN-based method must allocate a different number of bits to each channel to minimize accuracy degradation. Other prior art must obtain a separate scaling factor for each channel to maintain accuracy. In addition, all of these PTQ methods use floating-point parameters, so a hardware implementation requires dedicated operators.

Another prior art proposed minimizing quantization error by defining a Parameterized Clipping Activation (PACT) function to find an appropriate scaling factor. Another prior art defined the operation of a quantizer that performs matrix multiplication using integer arithmetic only and presented a layer fusion method. Another prior art proposed minimizing task loss using parameterized quantization intervals. Another prior art proposed learning the optimal quantization mapping by making the scaling factor a learnable parameter. Another prior art proposed mapping the scaling factor to a power-of-2 term using a scaling factor conversion equation.

QAT methods can achieve higher accuracy than PTQ methods because the quantization parameters are trained together with the other parameters. However, some prior art did not quantize the BN layer. In some prior art, the parameters required for quantization are floating-point, so a hardware implementation requires dedicated operators. Other approaches require the activation scaling factor to be calibrated before QAT starts. Addressing these problems requires a hardware-friendly quantization method that needs no pre-calibration step.

Performing facial expression recognition on a GPU gives the best performance in terms of throughput and speed. However, GPUs are difficult to use in embedded systems, which need solutions that reduce energy consumption and hardware resource usage. A Field Programmable Gate Array (FPGA) lets designers tailor the design to the application. Because it can be programmed for optimal operating speed at reasonable power consumption, it is well suited to real-time processing in embedded systems.

Because of these advantages, several FPGA-based CNN accelerators have been proposed for facial expression recognition. One prior art implemented a CNN accelerator on a Xilinx Zynq-XC7Z020 FPGA using High-Level Synthesis (HLS).

To reduce Block RAM (BRAM) usage, that work designed the accelerator to store each layer's inputs and results in DRAM and used VDMA for high-speed DRAM access. Another prior art designed a CNN accelerator with an efficient structure and data preprocessing method for facial emotion recognition and implemented it on an Altera DE-10 FPGA using Verilog HDL. In another prior art, the accelerator was built from processing engine cores on an Altera DE-10 FPGA to accelerate convolution operations. Another prior art proposed an Altera Cyclone-V FPGA accelerator, written in Verilog HDL, that applies a different degree of parallelism to each of the convolution layers that make up the CNN. In another prior art, a CNN accelerator was implemented on an Altera Cyclone-V FPGA using Verilog HDL, with a configurable convolution computing array designed to maximize DSP utilization on the FPGA.

Some prior art implemented simple CNN architectures consisting of fully connected, convolution, and pooling layers. The number of parameters and the computational complexity were low, but so was the accuracy, which makes these designs unsuitable for applications that need highly accurate facial expression recognition. In some prior art, the DeepID architecture was implemented on an FPGA but required so many hardware resources that it cannot be used in embedded systems. It is therefore necessary to design a low-power, low-cost FPGA-based CNN accelerator for real-time facial expression recognition that still maintains high recognition performance.

FIG. 1 is a diagram illustrating the operation method of a resource-efficient integer-arithmetic-only FPGA-based CNN accelerator for real-time facial emotion recognition according to an embodiment of the present invention.

The operation method of the resource-efficient integer-arithmetic-only FPGA-based CNN accelerator for real-time facial emotion recognition optimizes memory footprint and computational complexity while maintaining facial expression recognition accuracy. In addition, Log Level Threshold Quantization (LLTQ) is proposed to overcome the shortcomings of the prior art [1], [2]. A low-power, low-cost FPGA-based CNN accelerator is also proposed for real-time facial expression recognition.

The proposed operation method of the FPGA-based CNN accelerator, implemented on a heterogeneous SoC platform that includes a CPU (Processing System; PS) and an FPGA (Programmable Logic; PL), includes: a step (110) in which the CPU generates a training dataset that combines a plurality of image processing algorithms to improve CNN facial expression recognition performance and converts parameters through floating-point training on the generated training dataset; a step (120) in which the CPU extracts quantized parameters through quantization-aware training of the parameters converted by floating-point training and performs integer-arithmetic-only CNN reconstruction; and a step (130) of performing real-time facial emotion recognition using the integer-arithmetic-only CNN in the FPGA (PL) and CPU (PS) regions of the heterogeneous System on Chip (SoC) platform.

In step 110, the CPU generates a training dataset that combines a plurality of image processing algorithms to improve CNN facial expression recognition performance (111), converts parameters through floating-point training on the generated training dataset using the new lightweight CNN architecture (112) according to an embodiment of the present invention (113), and generates a PTH file (114). The PTH file according to an embodiment of the present invention contains the floating-point trained parameters.

In step 110, operations may be performed through a basic computation block that includes a first computation module (FireA) and a second computation module (FireB). Each of the first computation module (FireA) and the second computation module (FireB) includes a squeeze layer, which reduces the number of channels in the feature map and thereby the number of operations in the subsequent expansion layer, and an expansion layer, which expands the channels again. To preserve the receptive field and minimize accuracy degradation, the squeeze layer of the first computation module and the squeeze layer of the second computation module have different kernel sizes.

According to an embodiment of the present invention, the stride of the basic computation block may be set to 2 so that it halves the resolution of the feature map and replaces the max pooling layer. In addition, the number of parameters may be reduced by replacing the fully connected layer, which classifies the features extracted at the end of the CNN and contains many parameters, with a convolution layer of kernel size 1 and global average pooling.

According to an embodiment of the present invention, the classifier of the CNN uses a convolution layer to reduce the number of channels to match the number of classes and then compresses the resolution of the feature map to a single pixel. A batch normalization layer may be inserted between each convolution layer and its activation function to improve the convergence stability and performance of the CNN.
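
The parameter saving from replacing the fully connected classifier can be illustrated with a small count. This is a minimal sketch: the feature map size (128 channels at 4 x 4) and the seven classes are hypothetical values chosen only for the arithmetic, and the helper names are not from the patent.

```python
def fc_params(channels, height, width, classes):
    """Weights plus biases of a fully connected layer on a flattened feature map."""
    return channels * height * width * classes + classes


def conv1x1_gap_params(channels, classes):
    """Weights plus biases of a 1x1 convolution; global average pooling adds none."""
    return channels * classes + classes


def global_average_pool(feature_map):
    """Compress each channel (a list of rows) to a single value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


# Hypothetical final feature map: 128 channels, 4x4 resolution, 7 classes.
fc_count = fc_params(128, 4, 4, 7)
conv_count = conv1x1_gap_params(128, 7)

# Two channels of a 2x2 map are reduced to one pixel each.
pooled = global_average_pool([[[1, 2], [3, 4]], [[0, 0], [0, 8]]])
```

Under these assumed sizes the 1x1 convolution needs 16 times fewer weights than the fully connected layer, because its weight count no longer depends on the spatial resolution.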

In step 120, the CPU performs quantization-aware training (121) on the parameters converted by floating-point training, extracts the quantized parameters (122), performs integer-arithmetic-only CNN reconstruction (123), and generates a BIN file (124). The BIN file according to an embodiment of the present invention contains the serialized integer parameters.

The Log Level Threshold Quantization (LLTQ) method according to an embodiment of the present invention combines the following elements. Using the Learned Step Size Quantization (LSQ) method, a trainable parameter is set as the scaling factor and trained together with the other parameters, so the quantization process needs no pre-calibration of the scaling factor. Using the Trained Quantization Threshold (TQT) method, the trainable parameter is set to a log threshold (log₂ t) and the scaling factor is mapped to a power-of-2 term, so multiplication and division by the scaling factor are replaced with shift operations. When quantization-aware training (QAT) is performed with pre-trained floating-point parameters, symmetric quantization removes the computational overhead of the zero point that arises in uniform quantization, and layer-wise scaling quantizes all elements of a given parameter or input tensor with a single scaling factor.

In step 130, real-time facial emotion recognition is performed using the integer-arithmetic-only CNN in the FPGA (Programmable Logic; PL) region (131) and the CPU (Processing System; PS) region (132) of the heterogeneous System on Chip (SoC) platform.

The FPGA region (131) according to an embodiment of the present invention may use an XSA file (143). In the XSA (Xilinx Support Archive) file generation process (140), the file may be generated through an HLS compiler (141) and the CNN accelerator logic (142).

The CPU region (132) according to an embodiment of the present invention may use an ELF file (153). In the ELF (Executable and Linkable Format) file generation process (150), the file may be generated through a standard C/C++ compiler (151) and a device driver (152).

In this way, the CNN accelerator logic (142) and the device driver (152) according to an embodiment of the present invention are implemented in the PL and PS regions of the heterogeneous SoC platform, respectively, for the real-time facial expression recognition application.

FIG. 2 is a diagram illustrating the process of optimizing the memory usage and computational complexity of the lightweight CNN architecture according to an embodiment of the present invention.

FIG. 2 shows the lightweight CNN architecture according to an embodiment of the present invention, which consists of 21 convolution layers and one global average pooling layer. Each computation block contains two basic computation modules, FireA and FireB. The parameters of a computation block, namely the number of filters in the squeeze layer (s) and the number of filters in the expansion layer (e), double from one computation block to the next. The CNN according to an embodiment of the present invention was designed by applying the following techniques: basic computation modules; an all-convolutional, regular network; and power-of-2 resolutions with output channels in multiples of 4.

First, the basic computation module according to an embodiment of the present invention is described.

The lightweight CNN according to an embodiment of the present invention adapts the fire module proposed in SqueezeNet. The fire module consists of two types of layers. The squeeze layer (SQ) reduces the number of channels in the feature map and thereby the number of operations in the subsequent expansion layer. The expansion layer expands the channels again and consists of two layers: one with a kernel size of 1 (EX1) and one with a kernel size of 3 (EX3). The output feature map of SQ is split into two paths that feed EX1 and EX3, respectively. The output feature maps of EX1 and EX3 are then merged by channel-wise concatenation and passed to the next layer.

The lightweight CNN according to an embodiment of the present invention uses a modified fire module as its basic computation module, in two variants (FireA and FireB). FireA and FireB have the same expansion layer configuration, but the kernel size of SQ is set to 3 in FireA and to 1 in FireB. The kernel size is set to 3 in FireA to maintain an adequate receptive field and thus minimize accuracy degradation.
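
The cost of the larger squeeze kernel in FireA can be estimated with a simple parameter count. This is a rough sketch under stated assumptions: the input channel count and the values of s and e are hypothetical, e is taken to be the filter count of each expansion branch (EX1 and EX3), and every convolution is assumed to carry a bias.

```python
def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a kernel x kernel convolution."""
    return c_in * c_out * kernel * kernel + c_out


def fire_params(c_in, s, e, squeeze_kernel):
    """Parameter count of one fire-style module.

    squeeze_kernel is 3 for FireA and 1 for FireB, as described in the text.
    Assumption: EX1 and EX3 each have e filters.
    """
    squeeze = conv_params(c_in, s, squeeze_kernel)
    ex1 = conv_params(s, e, 1)
    ex3 = conv_params(s, e, 3)
    return squeeze + ex1 + ex3


# Hypothetical sizes: 32 input channels, s = 8, e = 16.
fire_a = fire_params(c_in=32, s=8, e=16, squeeze_kernel=3)
fire_b = fire_params(c_in=32, s=8, e=16, squeeze_kernel=1)
```

With these assumed sizes the only difference between the two counts is the squeeze layer, which is why using the 3 x 3 squeeze kernel in just one of the two modules keeps the block small.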

Next, the all-convolutional, regular network according to an embodiment of the present invention is described.

The lightweight CNN according to an embodiment of the present invention uses only convolution layers with a kernel size of 3 or 1, so a hardware implementation does not need acceleration schemes for any other kernel size. A max pooling layer halves the resolution of the feature map and compresses its information. Because it has to search for the maximum value in the feature map, it can become a bottleneck when implemented in hardware. The proposed CNN instead sets the stride of FireA to 2, which halves the resolution of the feature map and replaces the max pooling layer.
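
The halving effect of a stride-2 convolution follows from the standard output size formula. The sketch below assumes a padding of 1 for the 3 x 3 kernel and a hypothetical 64-pixel input; neither value is given in the text.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output width or height of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


# A 3x3 convolution with stride 2 and padding 1 halves the resolution ...
strided_conv = conv_out_size(64, kernel=3, stride=2, padding=1)

# ... exactly as a 2x2 max pooling layer with stride 2 would.
max_pool = conv_out_size(64, kernel=2, stride=2, padding=0)

# The same convolution with stride 1 keeps the resolution unchanged.
same_conv = conv_out_size(64, kernel=3, stride=1, padding=1)
```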

A fully connected (FC) layer classifies the features extracted at the very end of a CNN and contains many parameters. According to an embodiment of the present invention, the number of parameters is reduced by replacing the FC layer with a convolution layer of kernel size 1 and global average pooling. The resulting classifier uses the convolution layer to reduce the number of channels to match the number of classes and then compresses the resolution of the feature map to a single pixel. The activation function of every layer is ReLU, which is very efficient in hardware because it only needs to check the sign bit. In this way, every layer of the CNN is a convolution layer and the network is highly regular. In addition, a BN layer was inserted between each convolution layer and its activation function to improve the convergence stability and performance of the CNN. Because a BN layer can be merged into the preceding convolution layer through BN layer fusion, the additional operations and parameters it introduces can be ignored in a hardware implementation.
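
BN layer fusion can be checked numerically for a single output channel. The sketch below uses the commonly used folding formula rather than anything specific to the patent, and all numeric values are made up for the check.

```python
import math


def fuse_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the BN statistics of one output channel into its conv weights and bias."""
    scale = gamma / math.sqrt(var + eps)
    fused_weights = [w * scale for w in weights]
    fused_bias = (bias - mean) * scale + beta
    return fused_weights, fused_bias


def conv_dot(weights, bias, inputs):
    """One output value of a convolution: dot product plus bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization of a single value."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


# Made-up values for one channel and one receptive field.
weights, bias, inputs = [0.5, -1.0, 2.0], 0.1, [1.0, 2.0, 3.0]
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0

reference = batch_norm(conv_dot(weights, bias, inputs), gamma, beta, mean, var)
fused_weights, fused_bias = fuse_bn(weights, bias, gamma, beta, mean, var)
fused = conv_dot(fused_weights, fused_bias, inputs)
```

The fused convolution reproduces the convolution followed by BN up to floating-point rounding, so no separate BN operation remains at inference time.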

Next, the power-of-2 resolutions and multiple-of-4 output channels according to an embodiment of the present invention are described.

According to an embodiment of the present invention, the width and height of every feature map are powers of 2, so the multiplication and division operations in the feature map address calculation can be replaced with shift operations. In addition, the number of output channels of every convolution layer is set to a multiple of 4 for efficient parallel multiply-accumulate (MAC) hardware operation.
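
The address calculation with power-of-2 dimensions can be sketched as follows. The channel-major memory layout and the function names are assumptions made for illustration; the patent text does not specify the layout.

```python
def feature_map_address(c, y, x, log2_h, log2_w):
    """Linear offset of element (c, y, x) using shifts instead of multiplies.

    Assumes a channel-major layout with height 2**log2_h and width 2**log2_w.
    """
    return (c << (log2_h + log2_w)) + (y << log2_w) + x


def split_address(addr, log2_h, log2_w):
    """Recover (c, y, x) with shifts and masks instead of division and modulo."""
    x = addr & ((1 << log2_w) - 1)
    y = (addr >> log2_w) & ((1 << log2_h) - 1)
    c = addr >> (log2_h + log2_w)
    return c, y, x


# Hypothetical 16x16 feature map: log2(16) = 4 for both dimensions.
addr = feature_map_address(c=3, y=5, x=7, log2_h=4, log2_w=4)
multiply_based = 3 * 16 * 16 + 5 * 16 + 7
coords = split_address(addr, log2_h=4, log2_w=4)
```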

FIG. 3 is a diagram illustrating the uniform quantization process using the LLTQ quantization method according to an embodiment of the present invention, together with the prior-art LSQ and TQT quantization processes.

The Log Level Threshold Quantization (LLTQ) technique according to an embodiment of the present invention is described below.

The Learned Step Size Quantization (LSQ) method [1] and the Trained Quantization Threshold (TQT) method [2] both use uniform quantization. A CNN with uniform quantizers inserted performs quantization-aware training (QAT) starting from pre-trained floating-point parameters.

FIG. 3 shows the structure of a convolution layer (330) into which a uniform quantizer (320) is inserted and the four-step operation of the quantizer (310). In the first step, the scale process, the quantizer maps the range of real values onto the integer range of the specified bit width. The second step, the round process, converts every real value mapped into that range to an integer. The third step is the clamp process, which clips elements that fall outside the quantization range so that they fit the quantization levels. When the quantization bit width is n, the value after the clamp process lies within Bound, as given in equation (1).

(1)

The fourth step, the de-quantization (de-quant) process, converts the values mapped to the integer range back into floating-point values that carry the effect of quantization. The quantized integers are converted back to floating-point because most existing GPU-based deep learning frameworks are optimized for floating-point training. The overall equation for the forward path can be written as follows.

(2)

Here, x_f is the floating-point input to the quantizer and x_d is the output of the quantizer after the de-quantization process. The clamp(x, Bound) function is the third step of the uniform quantizer and clips the rounded value x so that it fits within Bound. The round(·) function is the second step of the uniform quantizer and rounds a real value to the nearest integer. The discontinuity introduced by rounding is handled with the Straight-Through Estimator (STE) [8].
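
The four steps of the forward path can be sketched in a few lines. This is an illustrative reading of the description, not the patent's equations: the signed and unsigned ranges in `bound` are the usual n-bit integer ranges and are assumed here, and the name `scale` for the scaling factor is introduced for the example.

```python
def bound(n_bits, signed=True):
    """Assumed integer range for n-bit quantization."""
    if signed:
        return -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    return 0, (1 << n_bits) - 1


def fake_quantize(x_f, scale, n_bits, signed=True):
    """Scale, round, clamp, then de-quantize a floating-point value x_f."""
    low, high = bound(n_bits, signed)
    scaled = x_f / scale                     # step 1: scale
    rounded = round(scaled)                  # step 2: round (Python rounds ties to even)
    clamped = max(low, min(high, rounded))   # step 3: clamp to Bound
    return clamped * scale                   # step 4: de-quant, giving x_d


inside = fake_quantize(0.30, scale=0.125, n_bits=8)      # 2.4 rounds to 2
too_large = fake_quantize(100.0, scale=0.125, n_bits=8)  # clipped at the upper bound
too_small = fake_quantize(-100.0, scale=0.125, n_bits=8) # clipped at the lower bound
```

During training, the gradient of the rounding step would be passed through unchanged by the STE; the sketch covers only the forward computation.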

The LSQ method [1] sets a trainable parameter as the scaling factor and trains it together with the other parameters, so the scaling factor does not need to be calibrated during the quantization process. This makes LSQ a good fit for the proposed lightweight CNN, but because its scaling factor is a floating-point value, a hardware implementation requires dedicated operators. The TQT method [2] sets the trainable parameter to a log threshold (log₂ t) and maps the scaling factor to a power-of-2 term as follows.

(3)

Here, q_level is the maximum value of Bound in equation (1) plus 1.

Because TQT maps the scaling factor to a power-of-2 term, multiplication and division by the scaling factor can be replaced with shift operations. However, the activation scaling factors must be calibrated before retraining starts. The present invention therefore proposes a new hardware-friendly quantization method that resolves the problems of the earlier LSQ and TQT methods.
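
A power-of-2 scaling factor of the kind described for TQT can be sketched as below. Equation (3) itself is not reproduced in the text, so the form used here, two raised to the ceiling of the log threshold and divided by q_level, is an assumption based on how TQT is commonly described; `tqt_scale` is a hypothetical helper name.

```python
import math


def tqt_scale(log2_t, n_bits, signed=True):
    """Assumed TQT-style mapping from a log threshold to a power-of-2 scaling factor."""
    # q_level is the maximum of Bound plus 1, as stated in the text.
    q_level = (1 << (n_bits - 1)) if signed else (1 << n_bits)
    return 2.0 ** math.ceil(log2_t) / q_level


scale = tqt_scale(log2_t=1.3, n_bits=8)  # 2**2 / 128
shift = -int(math.log2(scale))           # dividing by scale equals shifting left by this amount
```

Because the result is an exact power of 2, applying or removing the scaling factor in hardware needs only the integer `shift`.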

The proposed LLTQ maps the trainable parameter directly to a power-of-2 scaling factor. As in Learned Step Size Quantization (LSQ), the trainable parameter defines the scaling factor and is trained together with the other parameters, so, unlike TQT, the quantization process needs no pre-calibration of the scaling factor. As in the Trained Quantization Threshold (TQT) method, where the trainable parameter is a log threshold (log₂ t), the scaling factor is mapped to a power-of-2 term, so multiplication and division by the scaling factor are replaced with shift operations. When quantization-aware training (QAT) is performed with pre-trained floating-point parameters, symmetric quantization removes the computational overhead of the zero point that arises in uniform quantization, and layer-wise scaling quantizes all elements of a given parameter or input tensor with a single scaling factor.

The performance of the LLTQ method according to an embodiment of the present invention is evaluated using integer arithmetic only, through an integer scale conversion method. LLTQ sets the trainable parameter to a level threshold (level_th) and maps the scaling factor to a power-of-2 term using the following equation.

(4)

Here, q_level is the same as in equation (3). Computing the scaling factor itself requires many operations. However, when inference is performed with a CNN reconstructed to use integer arithmetic only, all that is needed is the shift amount of the scaling factor mapped to a power-of-2 term.
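
How a shift amount can stand in for the scaling factor at inference time can be sketched with an integer multiply-accumulate followed by a rounding right shift. This is an illustrative assumption about the integer-only data path, not the accelerator's actual logic; the rounding mode, the 8-bit output range, and the name `requantize_by_shift` are chosen for the example.

```python
def requantize_by_shift(acc, shift, n_bits=8):
    """Rescale an integer accumulator by 2**-shift, then clamp to the n-bit range."""
    low, high = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    if shift > 0:
        acc = (acc + (1 << (shift - 1))) >> shift  # right shift with round-half-up
    else:
        acc = acc << (-shift)
    return max(low, min(high, acc))


# Integer MAC over made-up quantized weights and activations.
q_weights = [3, -2, 5]
q_inputs = [10, 20, 30]
accumulator = sum(w * x for w, x in zip(q_weights, q_inputs))

output = requantize_by_shift(accumulator, shift=2)
saturated = requantize_by_shift(5000, shift=4)
```

No floating-point multiplier or divider appears anywhere in this path, which is the property the integer-arithmetic-only reconstruction relies on.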

In addition, several hardware-friendly quantization schemes are applied in LLTQ. Symmetric quantization removes the computational overhead of the zero point that arises in uniform quantization. Layer-wise scaling quantizes all elements of a given parameter or input tensor with a single scaling factor. BN layer fusion, which merges the parameters of the BN layer into the convolution layer, is applied, so the BN layer itself does not need to be quantized. Removing the multiplication and division operations that the BN layer would otherwise require reduces the number of operations and the inference time. In the fire module, the output of the squeeze layer is the input to both expansion layers. In floating-point, the two expansion layers receive identical inputs. In QAT, however, each convolution layer needs its own scaling factor, so the squeeze layer output would have to be computed separately for each expansion layer, which can become a bottleneck when CNN inference runs in hardware. To solve this problem, QAT was performed with the quantizers of the two expansion layers merged into a single shared quantizer.
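
The zero-point overhead that symmetric quantization avoids can be made concrete by expanding an asymmetric dot product. The sketch below uses made-up integer values; `z_w` and `z_x` are the zero points of the weights and activations, names introduced here for illustration.

```python
def asymmetric_dot(q_w, q_x, z_w, z_x):
    """Dot product of (q_w - z_w) and (q_x - z_x), expanded into its four terms."""
    n = len(q_w)
    main = sum(w * x for w, x in zip(q_w, q_x))
    # The three correction terms exist only because the zero points are non-zero.
    return main - z_x * sum(q_w) - z_w * sum(q_x) + n * z_w * z_x


def symmetric_dot(q_w, q_x):
    """With zero points fixed at 0, only the main MAC term remains."""
    return sum(w * x for w, x in zip(q_w, q_x))


q_w, q_x = [3, -2, 5], [10, 20, 30]
direct = sum((w - 1) * (x - 2) for w, x in zip(q_w, q_x))
expanded = asymmetric_dot(q_w, q_x, z_w=1, z_x=2)
plain = symmetric_dot(q_w, q_x)
```

Setting both zero points to 0 collapses the expansion to the plain multiply-accumulate, so the hardware needs no extra adders or stored sums for correction terms.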

Regarding log-level threshold quantization (LLTQ) according to an embodiment of the present invention: the learned step size quantization (LSQ) method [1] and the trained quantization thresholds (TQT) method [2] both use uniform quantization. A CNN with uniform quantizers inserted performs quantization-aware training (QAT) starting from pre-trained floating-point parameters.

Figure 3 shows the structure of a convolution layer with uniform quantizers inserted and the four-stage operation of the quantizer. In the first stage, the scale process, the quantizer maps the range of real values onto the integer range of the specified bit width. The second stage, the round process, converts every real value mapped into that range to an integer. The third stage is the clamp process, which clips elements that exceed the quantization range so that they fit the quantization levels. When the number of quantization bits is n, the value after the clamp process lies within Bound, as shown in Equation (1).

The fourth stage, the de-quant process, converts the values mapped to the integer range back into floating-point values that carry the effect of quantization. The quantized integers are converted back to floating point because most existing GPU-based deep learning frameworks are optimized for floating-point training. The complete forward path can be described as follows.

Here, x_f is the floating-point input of the quantizer and x_d is the quantizer output after the de-quant process. The round(·) function is the second stage of the uniform quantizer and rounds a real value to the nearest integer. The clamp(x, Bound) function is the third stage and clips the rounded value x so that it lies within Bound. The discontinuity introduced by the round process was handled with the straight-through estimator (STE) [8].
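The four stages described above (scale, round, clamp, de-quant) can be sketched for a single value as follows. This is a minimal illustration, not the patent's implementation: the signed form of Bound, the round-half-up rule, and the function names `bound`, `fake_quantize`, and `ste_input_grad` are assumptions, since Equations (1) and (2) are not reproduced in this text.

```python
import math


def bound(n_bits):
    """Assumed signed range for n-bit symmetric quantization."""
    return -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1


def fake_quantize(x_f, scale, n_bits=8):
    """Return (integer code, de-quantized value) for one real input x_f."""
    lo, hi = bound(n_bits)
    x_scaled = x_f / scale                 # stage 1: scale
    x_rounded = math.floor(x_scaled + 0.5)  # stage 2: round (half-up, assumed)
    x_int = max(lo, min(hi, x_rounded))    # stage 3: clamp to Bound
    x_d = x_int * scale                    # stage 4: de-quant
    return x_int, x_d


def ste_input_grad(x_f, scale, n_bits=8):
    """Straight-through estimate of d(x_d)/d(x_f): rounding is treated as
    the identity, so the gradient is 1 inside Bound and 0 where clamped."""
    lo, hi = bound(n_bits)
    return 1.0 if lo <= x_f / scale <= hi else 0.0
```

With a scale of 0.125 and 8 bits, an input of 0.30 maps to the integer code 2 and comes back as 0.25, while an input of 100.0 saturates at the upper bound.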

The LSQ method [1] sets the scaling factor as a trainable parameter and trains it together with the other parameters. LSQ therefore does not need to convert the scaling factor during the quantization process. It is well known for use with lightweight CNNs such as the one described in Section III-B, but it has the drawback of requiring dedicated operators when implemented in hardware, because its scaling factor is a floating-point value. The TQT method [2] sets the log threshold (log₂ t) as the trainable parameter and maps the scaling factor to a power-of-2 term as follows.

Here, q_level is the maximum value of Bound in Equation (1) plus 1.

Because TQT maps the scaling factor to a power-of-2 term, multiplications and divisions by the scaling factor can be replaced with shift operations. However, the activation scaling factors must be calibrated before retraining begins. This paper therefore proposes a new hardware-friendly quantization method that addresses the problems of the previous studies [1], [2]. The proposed LLTQ can map the trainable parameter directly to a power-of-2 factor. Unlike TQT [2], LLTQ can also initialize the activation scaling factors from the activation statistics of the first batch of the first epoch, so no pre-calibration process is needed. Furthermore, because the scaling factor is mapped to a power-of-2 term, the multiplications and divisions in the scale and de-quant processes can all be replaced with shift operations. The performance of LLTQ is evaluated using integer arithmetic only, through the integer scale conversion method described in Section III-C-2. LLTQ sets the level threshold (level_th) as the trainable parameter and maps the scaling factor to a power-of-2 term using the following equation.

Here, q_level is the same as in Equation (3). Computing the scaling factor requires many operations. However, when inference is performed with a CNN reconstructed to use only integer arithmetic, only the shift amount of the scaling factor mapped to a power-of-2 term is needed.
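To illustrate why only a shift amount is needed at inference time, the sketch below derives the exponent of a power-of-2 scaling factor and applies it with shifts. It assumes the TQT-style mapping scale = 2^ceil(log₂ t) / q_level with q_level = 2^(n-1) for signed n-bit values; the LLTQ mapping of Equation (4) is not reproduced in this text and is not modeled here. The names `pow2_scale_exponent` and `mul_by_pow2_scale` are hypothetical.

```python
import math


def pow2_scale_exponent(log2_t, n_bits=8):
    """Return e such that scale = 2**e, assuming
    scale = 2**ceil(log2_t) / q_level and q_level = 2**(n_bits - 1)."""
    return math.ceil(log2_t) - (n_bits - 1)


def mul_by_pow2_scale(v_int, exponent):
    """Multiply an integer by 2**exponent using shifts only.
    A negative exponent becomes a right shift, which floors the result."""
    if exponent >= 0:
        return v_int << exponent
    return v_int >> -exponent
```

For a trained log threshold of 1.3 and 8 bits, the exponent is -5, so multiplying by the scaling factor is a right shift by 5 and dividing by it is a left shift by 5. Only the integer 5 has to be sent to the hardware.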

In addition, several hardware-friendly quantization schemes were applied to LLTQ. Symmetric quantization eliminates the zero-point computation overhead that arises in uniform quantization. Layer-wise scaling allows all elements of a given parameter or input tensor to be quantized with a single scaling factor. Batch normalization (BN) layer fusion, which merges the parameters of the convolution layer and the BN layer, was also applied. With this scheme, no quantization is needed in the BN layer, and removing the multiplications and divisions required by the BN layer reduces both the number of operations and the inference time. In the fire module, the output of the squeeze layer feeds the two expansion layers. In floating point, the inputs of the two expansion layers are identical. In QAT, however, each convolution layer needs its own scaling factor, so the squeeze layer would have to compute a separate output for each expansion layer. This can become a bottleneck when CNN inference is performed in hardware. To solve this problem, QAT was performed with the quantizers of the two expansion layers merged into a single shared quantizer.
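BN layer fusion can be illustrated with the standard folding identity for one output channel. The text does not give the formula used in the embodiment, so the sketch below uses the commonly used form, and the names `fuse_conv_bn` and `conv_then_bn` are hypothetical.

```python
import math


def fuse_conv_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the BN parameters of one output channel into that channel's
    convolution weights and bias (standard identity, assumed here)."""
    k = gamma / math.sqrt(var + eps)
    fused_weights = [w * k for w in weights]
    fused_bias = (bias - mean) * k + beta
    return fused_weights, fused_bias


def conv_then_bn(x, weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Reference path: dot product plus bias, followed by a separate BN."""
    y = sum(w * xi for w, xi in zip(weights, x)) + bias
    return gamma * (y - mean) / math.sqrt(var + eps) + beta
```

After folding, a single dot product with `fused_weights` plus `fused_bias` reproduces the output of the separate convolution and BN steps, so the BN layer needs neither its own quantizer nor its own multiply and divide at inference time.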

Figure 4 is a diagram for explaining an integer scale conversion method according to an embodiment of the present invention.

A CNN using the parameters produced by the proposed quantization method was reconstructed to use integer arithmetic only through the proposed integer scale conversion method. In Figures 4(b) and 4(d), the arrows labeled W_k^f, B_k^f, W_{k+1}^f, B_{k+1}^f, W_{k+2}^f, and B_{k+2}^f denote the flow of floating-point data, and the remaining arrows denote the flow of integer data. The superscript of each symbol indicates the data type: f denotes floating-point data and i denotes integer data. The subscripts k, k+1, k+2, and k+3 denote the layer index. Q_a is the integer-arithmetic-only (IAO) activation quantizer, Q_b is the IAO bias quantizer, and Q_w is the IAO weight quantizer. S_{w,k}, S_{w,k+1}, and S_{w,k+2} are the weight scaling factors; S_{b,k}, S_{b,k+1}, and S_{b,k+2} are the bias scaling factors; and S_{a,k}, S_{a,k+1}, S_{a,k+2}, and S_{a,k+3} are the activation scaling factors. The labels A, W, B, and C on the data-flow arrows denote the layer input, weight, bias, and activation function output, respectively. The part marked Offline in Figure 4 is the process of converting the floating-point parameters that carry the effect of quantization into integers. Because this process is performed before inference, it does not affect inference time. The conversion of these floating-point parameters into integer parameters is expressed as follows.

(5)

(6)

The input of the convolution layer is assumed to be an integer quantized to the bit precision specified in QAT. As shown in Figures 4(a) and 4(b), a standard convolution layer performs the MAC operation on integer-scale weights and integer-scale activations and adds the integer-scale bias to the MAC result. As Equation (6) shows, the bias must be converted to the same scale as the MAC result. This is called bias upscaling; without it, errors arise because the bias and the MAC result have different bit precisions. After the bias is added, the data passes through the ReLU activation function, whose result can be expressed as Equation (7).

(7)
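The integer MAC, bias upscaling, and ReLU steps can be sketched as below. Equations (5) to (7) are not reproduced in this text, so this is an assumed form: every scaling factor is taken to be a power of two, written as 2^(-shift), and the names `upscale_bias`, `conv_relu_int`, `a_shift`, `w_shift`, and `b_shift` are hypothetical.

```python
def upscale_bias(b_int, a_shift, w_shift, b_shift):
    """Bring an integer bias from scale 2**-b_shift to the scale of the
    MAC result, 2**-(a_shift + w_shift), with a left shift."""
    delta = a_shift + w_shift - b_shift
    if delta < 0:
        raise ValueError("sketch assumes the MAC scale is at least as fine as the bias scale")
    return b_int << delta


def conv_relu_int(a_int, w_int, b_int, a_shift, w_shift, b_shift):
    """Integer-only MAC over one receptive field, then bias add and ReLU."""
    acc = sum(a * w for a, w in zip(a_int, w_int))          # integer MAC
    acc += upscale_bias(b_int, a_shift, w_shift, b_shift)   # bias upscaling
    return max(acc, 0)                                      # ReLU
```

With activations at scale 2^-4, weights at 2^-6, and the bias at 2^-8, the MAC result sits at 2^-10, so the bias is shifted left by 2 bits before it is added. Adding it unshifted would mix two different scales.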

Here, C_k^i is the output of ReLU. Before it is passed to the next layer, it must be converted from the bit precision of the current layer's MAC operation to the bit precision specified in QAT. This is called activation downscaling: it cancels the effect of the current layer's scaling factors and rescales the value to the input precision of the next layer. Activation downscaling can be expressed as Equation (8).

(8)
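Activation downscaling can be sketched as a right shift followed by saturation. As Equation (8) is not reproduced in this text, the form below is an assumption: scaling factors are powers of two written as 2^(-shift), the shift truncates rather than rounds, and the name `downscale_activation` and its parameters are hypothetical.

```python
def downscale_activation(c_int, a_shift, w_shift, next_a_shift, n_bits=8):
    """Rescale a non-negative ReLU output from the MAC scale
    2**-(a_shift + w_shift) to the next layer's input scale 2**-next_a_shift."""
    shift = a_shift + w_shift - next_a_shift
    if shift < 0:
        raise ValueError("sketch assumes the next input scale is coarser than the MAC scale")
    hi = 2 ** (n_bits - 1) - 1   # assumed signed n-bit storage
    return min(c_int >> shift, hi)
```

For example, a ReLU output of 32 at scale 2^-10 becomes 1 at the next layer's input scale of 2^-5, and both represent the real value 0.03125. Values too large for the target bit width saturate at the upper bound.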

In the fire module shown in Figures 4(c) and 4(d), the squeeze layer behaves the same as a standard convolution layer, to which Equations (5), (6), and (7) apply. Because of the shared quantizer, a single scaling factor suffices, so both expansion layers can use the same input without a separate input being computed for each. The input of each expansion layer can be expressed as Equation (9).

(9)

After activation downscaling using Equation (8), the outputs of the expansion layers must be concatenated channel-wise. Channel-wise concatenation does not change the values, so no additional scaling is needed. This process is shown in Equation (10).

(10)

Figure 5 shows the architecture of a resource-efficient integer-arithmetic-only FPGA-based CNN accelerator for real-time facial emotion recognition according to an embodiment of the present invention.

The CNN accelerator for real-time facial expression recognition according to an embodiment of the present invention was implemented on a heterogeneous SoC platform consisting of a CPU (Processing System; PS) 510 and an FPGA (Programmable Logic; PL) 520. The CPU 510 allocates memory regions and sends the configuration of each layer (kernel size, stride, padding, input channels, output channels, shift amount, and addresses) to the accelerator IP in the FPGA 520 over the AXI_LITE bus. Data for the MAC operations, such as input images, feature maps, and parameters, are transferred from DRAM to the accelerator in the FPGA 520 over the AXI_MEM bus. All layers of the CNN share the multiply-accumulate unit (MACU) and the post-processing module (PPM) to perform their operations. The results are written from the accelerator in the FPGA 520 back to DRAM over the AXI_MEM bus.

A general-purpose embedded CPU is very slow at performing a convolution operation, which consists of six nested loops. To solve this problem, a parallel processing module was implemented in the PL region 520 of the SoC platform to accelerate CNN inference. To implement the parallel processing module, the parts of the six nested loops to which a parallelization strategy can be applied were identified. Of the several possible parallelization strategies, two were used to implement loop-level parallelism efficiently on an embedded device [9].

First, an intra-convolution parallelization strategy was applied. Intra-convolution parallelism was achieved by fully unrolling the loops that correspond to the internal operation of the MACU shown in Figure 5. The fully unrolled loops were implemented as a MACU consisting of 9 parallel multipliers and 8 adders for the 3×3 convolution operation. Second, an inter-feature-map parallelization strategy was applied. The MACU shown in Figure 5 was replicated to achieve inter-feature-map parallelism by partially unrolling the loop over the channels of the output feature map. Because the unrolling factor of this partially unrolled loop is 4, the accelerator can execute the MAC operations for four different output feature map channels in parallel. A 1×1 convolution is performed by reusing part of the MACU that is fully unrolled for the 3×3 convolution.
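The six nested loops and the two unrolled levels can be pictured with the software sketch below. The loop order, the stride of 1, and the absence of padding are assumptions made for brevity; the function name `conv2d_int` is hypothetical, and the comments mark which loops the text describes as unrolled in hardware.

```python
def conv2d_int(ifm, weights, k=3):
    """Direct convolution.
    ifm[c_in][y][x], weights[c_out][c_in][ky][kx] -> ofm[c_out][y][x]."""
    c_in, h, w = len(ifm), len(ifm[0]), len(ifm[0][0])
    c_out = len(weights)
    oh, ow = h - k + 1, w - k + 1
    ofm = [[[0] * ow for _ in range(oh)] for _ in range(c_out)]
    for co in range(c_out):              # loop 1: output channels (unrolled by 4 in the text)
        for ci in range(c_in):           # loop 2: input channels (accumulated one after another)
            for y in range(oh):          # loop 3: output rows
                for x in range(ow):      # loop 4: output columns
                    for ky in range(k):          # loops 5 and 6: kernel window,
                        for kx in range(k):      # fully unrolled in the MACU
                            ofm[co][y][x] += ifm[ci][y + ky][x + kx] * weights[co][ci][ky][kx]
    return ofm
```

In the accelerator, the two innermost loops collapse into one cycle of 9 multiplications reduced by 8 additions, and four iterations of the outermost loop run side by side on separate MACUs.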

Given the SoC platform, further parallelization strategies are possible. Intra-feature-map parallelism, however, is not free of data dependencies, because the MAC results of every input channel that target a pixel of the same output channel must all be accumulated. It would therefore likely consume many resources. Inter-convolution parallelism requires a large amount of data to be preloaded into on-chip buffers, which makes it unsuitable for implementation on an embedded device.

Regarding the data caching module according to an embodiment of the present invention, accessing off-chip memory for every MAC operation is expensive in both latency and energy. In addition, the on-chip memory in the PL region 520 is not large enough to hold all the parameters and intermediate results of the CNN. Data reuse in on-chip memory must therefore be maximized to minimize the number of off-chip memory accesses. Three data caching modules (FMCM, PCM, and OCAM) were used in the PL region 520.

In the MACU, to which the intra-convolution parallelization strategy is applied, a 3×3 convolution filter needs 9 pixels of the input feature map for each MAC operation. To prefetch the input feature map efficiently, the accelerator needs line buffers that can hold at least three lines. For convenient addressing, line buffers that can hold four lines of the input feature map were built. These line buffers are bundled together and called the feature map caching module (FMCM). To maximize the reuse of parameter data, an on-chip memory that can hold all parameters of the layer being processed was created; it is called the parameter caching module (PCM). The PCM is partitioned across the processing modules to support the inter-feature-map parallelization strategy. A further memory was created to accumulate the MAC results for each output feature map channel during the MAC operations; it is called the output channel accumulation module (OCAM).
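The role of the line buffer can be sketched in software as follows. This is only an illustration of the idea behind the FMCM, not its hardware design: the generator name `windows_3x3` and its `depth` parameter are hypothetical, and a single feature map channel is assumed.

```python
from collections import deque


def windows_3x3(rows, depth=4):
    """Stream feature-map rows through a buffer of `depth` lines and yield
    every 3x3 window as soon as three lines are available."""
    buf = deque(maxlen=depth)   # the oldest line is dropped automatically
    for row in rows:
        buf.append(row)
        if len(buf) >= 3:
            top, mid, bot = list(buf)[-3:]
            for x in range(len(row) - 2):
                yield [top[x:x + 3], mid[x:x + 3], bot[x:x + 3]]
```

Each row is fetched from memory once and then reused by up to three window rows. In this software model the fourth line is simply spare capacity; in hardware an extra line of this kind would let a new row be written while the three rows in use are being read, which is one plausible reading of the addressing convenience mentioned above.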

In the post-processing module according to an embodiment of the present invention, all operations other than the convolution are performed in the PPM. The PPM consists of shift operations, addition operations, and sign-bit comparison operations. The PPM receives the final output values of the OCAM as its input. As in the parallel processing module, the inter-feature-map parallelization strategy is applied with an unrolling factor of 4. The final output of the PPM is written back to DRAM to serve as the input of the next layer.

Figure 6 is a diagram for explaining the overall task scheduling process of an integer-arithmetic-only FPGA-based CNN accelerator according to an embodiment of the present invention.

Referring to Figure 6, before the CNN task starts, the CPU first uses precomputed network constraints to calculate the sizes of the data caching modules and the addresses of the layer and data regions (610). The calculated addresses and layer configurations are sent from the PS to the PL configuration registers over the AXI_LITE bus. The data caching modules (FMCM and PCM) use the addresses and layer configurations sent in advance to prefetch data from DRAM over the AXI_MEM bus (620). The prefetched data is loaded into the pixel buffer and the parameter buffer, and the MAC operations are performed in the parallel processing module (630). The OCAM stores and accumulates the intermediate results of the MAC operations. When accumulation is complete, the PPM performs the remaining operations (640). This data flow continues until all processing of the layer has finished (650).

As described above, the present invention proposes an FPGA-based CNN accelerator for real-time facial expression recognition. The accelerator is optimized for embedded systems and applies a new hardware-friendly quantization method.

According to an embodiment of the present invention, FERPlus-A is presented to improve facial expression recognition performance. It is a new training dataset generated by applying several image processing algorithms, such as bilateral filtering, Contrast Limited Adaptive Histogram Equalization (CLAHE), and edge enhancement.

According to an embodiment of the present invention, a lightweight CNN architecture is proposed for embedded devices. The proposed CNN uses a modified version of the SqueezeNet fire module as its basic computation module to optimize the number of operations and the memory footprint. Evaluated on the FERPlus test dataset, the proposed CNN achieved a facial expression recognition accuracy of about 86.58%. This is the highest reported performance among previous works that use lightweight CNNs.

According to an embodiment of the present invention, the Log Level Threshold Quantization (LLTQ) method, a new hardware-friendly quantization method, is proposed. The proposed method can replace the multiplications and divisions by the scaling factor with shift operations. The quantized parameters of the proposed CNN occupy about 0.39 MB, and the number of operations is about 28M integer operations (IOP). It was also confirmed that the CNN using only 8-bit integer arithmetic loses no accuracy compared with the 32-bit floating-point version.

According to an embodiment of the present invention, the proposed CNN optimized for embedded devices is implemented on a Xilinx ZC706 evaluation board, applying a parallelization strategy and an efficient data caching strategy for real-time facial expression recognition. The FPGA-based CNN accelerator uses only 8-bit integer arithmetic and utilizes 5,090 LUTs, 7,588 flip-flops, 61 BRAMs, and 49 DSPs. The accelerator achieved a frame rate of approximately 10 frames per second (FPS) while consuming 2.3 W of power.

The device described above may be implemented with hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described as being used, but those of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, those of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

(Deleted)

Claims

A method of operating an FPGA-based CNN accelerator implemented on a heterogeneous SoC (System on Chip) platform including a CPU (Processing System; PS) and an FPGA (Programmable Logic; PL), the method comprising:
generating, in the CPU, a training dataset that combines a plurality of image processing algorithms to improve CNN facial expression recognition performance, and converting parameters through floating-point training on the generated training dataset;
extracting, in the CPU, quantized parameters through quantization-aware training of the parameters converted through the floating-point training, and performing integer-arithmetic-only CNN reconstruction; and
performing real-time facial emotion recognition using the integer-arithmetic-only CNN in the FPGA and CPU regions of the heterogeneous SoC platform,
wherein the step of extracting the quantized parameters through quantization-aware training of the parameters converted through the floating-point training in the CPU and performing the integer-arithmetic-only CNN reconstruction comprises:
using the Learned Step Size Quantization (LSQ) method to set a trainable parameter as a reference factor and to train the reference factor together with the other parameters, so that no pre-calibration of the reference factor is needed in the quantization process;
using the Trained Quantization Threshold (TQT) method to set the trainable parameter to a logarithmic threshold (log₂ t) and to map the quantization factor to a power-of-two term, so that the multiplications and divisions involving the quantization factor are replaced with shift operations; and
using symmetric quantization while performing quantization-aware training (QAT) from pre-trained floating-point parameters, thereby eliminating the zero-point computation overhead that arises in uniform quantization, and introducing a Log Level Threshold Quantization (LLTQ) method that applies layer-wise scaling, that is, quantizes all elements of a given parameter or input tensor with a single reference factor.
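
Illustrative sketch (not part of the claim): the power-of-two, symmetric quantization recited above can be pictured with the forward pass below. It assumes the TQT-style convention in which the logarithmic threshold log₂ t is rounded up to an integer and the scale is 2^ceil(log₂ t) / 2^(b-1) for signed b-bit codes; the patent's LLTQ method may differ in detail. The function names (`pow2_exponent`, `quantize`, `dequantize`) are hypothetical, and the trainable update of log₂ t during QAT is omitted.

```python
import math
from typing import List, Tuple


def pow2_exponent(log2_t: float, bits: int = 8) -> int:
    """Exponent e such that the scale is exactly 2**e.

    Assumption (TQT-style): the log-threshold is rounded up so the clipping
    range is a power of two, then divided by the 2**(bits-1) signed levels.
    """
    return math.ceil(log2_t) - (bits - 1)


def quantize(values: List[float], log2_t: float, bits: int = 8) -> Tuple[List[int], int]:
    """Symmetric, per-tensor quantization: one scale, no zero point."""
    e = pow2_exponent(log2_t, bits)
    scale = 2.0 ** e
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    # Python's round() rounds halves to even; a hardware design may round differently.
    codes = [max(qmin, min(qmax, round(v / scale))) for v in values]
    return codes, e


def dequantize(codes: List[int], e: int) -> List[float]:
    """Recover approximate real values; multiplying by 2**e is a shift in hardware."""
    return [c * 2.0 ** e for c in codes]


if __name__ == "__main__":
    codes, e = quantize([0.5, -0.25, 1.0, 3.0], log2_t=0.0)
    print(codes, e, dequantize(codes, e))
```

Because the scale is a power of two and the mapping is symmetric, every rescale reduces to adding or subtracting exponents, which is what allows the multiplier and divider to be dropped in favor of shifts.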

The method of claim 1,
wherein the step of generating, in the CPU, the training dataset that combines a plurality of image processing algorithms to improve CNN facial expression recognition performance, and converting parameters through floating-point training on the generated training dataset,
performs computation through a basic computation block including a first computation module (FireA) and a second computation module (FireB),
each of the first computation module (FireA) and the second computation module (FireB) including:
a squeeze layer that reduces the number of channels of the feature map and thereby the number of operations in the subsequent expansion layer; and
an expansion layer that receives the output of the squeeze layer as input and expands the channels again,
wherein, to minimize accuracy degradation by maintaining the receptive field, the filters of the squeeze layer of the first computation module and the filters of the squeeze layer of the second computation module have different kernel sizes.
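
Illustrative sketch (not part of the claim): the saving from a squeeze layer followed by an expansion layer can be estimated by counting weights and multiply-accumulate operations (MACs). The channel counts, the 12x12 feature-map size, and the kernel sizes 1 and 3 below are hypothetical values chosen only for the arithmetic; the claim states only that the squeeze layers of FireA and FireB use different kernel sizes. Biases are ignored.

```python
from typing import Tuple


def conv_cost(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> Tuple[int, int]:
    """Weights and MACs of one k x k convolution (bias ignored)."""
    params = c_in * c_out * k * k
    macs = params * h_out * w_out
    return params, macs


def fire_cost(c_in: int, c_squeeze: int, c_out: int,
              k_squeeze: int, k_expand: int, h: int, w: int) -> Tuple[int, int]:
    """Cost of a squeeze layer (c_in -> c_squeeze) followed by an expansion layer."""
    sq_params, sq_macs = conv_cost(c_in, c_squeeze, k_squeeze, h, w)
    ex_params, ex_macs = conv_cost(c_squeeze, c_out, k_expand, h, w)
    return sq_params + ex_params, sq_macs + ex_macs


if __name__ == "__main__":
    # Hypothetical sizes: 64 -> 16 -> 64 channels on a 12 x 12 feature map.
    print("direct 3x3 conv :", conv_cost(64, 64, 3, 12, 12))
    print("module, 1x1 sq. :", fire_cost(64, 16, 64, 1, 3, 12, 12))
    print("module, 3x3 sq. :", fire_cost(64, 16, 64, 3, 3, 12, 12))
```

Under these assumed sizes, both squeeze variants need fewer weights than the direct convolution; the larger squeeze kernel costs more but sees a wider neighborhood, which is the receptive-field trade-off the claim refers to.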

The method of claim 2,
wherein, in the step of generating, in the CPU, the training dataset that combines a plurality of image processing algorithms to improve CNN facial expression recognition performance, and converting parameters through floating-point training on the generated training dataset,
the stride, which is the filter sampling interval of each of the squeeze layer and the expansion layer of the basic computation block, is set to 2 in order to halve the resolution of the feature map,
a CNN classifier including a plurality of convolutional layers and one global average pooling layer reduces the output parameters by placing the global average pooling layer, with a filter kernel size of 1, at the output stage,
the CNN classifier reduces the number of channels to the number of classes through a convolutional layer and then compresses the resolution of the feature map to one pixel, and
the convergence stability and performance of the CNN are improved by a batch normalization layer placed between the convolutional layer and the activation function.
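
Illustrative sketch (not part of the claim): the two size reductions recited above can be checked with a few lines of arithmetic. `conv_out_size` is the standard convolution output-size formula; the 48-pixel input, 3x3 kernel, and padding of 1 are hypothetical values. `global_average_pool` collapses each per-class feature map to a single value, after which the predicted class is the index of the largest value. All function names are assumptions made for this example.

```python
from typing import List

FeatureMap = List[List[float]]


def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def global_average_pool(maps: List[FeatureMap]) -> List[float]:
    """Reduce each H x W map (one per class) to its mean, i.e. to one pixel."""
    return [sum(sum(row) for row in m) / (len(m) * len(m[0])) for m in maps]


def predict(maps: List[FeatureMap]) -> int:
    """Index of the class whose pooled score is largest."""
    scores = global_average_pool(maps)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Stride 2 halves the resolution (hypothetical 48-pixel input, 3x3 kernel, padding 1).
    print(conv_out_size(48, kernel=3, stride=2, padding=1))
    # Three hypothetical 2 x 2 class maps.
    maps = [[[1, 2], [3, 4]], [[5, 5], [5, 5]], [[0, 0], [0, 8]]]
    print(global_average_pool(maps), predict(maps))
```

Since the pooling layer has no weights, ending the classifier this way adds no parameters beyond those of the convolution that sets the channel count to the number of classes.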

(Deleted)

An FPGA-based CNN accelerator implemented on a heterogeneous SoC (System on Chip) platform including a CPU (Processing System; PS) and an FPGA (Programmable Logic; PL),
wherein the CPU:
generates a training dataset that combines a plurality of image processing algorithms to improve CNN facial expression recognition performance, and converts parameters through floating-point training on the generated training dataset; and
extracts quantized parameters through quantization-aware training of the parameters converted through the floating-point training, and performs integer-arithmetic-only CNN reconstruction,
wherein real-time facial emotion recognition is performed using the integer-arithmetic-only CNN in the FPGA and CPU regions of the heterogeneous SoC platform, and
wherein the CPU,
in the process of extracting the quantized parameters through quantization-aware training of the parameters converted through the floating-point training and performing the integer-arithmetic-only CNN reconstruction:
uses the Learned Step Size Quantization (LSQ) method to set a trainable parameter as a reference factor and to train the reference factor together with the other parameters, so that no pre-calibration of the reference factor is needed in the quantization process;
uses the Trained Quantization Threshold (TQT) method to set the trainable parameter to a logarithmic threshold (log₂ t) and to map the quantization factor to a power-of-two term, so that the multiplications and divisions involving the quantization factor are replaced with shift operations; and
uses symmetric quantization while performing quantization-aware training (QAT) from pre-trained floating-point parameters, thereby eliminating the zero-point computation overhead that arises in uniform quantization, and introduces a Log Level Threshold Quantization (LLTQ) method that applies layer-wise scaling, that is, quantizes all elements of a given parameter or input tensor with a single reference factor.
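
Illustrative sketch (not part of the claim): once inputs, weights, and outputs all use power-of-two scales, an integer-only layer can rescale its accumulator with a single shift. The code below assumes that an integer code q with exponent e represents the real value q * 2^e, that rounding is done by adding half of the divisor before shifting, and that results saturate to signed 8 bits. The names `requantize_shift` and `int_dot` are hypothetical and do not describe the accelerator's actual datapath.

```python
from typing import List


def requantize_shift(acc: int, shift: int, bits: int = 8) -> int:
    """Scale an integer accumulator by 2**-shift and saturate to signed `bits`."""
    if shift > 0:
        # Add half the divisor, then arithmetic shift right: rounds to nearest.
        acc = (acc + (1 << (shift - 1))) >> shift
    elif shift < 0:
        acc = acc << -shift
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(qmin, min(qmax, acc))


def int_dot(x_q: List[int], w_q: List[int], e_x: int, e_w: int, e_y: int) -> int:
    """Integer-only dot product with power-of-two input, weight, and output scales."""
    acc = sum(a * b for a, b in zip(x_q, w_q))  # integer multiply-accumulate
    # The accumulator represents acc * 2**(e_x + e_w); the output code is that
    # value divided by 2**e_y, so the whole rescale is one shift.
    return requantize_shift(acc, e_y - (e_x + e_w))


if __name__ == "__main__":
    # Codes for x = [0.5, -0.25] and w = [0.125, 0.0625], all with exponent -7.
    print(int_dot([64, -32], [16, 8], e_x=-7, e_w=-7, e_y=-7))
```

No zero-point correction terms appear in the accumulator because the quantization is symmetric, which matches the overhead reduction the claim attributes to symmetric quantization.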

The FPGA-based CNN accelerator of claim 5,
wherein the CPU,
in the process of generating the training dataset that combines a plurality of image processing algorithms to improve CNN facial expression recognition performance, and converting parameters through floating-point training on the generated training dataset,
performs computation through a basic computation block including a first computation module (FireA) and a second computation module (FireB),
each of the first computation module (FireA) and the second computation module (FireB) including:
a squeeze layer that reduces the number of channels of the feature map and thereby the number of operations in the subsequent expansion layer; and
an expansion layer that expands the channels again,
wherein, to minimize accuracy degradation by maintaining the receptive field, the filters of the squeeze layer of the first computation module and the filters of the squeeze layer of the second computation module have different kernel sizes.

The FPGA-based CNN accelerator of claim 6,
wherein the CPU
sets the stride, which is the filter sampling interval of each of the squeeze layer and the expansion layer of the basic computation block, to 2 in order to halve the resolution of the feature map,
wherein a CNN classifier including a plurality of convolutional layers and one global average pooling layer reduces the output parameters by placing the global average pooling layer, with a filter kernel size of 1, at the output stage,
the CNN classifier reduces the number of channels to the number of classes through a convolutional layer and then compresses the resolution of the feature map to one pixel, and
the convergence stability and performance of the CNN are improved by a batch normalization layer placed between the convolutional layer and the activation function.

(Deleted)