KR20240047799A

KR20240047799A - Facial expression recognition method for low-spec devices

Info

Publication number: KR20240047799A
Application number: KR1020220127298A
Authority: KR
Inventors: 박세호; 이경택
Original assignee: 한국전자기술연구원
Priority date: 2022-10-05
Filing date: 2022-10-05
Publication date: 2024-04-12

Abstract

A facial emotion recognition method is disclosed. The method includes: acquiring, by a camera, an input image containing a face; detecting a face region from the input image using a Multi-task Cascaded Convolutional Network (MTCNN); and encoding the detected face region with an encoder to output an emotion recognition result.

Description

Facial emotion recognition method for low-spec devices {FACIAL EXPRESSION RECOGNITION METHOD FOR LOW-SPEC DEVICES}

The present invention relates to an emotion recognition method and, more specifically, to a facial emotion recognition method for low-spec devices.

Interest in emotion recognition based on deep learning has grown in recent years. Deep learning provides highly accurate inference (prediction) results, but it requires more computation than the conventional pipeline of key-point detection, feature extraction, and feature classification.

In particular, emotion recognition using a CNN (Convolutional Neural Network), one of the deep learning techniques, achieves high recognition accuracy, but it is difficult to apply to edge devices (e.g., smartphones), whose computing power is significantly lower than that of general servers or computers.

To solve the above problem, an object of the present invention is to provide an emotion recognition method that still uses deep learning but reduces the amount of computation by taking into account the characteristics of emotion recognition on edge devices such as smartphones.

To this end, a method is presented in which, in a first step, a face is recognized using a lightweight object recognition algorithm called MTCNN (Multi-task Cascaded Convolutional Networks), and, in a second step, emotion is recognized from the face using a CNN-based encoder with a small number of model parameters, such as MobileNet.

In addition, a method is presented for speeding up all of these computations by performing them in FP16 (16-bit floating point).

A facial emotion recognition method according to one aspect of the present invention for achieving the above object includes: acquiring, by a camera, an input image including a face; detecting a face region from the input image using a Multi-task Cascaded Convolutional Network (MTCNN); and encoding the detected face region with an encoder to output an emotion recognition result.

An edge device according to another aspect of the present invention includes: a camera that acquires an input image including a face; and a processor that controls the operation of a facial emotion recognition model. The facial emotion recognition model includes: a Multi-task Cascaded Convolutional Network (MTCNN) that detects a face region from the input image; and a CNN-based MobileNet (MobileNet v3) that encodes the detected face region and outputs an emotion recognition result.

In an embodiment, the multi-task cascaded convolutional network and the MobileNet (MobileNet v3) may be CNNs trained with a mixed precision training technique based on 16-bit floating point (FP16).

According to the present invention, by using a Multi-task Cascaded Convolutional Network (MTCNN) and MobileNet that compute in FP16 (16-bit floating point) for emotion recognition, the computation can run at high speed even on low-spec edge devices.

FIG. 1 is a block diagram of an edge device according to an embodiment of the present invention.
FIG. 2 is a configuration diagram of a CNN-based facial emotion recognition model operating under the control of the processor shown in FIG. 1.
FIG. 3 is a flowchart showing a facial emotion recognition method for a low-spec device according to an embodiment of the present invention.

The terms used herein are used only to describe specific embodiments and are not intended to limit the invention. Singular expressions include plural expressions unless the context clearly dictates otherwise. In this application, terms such as "comprise" or "have" are intended to designate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the technical field to which the present invention pertains. Terms defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related technology, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

The present invention uses a network that is CNN-based yet computationally light, and additionally uses FP16 to roughly double the computation speed compared with the conventional FP32 approach. In this way, it aims to make CNN-based models easier to use in real environments.

In addition, considering that emotion recognition is in many cases used on edge devices such as mobile phones, the present invention describes a training method for a Convolutional Neural Network (CNN)-based emotion recognition model with a light computational load.

First, a CNN-based encoder with a small number of parameters, such as MobileNet, is used; second, mixed precision is used. An advantage of these two measures is that they reduce the computation needed not only to train the model but also to run the CNN-based emotion recognition model on edge devices such as mobile phones, where emotion recognition is frequently used in practice.

Differences from prior art

For facial emotion recognition, the prior art used computationally heavy object recognition algorithms such as Faster-RCNN, so even though performance was high, it was poorly suited to practical environments such as edge devices. In contrast, the present invention proposes using an object recognition algorithm with a low computational load for face recognition.

In addition, whereas the prior art recognizes emotions from faces using a general CNN-based encoder, the present invention proposes classifying emotions from the recognized face using a CNN-based encoder with a low computational load.

In addition, whereas the prior art uses FP32 for both face recognition and emotion recognition, the present invention proposes increasing speed by using FP16.

In addition, a method is presented for improving performance by using recent training techniques not used in the prior art, such as focal loss, LARS, and a cosine decay scheduler.

Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the attached drawings. To facilitate overall understanding, the same reference numerals are used for the same components in the drawings, and duplicate descriptions of the same components are omitted.

FIG. 1 is a block diagram of an edge device according to an embodiment of the present invention.

Referring to FIG. 1, the edge device 100 according to an embodiment of the present invention is a low-spec computing device and may be configured to perform CNN-based facial emotion recognition. Examples of low-spec edge devices include smartphones, tablet PCs, wearable devices, and smart watches.

To perform CNN-based facial emotion recognition, the edge device 100 may include a processor 110, a memory 120, a display unit 130, a camera 140, a communication module 150, a storage medium 160, and a system bus 170 connecting them. It may further include components that can be mounted on a typical computing device, such as an input/output interface, a speaker, a microphone, a rechargeable battery, an antenna, and a lamp.

The processor 110 is a low-spec processor that controls the operation of the CNN-based facial emotion recognition model used to perform facial emotion recognition. It may be implemented as, for example, at least one CPU, at least one GPU, or a microcontroller unit (MCU) or system-on-chip (SoC) including them. The processor 110 may also control and manage the operations of the peripheral components 120 to 160, and may control the training operation of the CNN-based facial emotion recognition model.

The memory 120 may be a volatile memory that temporarily stores the instructions, code, functions, values, and the like required to operate and execute the CNN-based facial emotion recognition model. It may supply such instructions, code, functions, and values to the processor 110 at the processor's request, and may temporarily store intermediate data or result data processed by the processor 110.

The display unit 130 displays, under the control of the processor 110, the processing of the facial emotion recognition model and the facial emotion recognition results predicted (inferred) by the model, and may be implemented as, for example, an LCD or LED display. The display unit 130 may also have an input function for receiving user input required to execute and operate the facial emotion recognition model. Here, the input function may be a touch function.

The camera 140 may, under the control of the processor 110, photograph the face of the user who is the target of facial emotion recognition and acquire face video and/or face images as a sequence of frames.

The communication module 150 is a component that supports wired or wireless communication with an external device and can transmit facial emotion recognition results to the external device. Here, external devices include edge devices, servers, and the like. Wireless communication includes, for example, wireless Internet, mobile communication (3G LTE, 4G, 5G), Bluetooth, and Wi-Fi, and wired communication includes, for example, USB communication.

The storage medium 160 may be a non-volatile storage medium configured to store a pre-trained CNN-based facial emotion recognition model, or an updated facial emotion recognition model, received from an external device such as a server through the communication module 150.

FIG. 2 is a configuration diagram of a CNN-based facial emotion recognition model operating under the control of the processor shown in FIG. 1.

Referring to FIG. 2, the CNN-based facial emotion recognition model 200 includes a CNN-based Multi-task Cascaded Convolutional Network (MTCNN) 210 that detects a face region in an input image containing a face acquired by the camera 140, and a CNN-based encoder 230 that performs emotion recognition by encoding the face region detected by the MTCNN 210. It may further include a loss function block 220 that calculates a loss or loss function (e.g., focal loss) of the MTCNN 210. Here, the CNN-based encoder may be MobileNet (MobileNet v3), and more preferably a MobileNet v3 that encodes the detected face region in FP16 (16-bit floating point).

Above all, unlike conventional deep learning models, which are trained with single precision training based on 32-bit floating point (FP32), the MTCNN 210 is trained with a mixed precision training technique based on 16-bit floating point (FP16), which reduces the amount of computation of the MTCNN 210.

In this FP16-based mixed precision training technique, the master weights are first converted from FP32 to FP16, and forward propagation and backward propagation are performed with the FP16 copy of the master weights to obtain the weight gradients. The weight gradients are then converted back to FP32 and applied to the FP32 master weights, thereby updating the master weights.
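The update cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented training code: the model is a toy linear regressor, the helper names `to_fp16` and `mixed_precision_step` are hypothetical, FP16 rounding is emulated with the standard-library `struct` half-precision format, and ordinary Python floats stand in for the FP32 master weights.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def mixed_precision_step(master_w, x, y, lr=0.1):
    """One SGD step on a toy linear model y ~ w.x (illustrative assumption).

    master_w: high-precision master weights (stand-in for the FP32 copy)
    x, y:     one training sample (feature list, scalar target)
    """
    # 1. Cast the master weights (and inputs) down to FP16 for the compute pass.
    w16 = [to_fp16(w) for w in master_w]
    x16 = [to_fp16(v) for v in x]
    # 2. Forward propagation with FP16 values.
    pred = to_fp16(sum(to_fp16(w * v) for w, v in zip(w16, x16)))
    err = to_fp16(pred - y)
    # 3. Backward propagation in FP16: d(0.5 * err^2)/dw_i = err * x_i.
    grads16 = [to_fp16(err * v) for v in x16]
    # 4. Cast gradients back up and apply them to the master weights.
    return [w - lr * float(g) for w, g in zip(master_w, grads16)]


if __name__ == "__main__":
    weights = [0.0, 0.0]
    for _ in range(20):
        weights = mixed_precision_step(weights, [1.0, 2.0], 1.0)
    print(weights)
```

Keeping the master copy in higher precision means that small updates are not lost to FP16 rounding between steps, while the forward and backward passes still run on the cheaper 16-bit values.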

Meanwhile, the MTCNN 210, trained with the FP16-based mixed precision training technique to reduce the amount of computation, may include a P-Net 211, an R-Net 213, and an O-Net 215 in order to detect a face region in the input image received from the camera 140.

P-Net(211)

A preprocessor (not shown) that resizes the input image to generate an image pyramid may be provided in front of the P-Net 211. For example, when a 300 x 200 image is input, the preprocessor creates a list of images resized to 200 x 166, 100 x 66, and 30 x 20. This is done so that even small faces can be detected.

The image pyramid generated by the preprocessor is input to the P-Net 211.
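As a rough sketch of how such a pyramid might be planned, the function below lists the image sizes obtained by repeatedly shrinking the input until its shorter side would fall below the 12-pixel window. The function name `pyramid_sizes` and the shrink factor of 0.709 are assumptions made for illustration; the text does not state the scale factors, so the sizes produced here differ from the example figures quoted above.

```python
def pyramid_sizes(width, height, min_side=12, factor=0.709):
    """List (width, height) pairs for an image pyramid, largest first.

    min_side and factor are illustrative assumptions, not values from the text.
    """
    sizes = []
    scale = 1.0
    # Keep shrinking while a min_side x min_side window still fits.
    while min(width, height) * scale >= min_side:
        sizes.append((int(width * scale), int(height * scale)))
        scale *= factor
    return sizes


if __name__ == "__main__":
    for w, h in pyramid_sizes(300, 200):
        print(f"{w} x {h}")
```

Each entry would then be produced by an actual image resize before being passed to the first network stage.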

The P-Net 211 can receive a small image of, for example, 12 x 12 x 3. Using only convolutions (no fully connected layer), it returns three results: a face classification indicating whether the region is a face; four bounding box regression values giving the x and y coordinates of the upper-left corner of the face region and the width and height of the box; and ten landmark localization values giving the x and y coordinates of both eyes, the nose, and both corners of the mouth.

The P-Net 211 receives the image pyramid generated earlier and scans each image with a 12x12 window to find regions corresponding to faces. Because the window is set to a very small size of 12x12, even small faces can be found reliably.
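The 12x12 scan can be pictured as enumerating window positions over one pyramid level. The sketch below is only a conceptual illustration: the name `window_origins` and the stride of 2 pixels are assumptions, and a convolution-only network can produce the equivalent of this scan in a single pass rather than by explicit cropping.

```python
def window_origins(width, height, window=12, stride=2):
    """Top-left corners of every window x window patch scanned over an image.

    The stride is an illustrative assumption; the text only fixes the
    window size at 12x12.
    """
    return [
        (x, y)
        for y in range(0, height - window + 1, stride)
        for x in range(0, width - window + 1, stride)
    ]


if __name__ == "__main__":
    origins = window_origins(30, 20)
    print(len(origins), origins[0], origins[-1])
```

On the smallest pyramid level a 12x12 window covers a large share of the original image, which is why large faces are found on small levels and small faces on large ones.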

The P-Net 211 then maps the face regions found in this way back to the original image size. For example, the coordinates of a face region found in the 30x20 image are converted to the corresponding coordinates in the 300x200 image. Non-Maximum Suppression (NMS) and bounding box regression are then applied to the resulting boxes. Here, NMS refers to the process in which, when the same face has been boxed multiple times, only the box with the highest probability of being a face is kept and the others are removed.
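The two post-processing steps named here, mapping boxes back to original coordinates and NMS, can be sketched as follows. This is a generic greedy NMS written for illustration: the corner-based box format `(x1, y1, x2, y2)`, the function names, and the IoU threshold of 0.5 are assumptions not taken from the text.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rescale_box(box, scale):
    """Map a box found in a downscaled pyramid image back to original coordinates."""
    return tuple(v / scale for v in box)


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    # Visit boxes from the highest face probability to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not heavily overlap a box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    print(nms(boxes, [0.9, 0.8, 0.7]))
```

In the example, the second box overlaps the first heavily and has a lower score, so it is dropped, while the distant third box survives as a separate face candidate.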

R-Net(213)

As described above, the P-Net 211 yields a list of boxes estimated to be faces. The R-Net 213 identifies which of these boxes correspond to real faces and performs bounding box regression more precisely.

First, all of the boxes obtained earlier are resized to 24x24 and then passed through the R-Net 213. The structure of the R-Net 213 is as follows.

The R-Net 213 is very similar to the P-Net 211 but differs in that it uses a fully connected (FC) layer. This contrasts with the P-Net 211 described above, in which the FC layer was excluded and the network was built from convolutional layers only, to avoid losing location information.

The P-Net 211 is responsible for guessing which parts of the whole image correspond to faces, and the R-Net 213 refines those guesses. The boxes found by the R-Net 213 are likewise mapped back to the original input image size, NMS and bounding box regression (BBR) are applied, and only the surviving boxes are input to the O-Net 215.

O-Net(215)

The O-Net 215 receives all of the boxes found by the R-Net 213, resized to 48x48. The input size is increased from stage to stage in order to find more abstract information about the face. After passing through several convolutional layers and FC layers, the network produces three kinds of output, which become the final Face Detection (face region) and Face Alignment results.
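The fixed input sizes of the later stages (24x24 for the R-Net, 48x48 for the O-Net) mean that every candidate crop has to be resized before it is passed on. A minimal sketch of that step is shown below, assuming nearest-neighbour interpolation and an image stored as a list of pixel rows; both choices, and the name `resize_nearest`, are illustrative and not specified in the text.

```python
def resize_nearest(image, out_w, out_h):
    """Resize a 2D image (list of rows) with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


if __name__ == "__main__":
    crop = [[1, 2], [3, 4]]
    rnet_input = resize_nearest(crop, 24, 24)   # stage-2 input size
    onet_input = resize_nearest(crop, 48, 48)   # stage-3 input size
    print(len(rnet_input), len(rnet_input[0]), len(onet_input), len(onet_input[0]))
```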

The face region detected by the MTCNN 210, or by the O-Net 215 of the MTCNN 210, is input to the CNN-based encoder 230, and the CNN-based encoder 230 encodes the face region and outputs the resulting vector as the emotion recognition result. Because the CNN-based encoder 230 is implemented with MobileNet (MobileNet v3), it runs significantly faster than other CNN-based encoders in low-spec device environments such as edge devices.

The loss function block 220 calculates the gradient by applying focal loss to the vector encoded by the CNN-based encoder 230 implemented with MobileNet (MobileNet v3). Here, the focal loss (L_focal) can be calculated by Equation 1.
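Equation 1 is not reproduced in this text. As a point of reference only, the sketch below uses the commonly published form of focal loss, L_focal = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. Whether the patent's Equation 1 matches this form exactly is an assumption, as are the default values `gamma=2.0` and `alpha=1.0` and the function name.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample, in its commonly published form (assumed).

    probs:  predicted class probabilities (should sum to 1)
    target: index of the true class
    """
    # Clamp to avoid log(0) for degenerate predictions.
    p_t = min(max(probs[target], 1e-12), 1.0)
    # (1 - p_t)^gamma down-weights samples that are already classified well.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    print(focal_loss([0.9, 0.1], 0))  # easy sample, small loss
    print(focal_loss([0.6, 0.4], 0))  # harder sample, larger loss
```

With `gamma=0` and `alpha=1` this form reduces to ordinary cross-entropy, which makes the effect of the modulating factor easy to compare.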

The gradient update uses Stochastic Gradient Descent with LARS, and the learning rate is decayed in a cosine form after a linear warm-up, which can improve performance.
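The schedule and the layer-wise scaling mentioned here can be sketched as two small functions. The step counts, the trust coefficient, and the function names are hypothetical, and the LARS ratio follows the commonly published formulation rather than anything stated in the text.

```python
import math


def lr_at(step, total_steps, warmup_steps, base_lr):
    """Learning rate with linear warm-up followed by cosine decay."""
    if step < warmup_steps:
        # Linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def lars_trust_ratio(w_norm, g_norm, trust_coef=0.001, weight_decay=0.0):
    """Layer-wise LARS scaling factor: trust_coef * |w| / (|g| + wd * |w|).

    Falls back to 1.0 when either norm is zero (an illustrative choice).
    """
    denom = g_norm + weight_decay * w_norm
    if w_norm == 0.0 or denom == 0.0:
        return 1.0
    return trust_coef * w_norm / denom


if __name__ == "__main__":
    for step in (0, 9, 10, 55, 100):
        print(step, round(lr_at(step, 100, 10, 1.0), 4))
```

In a training loop, each layer's step size would be the scheduled learning rate multiplied by that layer's trust ratio, so layers with large weights and small gradients are not starved of updates.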

As described above, even when a deep learning model is computed in FP16, the batch normalization part generally has to be computed in FP32. The MTCNN 210, however, has no normalization computation, so the computation becomes even faster when FP16 is used.

In addition, because MobileNet (MobileNet v3) is a CNN-based encoder designed with smartphone use in mind, it is significantly faster than other existing CNN-based encoders.

Whereas the prior art improved the performance of face recognition and face-based emotion recognition by using deep learning, the present invention likewise adopts deep learning to keep performance high, and additionally applies a network structure that is fast even among deep learning approaches, together with 16-bit floating point computation, to significantly increase computation speed.

Given that emotion recognition technology has considerable potential for use on edge devices, increasing computation speed through FP16 computation, the MTCNN 210, and the MobileNet 230 can make it easier to apply deep learning to low-spec edge devices.

FIG. 3 is a flowchart showing a facial emotion recognition method for a low-spec device according to an embodiment of the present invention.

Referring to FIG. 3, first, in S310, the camera 140 acquires an input image including a face.

Next, in S320, the MTCNN 210 detects a face region from the input image.

Next, in S330, the encoder 230 encodes the detected face region and outputs an emotion recognition result.
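The three steps S310 to S330 amount to a short pipeline, sketched below with placeholder callables. Everything specific here is an assumption for illustration: the detector and encoder are passed in as plain functions standing in for the MTCNN 210 and the encoder 230, boxes use an `(x, y, width, height)` layout, the image is a list of pixel rows, and the step of turning the encoder's output vector into a label by taking the highest score is not stated in the text.

```python
def crop(image, box):
    """Cut the (x, y, width, height) region out of an image stored as rows."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


def recognize_emotions(image, detect_faces, encode_emotion, labels):
    """Run detection (S320) and encoding (S330) on an acquired image (S310).

    detect_faces:   callable returning a list of face boxes (detector stand-in)
    encode_emotion: callable returning one score per label (encoder stand-in)
    labels:         hypothetical emotion names, one per score
    """
    results = []
    for box in detect_faces(image):                 # S320: face regions
        scores = encode_emotion(crop(image, box))   # S330: encode the face
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((box, labels[best]))
    return results


if __name__ == "__main__":
    frame = [[0.0] * 8 for _ in range(8)]           # S310: placeholder frame
    fake_detector = lambda img: [(1, 1, 4, 4)]
    fake_encoder = lambda face: [0.1, 0.7, 0.2]
    print(recognize_emotions(frame, fake_detector, fake_encoder,
                             ["neutral", "happy", "sad"]))
```

Passing the two models in as callables keeps the control flow independent of any particular framework, so the same loop applies whether the models run in FP16 or FP32.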

In an embodiment, S320 may be a process of detecting the face region by computing on the input image in FP16 (16-bit floating point).

In an embodiment, S330 may be a process of encoding the detected face region by computing in FP16 (16-bit floating point).

In an embodiment, S330 may be a process in which the encoder, implemented with the CNN-based MobileNet (MobileNet v3), encodes the detected face region and outputs an emotion recognition result.

In an embodiment, before S310, a process of training the MTCNN 210 with a mixed precision training technique based on 16-bit floating point (FP16), and a process of training the encoder implemented with the CNN-based MobileNet (MobileNet v3) with the same FP16-based mixed precision training technique, may be performed.

Although the embodiments have been described above with reference to a limited set of examples and drawings, those skilled in the art can make various modifications and variations based on the above description. For example, appropriate results can be achieved even if the described techniques are performed in a different order from the described method, and/or components of the described system, structure, device, circuit, and the like are coupled or combined in a different form from the described method, or are replaced or substituted by other components or equivalents.

Claims

1. A facial emotion recognition method comprising:
acquiring, by a camera, an input image including a face;
detecting a face region from the input image using a Multi-task Cascaded Convolutional Network (MTCNN); and
encoding, by an encoder, the detected face region to output an emotion recognition result.

2. The method of claim 1,
wherein detecting the face region comprises
detecting the face region by computing on the input image in FP16 (16-bit floating point).

3. The method of claim 1,
wherein outputting the emotion recognition result comprises
encoding the detected face region by computing in FP16 (16-bit floating point).

4. The method of claim 1,
wherein outputting the emotion recognition result comprises
encoding the detected face region and outputting the emotion recognition result by the encoder implemented with a CNN-based MobileNet (MobileNet v3).

5. The method of claim 1, further comprising,
before acquiring the input image,
training the MTCNN with a mixed precision training technique based on 16-bit floating point (FP16).

6. The method of claim 1, further comprising,
before acquiring the input image,
training the encoder implemented with a CNN-based MobileNet (MobileNet v3) with a mixed precision training technique based on 16-bit floating point (FP16).

7. An edge device comprising:
a camera that acquires an input image including a face; and
a processor that controls the operation of a facial emotion recognition model,
wherein the facial emotion recognition model includes:
a Multi-task Cascaded Convolutional Network (MTCNN) that detects a face region from the input image; and
a CNN-based MobileNet (MobileNet v3) that encodes the detected face region and outputs an emotion recognition result.

8. The edge device of claim 7,
wherein the multi-task cascaded convolutional network and the MobileNet v3
are CNNs trained with a mixed precision training technique based on 16-bit floating point (FP16).