KR20210018600A

KR20210018600A - System for recognizing facial expression

Info

Publication number: KR20210018600A
Application number: KR1020190095385A
Authority: KR
Inventors: 심현철; 최동윤; 이민규; 송병철
Original assignee: 현대자동차주식회사; 인하대학교 산학협력단; 기아자동차주식회사
Priority date: 2019-08-06
Filing date: 2019-08-06
Publication date: 2021-02-18

Abstract

According to one embodiment of the present invention, a facial expression recognition system simultaneously extracts, from an input video, a local appearance feature from the individual frames and a scene feature carrying spatiotemporal information from the whole sequence. The scene feature captures the overall change of facial expression across the video, while the appearance feature carries the detailed information of each frame. By fusing the scene feature and the appearance feature at an early stage, the system can recognize facial expressions robustly not only in video from an artificial environment but also in video from real environments that include pose changes and low illuminance.

Description

Facial expression recognition system {SYSTEM FOR RECOGNIZING FACIAL EXPRESSION}

The present invention relates to a facial expression recognition system and, more particularly, to a facial expression recognition system that enables robust facial expression recognition on data acquired in real environments.

Facial expression recognition is a technology that continues to attract attention in computer vision and human-computer interaction. It can be applied in various fields such as in-vehicle driver monitoring, human-robot interaction, psychology, and digital entertainment.

Although facial expression recognition has advanced considerably with the recent development of machine learning such as deep learning, robust operation remains difficult, because real environments contain various factors that degrade performance, such as occlusion, complex motion, and low light.

Early facial expression recognition techniques mainly classified the emotion of a person in a single image, but research on analyzing the emotion of a person in a video has recently become active. A variety of data sets are also available for training and validating video-based facial expression recognition.

FIG. 1 shows an example of images from the CK+ (Extended Cohn-Kanade) data set used in conventional facial expression recognition techniques, and FIG. 2 shows facial expressions appearing in videos captured in a real environment.

For example, as shown in FIG. 1, most images in the CK+ data set show subjects facing the camera and expressing posed, unnatural emotions. Moreover, because the frontal face images were acquired under constant illumination, pose and illumination are not considered in algorithm development. However, videos captured in real environments, as shown in FIG. 2, exhibit a variety of poses and illumination levels, so a facial expression recognition technique that is robust to them is required.

The matters described above as background art are intended only to aid understanding of the background of the present invention and should not be taken as an admission that they constitute prior art already known to those of ordinary skill in the art.

KR 10-2007-0012395 A

Accordingly, the technical problem to be solved by the present invention is to provide a facial expression recognition system that can robustly recognize facial expressions in data acquired in real environments, such as those with varied poses and low light.

As a means for solving the above technical problem, the present invention provides a facial expression recognition system comprising:

a first neural network that extracts, from each frame of an input video, an appearance feature including local information;

a second neural network that extracts, from the sequence of the video, a scene feature including spatiotemporal information; and

a third neural network that performs emotion classification by fusing the appearance feature and the scene feature.

One embodiment of the present invention may further include a preprocessor that detects a face in the video, crops and aligns the detected region, and, when the video is shorter than a preset threshold, lengthens the video by applying a frame interpolation technique.

In one embodiment of the present invention, the first neural network may be a fine-tuned two-dimensional convolutional neural network based on DenseNet.

In one embodiment of the present invention, the first neural network may be a DenseNet having a fully-connected layer.

In one embodiment of the present invention, the DenseNet may be pre-trained on the ImageNet data set and then fine-tuned by retraining on the FER2013 data set, which consists of emotion data.

In one embodiment of the present invention, the second neural network may be a three-dimensional (3D) convolutional neural network.

In one embodiment of the present invention, the 3D convolutional neural network may be pre-trained on the Sports-1M data set or the Kinetics data set.

In one embodiment of the present invention, the 3D convolutional neural network may include an auxiliary classifier comprising a fully-connected layer, batch normalization, dropout, and ReLU.

In one embodiment of the present invention, the parameters of the convolution blocks in the 3D convolutional neural network may be fixed so that they are not trained, and only its fully-connected layers and the auxiliary classifier may be trained.

In one embodiment of the present invention, the first neural network may be pre-trained with emotion-related data, and the second neural network may be pre-trained with action-recognition-related data.

In one embodiment of the present invention, the third neural network may be a scene-aware recurrent neural network, a temporal network that fuses the scene feature and the appearance feature at the feature level.

In one embodiment of the present invention, a long short-term memory (LSTM) model may be applied to the scene-aware recurrent neural network.

In one embodiment of the present invention, the LSTM model may use the scene feature extracted by the second neural network as its hidden state.

According to the facial expression recognition system, information on temporal change across the overall scene of a video is extracted effectively, so that facial expressions can be recognized robustly even under low illumination and pose changes. As a result, robust facial expression recognition is possible not only for particular images acquired in an artificial environment but also for images acquired in real environments.

In addition, according to the facial expression recognition system, early fusion allows the features to be combined synergistically while preserving the unique characteristics of each signal.

The effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those of ordinary skill in the art from the description below.

FIG. 1 is a diagram showing an example of images from the CK+ (Extended Cohn-Kanade) data set used in conventional facial expression recognition techniques.
FIG. 2 is a diagram showing facial expressions appearing in videos captured in a real environment.
FIG. 3 is a block diagram of a facial expression recognition system according to one embodiment of the present invention.
FIG. 4 is a block diagram showing the structure of a DenseNet having two fully-connected layers, employed as the first neural network in one embodiment of the present invention.
FIG. 5 is a block diagram showing the structure of a 3D convolutional neural network applied as the second neural network of a facial expression recognition system according to one embodiment of the present invention.
FIG. 6 is a block diagram showing the structure of an LSTM, the model of the scene-aware recurrent neural network applied as the third neural network of a facial expression recognition system according to one embodiment of the present invention.

Hereinafter, facial expression recognition systems according to various embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The facial expression recognition technique applied in these systems uses the video signal in two respects.

One aspect is the frame-level spatial information contained in the video signal. Because each frame carries only a spatial signal, appearance and geometry information, such as a person's expression at a specific point in time, can be extracted from it. This information corresponds to the details to be captured.

The other aspect is the video sequence itself, which carries spatiotemporal information. Because a sequence inherently contains temporal information, it provides information on changes in expression and on atmosphere. The whole sequence therefore corresponds to the overall context to be captured, that is, information about the overall scene. Various embodiments of the present invention provide a network that can accurately extract these two kinds of information from a video signal and fuse them effectively.

FIG. 3 is a block diagram of a facial expression recognition system according to one embodiment of the present invention.

Referring to FIG. 3, a facial expression recognition system according to one embodiment of the present invention may include a first neural network (20) that extracts, from each frame of a video signal, an appearance feature including detailed local information; a second neural network (30) that extracts, from the video sequence contained in the video signal, a scene feature including spatiotemporal information; and a third neural network (40) that performs emotion classification by fusing the appearance feature extracted by the first neural network (20) with the scene feature extracted by the second neural network (30).

In addition, the facial expression recognition system according to one embodiment of the present invention may further include a preprocessor (10) that detects a face in the input video signal and crops and aligns the frames or interpolates frames.
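The data flow among the preprocessor (10) and the three networks (20, 30, 40) can be sketched as follows. This is only an illustrative outline: the function name `recognize_expression`, the type aliases, and the callable signatures are hypothetical and do not come from the specification, and each stage is passed in as a placeholder callable rather than a real network.

```python
from typing import Callable, List, Sequence

Frame = List[List[float]]   # hypothetical: one grayscale frame as rows of pixels
Feature = List[float]       # hypothetical: a feature vector


def recognize_expression(
    video: Sequence[Frame],
    preprocess: Callable[[Sequence[Frame]], Sequence[Frame]],
    appearance_net: Callable[[Frame], Feature],
    scene_net: Callable[[Sequence[Frame]], Feature],
    fusion_net: Callable[[Sequence[Feature], Feature], int],
) -> int:
    """Return an emotion class index for one video."""
    frames = preprocess(video)                           # preprocessor (10)
    appearance = [appearance_net(f) for f in frames]     # first network (20), per frame
    scene = scene_net(frames)                            # second network (30), whole sequence
    return fusion_net(appearance, scene)                 # third network (40), fusion
```

The point of the sketch is the asymmetry between the two branches: the appearance branch is applied once per frame, while the scene branch sees the whole sequence once.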

Each component of the facial expression recognition system according to one embodiment of the present invention is described in more detail below. The description clarifies the features of each component and the effects of the system that arise from the way the components work together.

Preprocessor (10)

The preprocessor (10) processes the video signal so that the first and second neural networks (20, 30) downstream of it can use the signal easily.

The preprocessor (10) may perform simple face detection and alignment and/or frame interpolation on the video signal.

For example, an in-the-wild data set such as the AFEW (Acted Facial Expressions in the Wild) data set, which can be used to train and/or test the first and second neural networks (20, 30), contains many videos with low illumination and occlusion. For robust face detection, the preprocessor (10) may therefore apply a known convolution-based face detection method such as multitask cascaded convolutional networks. After detecting a face with the multitask cascaded convolutional network, the preprocessor (10) may crop the detected region and provide it as the input to the first and second neural networks (20, 30).

In addition, when the input video is shorter than a preset threshold, so short that temporal features cannot be derived, the preprocessor (10) may lengthen the input video by applying a known frame interpolation technique such as separable convolution based frame interpolation.
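The lengthening step can be illustrated with a deliberately simplified stand-in. The sketch below does not implement separable convolution based frame interpolation; it inserts the average of each pair of neighbouring frames, which is a much cruder technique chosen only to show the control flow. The function name `lengthen_clip`, the flattened frame representation, and the `min_len` parameter are hypothetical.

```python
from typing import List

Frame = List[float]  # hypothetical: a flattened frame; real frames are 2-D or 3-D arrays


def lengthen_clip(frames: List[Frame], min_len: int) -> List[Frame]:
    """Insert blended frames between neighbours until len(result) >= min_len."""
    if len(frames) < 2:
        raise ValueError("need at least two frames to interpolate")
    out = list(frames)
    while len(out) < min_len:
        denser: List[Frame] = []
        for a, b in zip(out, out[1:]):
            # Stand-in for a learned interpolator: plain average of the two frames.
            mid = [(x + y) / 2.0 for x, y in zip(a, b)]
            denser.extend([a, mid])
        denser.append(out[-1])
        out = denser  # each pass turns n frames into 2n - 1 frames
    return out
```

A clip that already meets the threshold is returned unchanged, so the function can be applied to every input.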

First neural network (20)

Transfer learning has recently been proposed in deep learning as a remedy for insufficient training data. When transfer learning is applied, a model trained in a source domain is known to improve performance in a target domain. In one embodiment of the present invention, the first neural network (20) may be implemented as a fine-tuned two-dimensional convolutional neural network (2D CNN) that serves as a transfer learning tool for exploiting learned knowledge of facial expressions. In one embodiment of the present invention, the feature extracted by the first neural network (20), the fine-tuned 2D CNN, is referred to as the appearance feature.

In particular, one embodiment of the present invention may employ DenseNet as the fine-tuned 2D CNN so that the appearance features of the face can be extracted effectively. DenseNet is the neural network introduced in G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, "Densely connected convolutional networks," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, pp. 3, 2017. According to that paper, DenseNet averages the feature maps through a global average pooling (GAP) layer and then proceeds directly to classification. In one embodiment of the present invention, because a feature vector carrying rich information must be fed to the network, the GAP layer in DenseNet may be replaced by two fully-connected (FC) layers.

FIG. 4 shows the structure of the DenseNet having two fully-connected (FC) layers that is employed as the first neural network (20).
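The motivation for dropping the GAP layer can be seen in a small pure-Python sketch. It assumes, as one possible reading, that the FC layers consume the flattened feature maps; the specification does not state how the feature maps are fed to the FC layers, and the names `global_average_pool` and `flatten` are hypothetical.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # channels x height x width


def global_average_pool(fmap: FeatureMap) -> List[float]:
    """Collapse each channel to its mean: C x H x W -> C values."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def flatten(fmap: FeatureMap) -> List[float]:
    """Keep every activation: C x H x W -> C*H*W values for an FC layer (assumed)."""
    return [x for ch in fmap for row in ch for x in row]
```

GAP keeps one number per channel and discards the spatial layout, whereas the flattened vector retains every activation, which is consistent with the stated goal of passing on a feature vector that carries more information.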

In one embodiment of the present invention, DenseNet may be fine-tuned in two steps. First, the standard DenseNet is pre-trained on the ImageNet data set (O. Russakovsky, J. Deng, H. Su, et al., "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, vol. 114, no. 3, pp. 211-252, 2015). The pre-trained DenseNet is publicly available at the code level. Next, DenseNet is fine-tuned by retraining it on FER2013, a data set of emotion data (I. J. Goodfellow, et al., "Challenges in representation learning: A report on three machine learning contests," in International Conference on Neural Information Processing, pp. 117-124, 2013). Through this two-step fine-tuning, the knowledge from the first step is transferred to the emotion domain of the second step. As a result, the fine-tuned DenseNet can operate on all test videos of the AFEW data set at the inference stage and can extract 4096-dimensional appearance features.

Second neural network (30)

FIG. 5 is a block diagram showing the structure of the second neural network of a facial expression recognition system according to one embodiment of the present invention.

In one embodiment of the present invention, a 3D convolutional neural network such as that shown in FIG. 5 may be used as the second neural network (30). Because the 3D convolutional neural network takes the entire sequence of the input video signal as its input, it can capture the overall visual scene of the input video sequence. In one embodiment of the present invention, the feature obtained from the 3D convolutional neural network is used as a kind of scene and is referred to as the scene feature.

In one embodiment of the present invention, known networks such as C3D (D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of IEEE International Conference on Computer Vision, pp. 4489-4497, 2015) and ResNet3D and ResNeXt3D (K. Hara, H. Kataoka, and Y. Satoh, "Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 18-22, 2018) may be employed as the 3D convolutional neural network. As with the first neural network (20), transfer learning may be applied to the 3D convolutional neural network selected as the second neural network (30). In general, C3D may be pre-trained on the Sports-1M data set (A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725-1732, 2014), and ResNet3D and ResNeXt3D may be pre-trained on the Kinetics data set (W. Kay, J. Carreira, K. Simonyan, et al., "The kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017). Like the first neural network (20), the second neural network (30) uses a pre-trained model, but its pre-training uses action-recognition-related data rather than emotion-related data. This is because more dynamic videos can provide richer scene features, and because the aim is to capture motion information such as changes in facial expression.

In one embodiment of the present invention, on the assumption that the convolution blocks of the 3D convolutional neural network already extract motion information sufficiently well, the parameters of all convolution blocks may be fixed during fine-tuning so that they are not trained, and only the fully-connected (FC) layers and the auxiliary classifier may be trained.
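The freezing rule can be written down without committing to any framework. The sketch below is a minimal bookkeeping model, not the training code of the embodiment: the `Layer` record, its `kind` labels, and the layer names in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    name: str
    kind: str              # hypothetical labels: "conv", "fc", or "aux"
    trainable: bool = True


def freeze_conv_blocks(layers: List[Layer]) -> List[str]:
    """Freeze every convolution layer and return the names of layers still trained."""
    for layer in layers:
        layer.trainable = layer.kind != "conv"
    return [layer.name for layer in layers if layer.trainable]
```

For a network described as `[Layer("conv1", "conv"), Layer("fc6", "fc"), Layer("aux_fc", "aux")]`, only `fc6` and `aux_fc` remain trainable. In a framework such as PyTorch the same effect is commonly obtained by setting `requires_grad = False` on the parameters of the frozen layers.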

Meanwhile, in one embodiment of the present invention, an auxiliary classifier (31) may be added to the final layer of the second neural network (30), that is, the 3D convolutional neural network. The auxiliary classifier (31) not only mitigates the vanishing gradient problem but also has a regularizing effect during training, which stabilizes training and improves performance. As shown in FIG. 5, the auxiliary classifier (31) may have a structure including an FC layer, batch normalization (BN), dropout, and ReLU, followed finally by a softmax function.
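A minimal inference-time forward pass through such a head might look as follows. The ordering FC, BN, ReLU, dropout, FC, softmax is an assumption, since the text lists the components without fixing their order, and all function and parameter names are hypothetical. Batch normalization uses stored running statistics and dropout is the identity, as is usual at inference.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, w: Matrix, b: Vector) -> Vector:
    """Fully-connected layer: one output per row of w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def batch_norm_eval(x: Vector, mean: Vector, var: Vector, eps: float = 1e-5) -> Vector:
    """Batch normalization with fixed running statistics (no learned scale or shift)."""
    return [(xi - m) / math.sqrt(v + eps) for xi, m, v in zip(x, mean, var)]


def relu(x: Vector) -> Vector:
    return [max(0.0, xi) for xi in x]


def softmax(x: Vector) -> Vector:
    top = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(xi - top) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]


def aux_classifier_eval(x: Vector, w1: Matrix, b1: Vector, mean: Vector,
                        var: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Return class probabilities; dropout is omitted because it is the identity here."""
    hidden = relu(batch_norm_eval(linear(x, w1, b1), mean, var))
    return softmax(linear(hidden, w2, b2))
```

During training, dropout would randomly zero some hidden activations and batch normalization would use per-batch statistics; neither is modelled in this sketch.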

Third neural network (40)

The facial expression recognition system according to one embodiment of the present invention uses, as the third neural network (40), a scene-aware recurrent neural network (RNN): a temporal network that fuses, at the feature level, the scene feature derived from the whole sequence with the appearance feature, which is the detailed local feature of each video frame. The scene feature from the 3D convolutional neural network contains temporal information that is absent from individual frames, so it can be used as context. Fusing the two features through this scene-aware recurrent neural network is a form of early fusion and preserves the unique information of each feature.

In one embodiment of the present invention, LSTM (long short-term memory) (S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997) may be used as the model of the scene-aware recurrent neural network.

FIG. 6 is a block diagram showing the structure of the LSTM, the model of the scene-aware recurrent neural network applied as the third neural network of a facial expression recognition system according to one embodiment of the present invention.

An LSTM, which is well suited to sequence data, learns while appropriately balancing the relationship between current and previous information. With the current input denoted x_t and the hidden state denoted h_t, a typical LSTM initializes the hidden state to zero at the start (that is, h_0 = 0). In the LSTM model applied to the scene-aware recurrent neural network according to one embodiment of the present invention, however, the scene feature extracted by the second neural network (30), the 3D convolutional neural network, is simply used as the hidden state. The advantage of this connection is that temporal information that individual frames cannot capture becomes available, because the overall visual scene encoded in the scene feature serves as the temporally preceding information. Moreover, as shown in FIG. 6, it can be implemented simply, without changing the LSTM equations or structure. The scene feature v can be summarized as follows:

[Equation 1]

[Equation 2]

Here, V denotes the input video sequence, F_θ(·) is the function representing the second neural network, that is, the 3D convolutional neural network, and θ denotes all of its learnable parameters.
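The bodies of Equations 1 and 2 are not reproduced in this text. One reading consistent with the surrounding definitions is v = F_θ(V) together with h_0 = v; this is an assumption, not a quotation. Under that assumption, the sketch below runs an otherwise standard LSTM whose initial hidden state is the scene feature. It further assumes that the scene feature has the same dimension as the hidden state and that the cell state starts at zero; the function names and the `params` layout (one `(W, U, b)` triple per gate) are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Gate = Tuple[Matrix, Matrix, Vector]  # (W for the input, U for the hidden state, bias)


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def _affine(w: Matrix, u: Matrix, b: Vector, x: Vector, h: Vector) -> Vector:
    """Compute W x + U h + b for one gate."""
    return [
        sum(wi * xi for wi, xi in zip(w_row, x))
        + sum(ui * hi for ui, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(w, u, b)
    ]


def scene_aware_lstm(xs: Sequence[Vector], scene: Vector,
                     params: Dict[str, Gate]) -> Vector:
    """Run a standard LSTM over xs with h_0 set to the scene feature; return the last h."""
    h = list(scene)            # h_0 = v instead of the usual zero vector
    c = [0.0] * len(scene)     # cell state starts at zero (assumption)
    for x in xs:               # xs: per-frame appearance features in temporal order
        i = [_sigmoid(z) for z in _affine(*params["i"], x, h)]   # input gate
        f = [_sigmoid(z) for z in _affine(*params["f"], x, h)]   # forget gate
        o = [_sigmoid(z) for z in _affine(*params["o"], x, h)]   # output gate
        g = [math.tanh(z) for z in _affine(*params["g"], x, h)]  # candidate cell
        c = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
        h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
    return h
```

Only the initialization differs from a conventional LSTM; the gate equations are untouched, which matches the statement that no change to the LSTM equations or structure is needed.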

이상에서 설명한 것과 같이 구성되고 작동하는 본 발명의 일 실시형태에 따른 얼굴 표정 인식 시스템은, 다음과 같은 전체 학습과정을 가질 수 있다.The facial expression recognition system according to an embodiment of the present invention configured and operated as described above may have the following entire learning process.

본 발명의 일 실시형태에 따른 얼굴 표정 인식 시스템에서 이루어지는 학습은 두 단계로 이루어질 수 있다.Learning performed in the facial expression recognition system according to an embodiment of the present invention may be performed in two steps.

첫 단계에는, 제1 신경망(20)인 FC 레이어를 갖는 DenseNet 에 대한 학습이 이루어지고 학습의 결과로서 입력 동영상의 프레임에 대한 외형 특징(appearance feature)이 추출될 수 있다.In the first step, learning about the DenseNet having the FC layer, which is the first neural network 20, is performed, and as a result of the learning, an appearance feature for a frame of an input video may be extracted.

두번째 단계는, 제2 신경망(30)인 보조 분류기(31)를 갖는 삼차원 콘볼루션 신경망과 제3 신경망(40)인 장면 인식 리커런트 신경망(SA-RNN)의 학습이 이루어질 수 있다. 전술한 바와 같이 사전 학습된(pre-trained) 삼차원 콘볼루션 신경망의 FC 레이어와 보조 분류기(31)를 제외한 모든 레이어의 파라미터들은 고정하여 콘볼루션 레이어의 기능을 유지하게 하므로, 실제로 학습되는 파라미터는 삼차원 콘볼루션 신경망의 FC 레이어 및 보조 분류기(31)와 장면 인식 리커런트 신경망(SA-RNN)의 파라미터들이다. 본 발명의 일 실시형태에서, 보조 분류기를 포함한 전체 구조의 손실 함수로서 크로스-엔트로피(cross-entropy) 함수가 사용될 수 있으며 크로스-엔트로피 함수를 사용하는 경우 최종적인 손실 함수 L_TOTAL은 다음과 같다.In the second step, the 3D convolutional neural network having the auxiliary classifier 31 as the second neural network 30 and the scene recognition recurrent neural network SA-RNN as the third neural network 40 may be trained. As described above, the parameters of all layers except the FC layer and the auxiliary classifier 31 of the pre-trained 3D convolutional neural network are fixed to maintain the function of the convolutional layer, so the parameters actually learned are 3D These are the parameters of the FC layer of the convolutional neural network and the auxiliary classifier 31 and the scene recognition recurrent neural network (SA-RNN). In an embodiment of the present invention, a cross-entropy function may be used as a loss function of the entire structure including an auxiliary classifier, and when the cross-entropy function is used, the final loss function L _TOTAL is as follows.

[Equation 3]
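The formula itself did not survive in this text. A form consistent with the symbol definitions given for Equation 3 (an overall-network loss plus a weighted auxiliary-classifier loss) would be the following; it is a reconstruction under that assumption, not the published equation:

```latex
L_{TOTAL} = L + \lambda \, L_{aux}
```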

Here, L is the loss function of the entire network, L_aux is the loss function of the auxiliary classifier, and λ is a hyper-parameter that determines how strongly L_aux is weighted. Training may be performed so as to minimize Equation 3, the final loss function (objective function).
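The role of λ can be illustrated with a minimal sketch. It assumes that the final loss is the main cross-entropy loss plus λ times the auxiliary cross-entropy loss, which is inferred from the symbol definitions rather than quoted from the specification. The function names, argument names, and the value 0.5 for λ are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy of one predicted distribution against an integer class label."""
    eps = 1e-12  # guards against log(0)
    return -math.log(max(probs[label], eps))

def total_loss(main_probs, aux_probs, label, lam=0.5):
    """Assumed form of Equation 3: L_TOTAL = L + lam * L_aux.

    main_probs: class probabilities from the full network (hypothetical name)
    aux_probs:  class probabilities from the auxiliary classifier (hypothetical name)
    lam:        weighting hyper-parameter; 0.5 is an arbitrary illustration value
    """
    loss_main = cross_entropy(main_probs, label)
    loss_aux = cross_entropy(aux_probs, label)
    return loss_main + lam * loss_aux

# Example with three emotion classes and true label 0
main = [0.7, 0.2, 0.1]
aux = [0.5, 0.3, 0.2]
print(total_loss(main, aux, label=0, lam=0.5))
```

Setting `lam` to 0 reduces the objective to the main loss alone, while larger values make the auxiliary classifier's error count more heavily during training.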

The facial expression recognition system according to various embodiments of the present invention effectively extracts temporal change information for the overall scene of a video, and can therefore recognize facial expressions robustly even under low illuminance and pose changes. As a result, robust facial expression recognition can be achieved not only for particular videos acquired in an artificial environment but also for videos acquired in a real environment.

In addition, the facial expression recognition system according to various embodiments of the present invention can, through early fusion, combine the signals so that they act synergistically while the distinctive characteristics of each signal are preserved.
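As a rough illustration of feature-level (early) fusion, the sketch below feeds per-frame appearance features into an LSTM whose hidden state is initialized with the scene feature, which is one reading of the claims. The layer sizes, the seven output classes, the random weights, the zero initial cell state, the omission of bias terms, and the names `classify`, `lstm_step`, and `params` are assumptions made for illustration and are not taken from the specification.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def lstm_step(x, h, c, params):
    """One LSTM step on the concatenated vector [x; h] (biases omitted for brevity)."""
    z = x + h  # list concatenation of input and previous hidden state
    i = [sigmoid(v) for v in matvec(params["Wi"], z)]    # input gate
    f = [sigmoid(v) for v in matvec(params["Wf"], z)]    # forget gate
    o = [sigmoid(v) for v in matvec(params["Wo"], z)]    # output gate
    g = [math.tanh(v) for v in matvec(params["Wg"], z)]  # candidate cell value
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

def classify(appearance_seq, scene_feature, params):
    """Fuse the scene feature with the appearance sequence and return class probabilities."""
    h = list(scene_feature)         # scene feature used as the initial hidden state
    c = [0.0] * len(scene_feature)  # assumption: zero initial cell state
    for x in appearance_seq:
        h, c = lstm_step(list(x), h, c, params)
    return softmax(matvec(params["Wout"], h))

rng = random.Random(0)
D, H, K, T = 4, 3, 7, 5  # hypothetical sizes: feature dim, hidden dim, classes, frames
params = {name: rand_matrix(H, D + H, rng) for name in ("Wi", "Wf", "Wo", "Wg")}
params["Wout"] = rand_matrix(K, H, rng)
frames = [[rng.random() for _ in range(D)] for _ in range(T)]  # stand-in appearance features
scene = [rng.random() for _ in range(H)]                       # stand-in scene feature
probs = classify(frames, scene, params)
print(len(probs), round(sum(probs), 6))
```

Because the scene feature enters as recurrent state rather than being averaged with the classifier outputs afterwards, both signals influence every gate computation, which is the sense in which the fusion here is "early".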

Although the present invention has been shown and described above in connection with specific embodiments, it will be apparent to those of ordinary skill in the art that the invention may be variously improved and modified within the scope of the claims.

10: preprocessing unit
20: first neural network (DenseNet with a fully-connected layer)
30: second neural network (three-dimensional convolutional neural network with an auxiliary classifier)
31: auxiliary classifier
40: third neural network (scene-aware recurrent neural network)

Claims

A facial expression recognition system comprising:
a first neural network that extracts appearance features including region information from each frame of an input video;
a second neural network that extracts scene features including spatiotemporal information from the sequence of the video; and
a third neural network that performs emotion classification by fusing the appearance features and the scene features.

The facial expression recognition system according to claim 1,
wherein a face in the video is recognized, the recognized region is cropped and aligned, and, when the video is shorter than a preset criterion, a frame interpolation technique is applied to increase the length of the video.

The facial expression recognition system according to claim 1,
wherein the first neural network is a fine-tuned two-dimensional convolutional neural network based on DenseNet.

The facial expression recognition system according to claim 3,
wherein the first neural network is a DenseNet having a fully-connected layer.

The facial expression recognition system according to claim 3 or 4,
wherein the DenseNet is pre-trained on the ImageNet data set and then fine-tuned by retraining on the FER2013 data set, which consists of emotion data.

The facial expression recognition system according to claim 1,
wherein the second neural network is a 3D convolutional neural network.

The facial expression recognition system according to claim 6,
wherein the 3D convolutional neural network is pre-trained on the Sports-1M data set or the Kinetics data set.

The facial expression recognition system according to claim 6,
wherein the 3D convolutional neural network comprises an auxiliary classifier including a fully-connected layer, batch normalization, dropout, and a ReLU.

The facial expression recognition system according to claim 8,
wherein, in the 3D convolutional neural network, the parameters of the convolutional blocks included therein are fixed so that the convolutional blocks are not trained, and only the fully-connected layer included therein and the auxiliary classifier are trained.

The facial expression recognition system according to claim 1,
wherein the first neural network is pre-trained using emotion-related data, and the second neural network is pre-trained using action-recognition-related data.

The facial expression recognition system according to claim 1,
wherein the third neural network is a scene-aware recurrent neural network, which is a temporal network in which the scene features and the appearance features are fused at the feature level.

The facial expression recognition system according to claim 11,
wherein the scene-aware recurrent neural network applies a Long Short-Term Memory (LSTM) model.

The facial expression recognition system according to claim 12,
wherein the LSTM model uses the scene features extracted by the second neural network as a hidden state.