KR20230088616A

KR20230088616A - Vision transformer-based facial expression recognition apparatus and method

Info

Publication number: KR20230088616A
Application number: KR1020210177151A
Authority: KR
Inventors: 고병철; 안다솜
Original assignee: 계명대학교 산학협력단
Priority date: 2021-12-11
Filing date: 2021-12-11
Publication date: 2023-06-20

Abstract

The present invention relates to a vision transformer-based facial expression recognition apparatus. More specifically, the apparatus comprises: a learning unit that constructs a facial expression recognition model by combining a CNN and a Vision Transformer (ViT); and a recognition unit that recognizes a facial expression in an input image using the facial expression recognition model. The learning unit constructs the facial expression recognition model by training, on a training dataset of facial expressions, a model that includes: a feature extraction unit that extracts a feature map from a face image using the CNN; a patch construction unit that divides the feature map into patches to form embedding patches; and an expression classification unit that classifies the facial expression by inputting the embedding patches to the vision transformer.
The present invention also relates to a vision transformer-based facial expression recognition method. More specifically, the method, each step of which is performed by a computer, comprises: (1) constructing a facial expression recognition model by combining a CNN and a Vision Transformer (ViT); and (2) recognizing a facial expression in an input image using the facial expression recognition model. In step (1), the facial expression recognition model is constructed by training, on a training dataset of facial expressions, a model that includes: a feature extraction unit that extracts a feature map from a face image using the CNN; a patch construction unit that divides the feature map into patches to form embedding patches; and an expression classification unit that classifies the facial expression by inputting the embedding patches to the vision transformer.
According to the vision transformer-based facial expression recognition apparatus and method proposed in the present invention, precise facial expression changes are extracted as a feature map using a CNN, and the extracted feature map is input to a vision transformer to recognize the facial expression. As a result, the vision transformer can be successfully applied to facial expression recognition, and subtle changes in facial expression can be detected, improving facial expression recognition performance.

Description

Facial expression recognition device and method based on vision transformer {VISION TRANSFORMER-BASED FACIAL EXPRESSION RECOGNITION APPARATUS AND METHOD}

The present invention relates to a facial expression recognition apparatus and method, and more particularly, to a vision transformer-based facial expression recognition apparatus and method.

Humans express emotional signals in a variety of ways, such as language, facial expressions, and speech. Among these, facial expressions are the most important part of emotional expression, so if facial expressions can be recognized accurately, the results can be used in many fields. Facial Expression Recognition (FER) is therefore attracting attention in fields such as computer vision, driver monitoring, healthcare, games, AR, and VR.

FIG. 1 is a diagram illustrating facial expression recognition. As shown in FIG. 1, research on recognizing facial expressions from images by applying machine learning techniques to FER is actively under way.

When people look at a photograph, they focus on certain parts of it. If it contains a person, they focus on the person; if it contains an object, they focus on the object. Conventional machine learning techniques such as CNNs, however, turn the entire image into a single feature map, so the region of focus cannot be selected separately and irrelevant results may be output. The attention mechanism is a way to overcome this shortcoming.

FIG. 2 is a diagram illustrating the attention mechanism. Just as people focus on certain parts of a photograph, the attention mechanism identifies the parts to focus on and can thereby improve prediction performance.

Earlier attempts to apply Self-Attention to image processing were inefficient in terms of hardware. However, the Self-Attention implemented in the Vision Transformer (ViT) achieved strong image recognition performance, so recent research has focused more on ViT-based image recognition than on CNN-based image recognition. ViT also offers good computational efficiency and scalability, allowing it to use memory efficiently and to improve speed.

However, because ViT models the dependencies between different parts of an image and is suited to capturing overall relationships, its strength lies in global image classification. It is therefore not well suited to classifying fine-grained images such as facial expressions, and a method that can recognize facial expressions efficiently using ViT needs to be developed.

Meanwhile, in relation to the present invention, Registered Patent No. 10-2188970 (Title of Invention: Facial Expression Recognition Method and Apparatus Based on Lightweight Multilayer Random Forest, Registration Date: December 3, 2020) and others have been disclosed.

The present invention has been proposed to solve the above problems of previously proposed methods. Its object is to provide a vision transformer-based facial expression recognition apparatus and method that extract precise facial expression changes as a feature map using a CNN and input the extracted feature map to a vision transformer to recognize the facial expression, thereby successfully applying the vision transformer to facial expression recognition and improving recognition performance by detecting subtle changes in facial expression.

To achieve the above object, a vision transformer-based facial expression recognition apparatus according to a feature of the present invention is

a facial expression recognition apparatus comprising:

a learning unit that constructs a facial expression recognition model by combining a CNN and a Vision Transformer (ViT); and

a recognition unit that recognizes a facial expression in an input image using the facial expression recognition model,

wherein the learning unit

constructs the facial expression recognition model by training, on a training dataset of facial expressions, a model that includes:

a feature extraction unit that extracts a feature map from a face image using the CNN;

a patch construction unit that divides the feature map into patches to form embedding patches; and

an expression classification unit that classifies the facial expression by inputting the embedding patches to the vision transformer.

Preferably, the feature extraction unit

may extract the feature map representing facial expression changes from the face image using ResNet50, a CNN-based network.

More preferably, the learning unit

may construct the facial expression recognition model by combining a ResNet50 pre-trained on ImageNet-1k with a vision transformer obtained by further pre-training a pre-trained ViT-B/16 on ImageNet-21k.

Even more preferably, the learning unit

may perform training using cross-entropy as the loss function.

Preferably, the vision transformer

consists of a ViT encoder that includes Multi-Head Attention and a Multi-Layer Perceptron (MLP), and may compute the correlations between the embedding patches.

More preferably, in the ViT encoder,

Layer-Norm may be applied before the multi-head attention and before the multi-layer perceptron, respectively.

More preferably, the expression classification unit

may classify the facial expression by applying the output of the ViT encoder to an MLP and a Softmax function.

Preferably, the expression classification unit

may classify facial expressions into seven classes: anger, contempt, disgust, fear, happy, sadness, and surprise.

To achieve the above object, a vision transformer-based facial expression recognition method according to a feature of the present invention is

a facial expression recognition method in which each step is performed by a computer, comprising:

(1) constructing a facial expression recognition model by combining a CNN and a Vision Transformer (ViT); and

(2) recognizing a facial expression in an input image using the facial expression recognition model,

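wherein, in step (1), the facial expression recognition model is constructed by training, on a training dataset of facial expressions, a model that includes: a feature extraction unit that extracts a feature map from a face image using the CNN; a patch construction unit that divides the feature map into patches to form embedding patches; and an expression classification unit that classifies the facial expression by inputting the embedding patches to the vision transformer.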

More preferably, in step (1),

the facial expression recognition model may be constructed using a ResNet50 pre-trained on ImageNet-1k and a vision transformer obtained by further pre-training a pre-trained ViT-B/16 on ImageNet-21k.

Even more preferably, in step (1), training may be performed using cross-entropy as the loss function.

According to the vision transformer-based facial expression recognition apparatus and method proposed in the present invention, precise facial expression changes are extracted as a feature map using a CNN, and the extracted feature map is input to a vision transformer to recognize the facial expression. As a result, the vision transformer can be successfully applied to facial expression recognition, and subtle changes in facial expression can be detected, improving facial expression recognition performance.

FIG. 1 is a diagram illustrating facial expression recognition.
FIG. 2 is a diagram illustrating the attention mechanism.
FIG. 3 is a diagram showing the overall configuration of a vision transformer-based facial expression recognition apparatus according to an embodiment of the present invention.
FIG. 4 is a diagram showing the configuration of the facial expression recognition model of a vision transformer-based facial expression recognition apparatus according to an embodiment of the present invention.
FIG. 5 is a diagram showing the overall structure of the facial expression recognition model of a vision transformer-based facial expression recognition apparatus according to an embodiment of the present invention.
FIG. 6 is a diagram showing the configuration of the ViT encoder used in a vision transformer-based facial expression recognition apparatus according to an embodiment of the present invention.
FIG. 7 is a diagram showing the configuration of the MLP used in a vision transformer-based facial expression recognition apparatus according to an embodiment of the present invention.
FIG. 8 is a diagram showing the flow of a vision transformer-based facial expression recognition method according to an embodiment of the present invention.
FIG. 9 is a diagram showing the flow of the process of recognizing a facial expression in an input image using a vision transformer-based facial expression recognition method according to an embodiment of the present invention.
FIG. 10 is a diagram comparing a vision transformer-based facial expression recognition apparatus and method according to an embodiment of the present invention with state-of-the-art FER methods.

Hereinafter, preferred embodiments will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can easily practice it. In describing the preferred embodiments, however, a detailed description of a related known function or configuration is omitted where it could unnecessarily obscure the gist of the present invention. In addition, the same reference numerals are used throughout the drawings for parts with similar functions and actions.

In addition, throughout the specification, when a part is said to be 'connected' to another part, this includes not only the case where it is 'directly connected' but also the case where it is 'indirectly connected' with another element in between. Also, 'including' a component means that other components may be further included, rather than excluded, unless specifically stated otherwise.

FIG. 3 is a diagram showing the overall configuration of a vision transformer-based facial expression recognition apparatus 100 according to an embodiment of the present invention. As shown in FIG. 3, the apparatus 100 may include a learning unit 110 that constructs a facial expression recognition model 200 by combining a CNN and a Vision Transformer (ViT), and a recognition unit 120 that recognizes a facial expression in an input image using the facial expression recognition model 200.

That is, the learning unit 110 may build a model by combining a CNN and a ViT, and generate the facial expression recognition model by training that model on a training dataset of facial expressions. The recognition unit 120 may recognize a facial expression in an input image using the facial expression recognition model that the learning unit 110 generated through training.

FIG. 4 is a diagram showing the configuration of the facial expression recognition model 200 of the vision transformer-based facial expression recognition apparatus 100 according to an embodiment of the present invention, and FIG. 5 is a diagram showing the overall structure of that model. As shown in FIGS. 4 and 5, the facial expression recognition model 200 may include: a feature extraction unit 210 that extracts a feature map from a face image using a CNN; a patch construction unit 220 that divides the feature map into patches to form embedding patches; and an expression classification unit 230 that classifies the facial expression by inputting the embedding patches to the vision transformer.

That is, because ViT models the dependencies between different parts of an image and is suited to capturing overall relationships, its strength lies in global image classification, and it is therefore not well suited to classifying fine-grained images such as facial expressions. To solve this problem, the present invention places a CNN in the stage preceding the ViT and uses the feature map extracted through the last layer of the CNN as the input to the ViT. Hereinafter, the facial expression recognition model 200 of the vision transformer-based facial expression recognition apparatus 100 according to an embodiment of the present invention will be described in detail with reference to FIGS. 4 and 5.

The feature extraction unit 210 may extract a feature map from a face image using a CNN. More specifically, the feature extraction unit 210 may use ResNet50, a CNN-based network, to extract from the face image a feature map representing facial expression changes. That is, since identifying the features within the face is what matters for accurate facial expression recognition, the feature extraction unit 210 may use ResNet50 as the backbone network for extracting a feature map of those features.

The patch construction unit 220 may divide the feature map into patches to form embedding patches. The feature map extracted through ResNet50 in the feature extraction unit 210 [symbol: equation image not reproduced] is divided according to a fixed patch size n to form a set of patches [expression: equation image not reproduced]. Here, ω is the class token information. The patch construction unit 220 forms the embedding patches by embedding position information into the patches, and may input these embedding patches to the ViT.
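
The patch construction step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the feature map is taken to be an H x W x C nested list, patches are non-overlapping n x n blocks that are simply flattened, and the class token and position embeddings are fixed toy values (in a trained model both would be learned parameters, and a linear projection would normally follow the flattening). The names `split_into_patches` and `build_embedding_patches` are hypothetical.

```python
from typing import List

Vector = List[float]


def split_into_patches(feature_map: List[List[Vector]], n: int) -> List[Vector]:
    """Split an H x W x C feature map into flattened, non-overlapping n x n patches."""
    height, width = len(feature_map), len(feature_map[0])
    if height % n or width % n:
        raise ValueError("feature map size must be divisible by the patch size n")
    patches: List[Vector] = []
    for top in range(0, height, n):
        for left in range(0, width, n):
            flat: Vector = []
            for row in range(top, top + n):
                for col in range(left, left + n):
                    flat.extend(feature_map[row][col])
            patches.append(flat)
    return patches


def build_embedding_patches(
    patches: List[Vector], class_token: Vector, position_embeddings: List[Vector]
) -> List[Vector]:
    """Prepend the class token, then add one position embedding per token."""
    tokens = [class_token] + patches
    if len(tokens) != len(position_embeddings):
        raise ValueError("need exactly one position embedding per token")
    return [
        [value + offset for value, offset in zip(token, position)]
        for token, position in zip(tokens, position_embeddings)
    ]


# Toy 4 x 4 feature map with 2 channels: each cell stores (row index, column index).
toy_map = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
toy_patches = split_into_patches(toy_map, n=2)  # 4 patches of 2 * 2 * 2 = 8 values
dim = len(toy_patches[0])
class_token = [0.0] * dim  # assumption: zero-initialised class token
positions = [[0.01 * i] * dim for i in range(len(toy_patches) + 1)]  # toy values
embedded = build_embedding_patches(toy_patches, class_token, positions)
print(len(embedded), len(embedded[0]))  # 5 tokens (class token + 4 patches), 8 values each
```

The sequence handed to the encoder is one token longer than the number of patches because of the prepended class token.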

The expression classification unit 230 may classify the facial expression by inputting the embedding patches to the vision transformer. The transformer is a deep learning model that uses an attention mechanism. It was originally used for natural language processing, but since the advent of the vision transformer it has also been used in computer vision. A transformer used in natural language processing consists of an encoder and a decoder, but image classification requires only latent feature extraction, so the facial expression recognition model 200 can be built with a ViT that consists of an encoder only.

FIG. 6 is a diagram showing the configuration of the ViT encoder used in the vision transformer-based facial expression recognition apparatus 100 according to an embodiment of the present invention. As shown in FIG. 6, the vision transformer used in the apparatus 100 consists of a ViT encoder that includes Multi-Head Attention and a Multi-Layer Perceptron (MLP), and may compute the correlations between the embedding patches. Depending on the embodiment, the ViT encoder may alternate multi-head attention and MLP, and the configuration shown in FIG. 6 may be repeated as many times as the number of layers to form a multi-layer structure. In the ViT encoder, the Self-Attention mechanism can be applied through the multi-head attention.
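
To make "computing the correlations between embedding patches" concrete, the following is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration under simplifying assumptions, not the encoder of FIG. 6: the query, key, and value projections are taken to be the identity (a real ViT learns separate projection matrices for each), and the multi-head variant simply slices each token into equal chunks and concatenates the per-head results without an output projection. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(values: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Single-head attention with identity Q/K/V projections (an assumption)."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted sum of the value vectors (here: the tokens themselves).
        outputs.append(
            [sum(a * value[d] for a, value in zip(attn, tokens)) for d in range(dim)]
        )
    return outputs, weights


def multi_head_self_attention(tokens: List[Vector], num_heads: int) -> List[Vector]:
    """Run attention independently on equal slices of each token, then concatenate."""
    dim = len(tokens[0])
    if dim % num_heads:
        raise ValueError("token dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    per_head = []
    for head in range(num_heads):
        sliced = [t[head * head_dim:(head + 1) * head_dim] for t in tokens]
        per_head.append(self_attention(sliced)[0])
    return [[x for head_out in per_head for x in head_out[i]] for i in range(len(tokens))]


tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
_, weights = self_attention(tokens)
mixed = multi_head_self_attention(tokens, num_heads=2)
```

Each row of `weights` sums to 1 and says how strongly one patch attends to every other patch, which is the pairwise relationship the encoder exploits.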

In addition, as shown in FIG. 6, Layer-Norm may be applied in the ViT encoder before the multi-head attention and before the multi-layer perceptron, respectively. Layer-Norm normalizes across the neurons of the same layer: whereas Batch-Norm computes its normalization statistics over the samples of a mini-batch, Layer-Norm computes them over the neurons of a layer instead.
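
The difference between the two normalizations can be shown with a small plain-Python sketch. It assumes the simplest form of each operation, without the learned scale and shift parameters that practical implementations add; `layer_norm` and `batch_norm` are hypothetical helper names.

```python
import math
from typing import List

Vector = List[float]


def layer_norm(token: Vector, eps: float = 1e-6) -> Vector:
    """Normalize one sample across its own features (neurons of the layer)."""
    mean = sum(token) / len(token)
    variance = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(variance + eps) for x in token]


def batch_norm(batch: List[Vector], eps: float = 1e-6) -> List[Vector]:
    """Normalize each feature across the samples of a mini-batch."""
    count, dim = len(batch), len(batch[0])
    means = [sum(sample[d] for sample in batch) / count for d in range(dim)]
    variances = [
        sum((sample[d] - means[d]) ** 2 for sample in batch) / count for d in range(dim)
    ]
    return [
        [(sample[d] - means[d]) / math.sqrt(variances[d] + eps) for d in range(dim)]
        for sample in batch
    ]


batch = [[1.0, 10.0], [3.0, 30.0]]
per_sample = [layer_norm(sample) for sample in batch]  # statistics per row
per_feature = batch_norm(batch)                        # statistics per column
```

Because `layer_norm` uses only the values of a single token, its result does not depend on the batch size or on the other samples in the batch.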

The expression classification unit 230 may compute the correlations between the patches in the ViT encoder, passing them through each Norm, the multi-head attention, and the MLP block, as in Equation 1. The output of the ViT encoder can then be applied to an MLP and a Softmax function to classify the facial expression (Equation 2).

[Equations 1 and 2: equation images not reproduced]

Here, T(·) is the ViT encoder and σ(·) is the MLP layer. θ is the learnable weight parameter of the MLP layer, and c is the number of classes.
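
The classification step (encoder output, then MLP, then Softmax over c classes) can be sketched as follows. This is a hedged illustration only: the MLP head is reduced to a single linear layer, the encoder output is assumed to be one vector (for example the class-token vector, as in the standard ViT), and the weights are hand-picked toy values rather than trained parameters. The class names are those listed in the document; `classify` and `softmax` are hypothetical helper names.

```python
import math
from typing import List, Tuple

EXPRESSIONS = ["anger", "contempt", "disgust", "fear", "happy", "sadness", "surprise"]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over the class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    encoder_output: List[float], weights: List[List[float]], biases: List[float]
) -> Tuple[str, List[float]]:
    """Linear head (one weight row per class) followed by Softmax."""
    logits = [
        sum(w * x for w, x in zip(row, encoder_output)) + b
        for row, b in zip(weights, biases)
    ]
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return EXPRESSIONS[best], probabilities


# Toy parameters: only the "happy" row reacts to the third encoder feature.
toy_weights = [[0.0, 0.0, 0.0] for _ in EXPRESSIONS]
toy_weights[EXPRESSIONS.index("happy")] = [0.0, 0.0, 1.0]
toy_biases = [0.0] * len(EXPRESSIONS)

label, probabilities = classify([0.5, -1.0, 2.0], toy_weights, toy_biases)
print(label)
```

The Softmax output is a probability distribution over the c = 7 classes, and the predicted expression is the class with the highest probability.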

FIG. 7 is a diagram showing an example configuration of the MLP used in the vision transformer-based facial expression recognition apparatus 100 according to an embodiment of the present invention. As shown in FIG. 7, the MLP may include a fully-connected layer and a GeLU (Gaussian Error Linear Unit) layer.
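
As a minimal sketch of such an MLP, the code below uses the exact GELU definition, x · Φ(x), where Φ is the standard normal CDF. The layer ordering (fully-connected, GELU, fully-connected) follows the common ViT design and is an assumption here, since FIG. 7 is not reproduced in this text; the weights are toy values and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def gelu(x: float) -> float:
    """Gaussian Error Linear Unit: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vector: Vector, weights: Matrix, biases: Vector) -> Vector:
    """Fully-connected layer: one weight row and one bias per output neuron."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, biases)]


def mlp_block(token: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Assumed ordering: fully-connected -> GELU -> fully-connected."""
    hidden = [gelu(h) for h in linear(token, w1, b1)]
    return linear(hidden, w2, b2)


# Toy example: expand 2 features to 4 hidden units, then project back to 2.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
b2 = [0.0, 0.0]
output = mlp_block([1.0, -2.0], w1, b1, w2, b2)
```

Unlike ReLU, GELU is smooth and lets small negative inputs pass through slightly attenuated instead of clipping them to zero.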

Meanwhile, the learning unit 110 may perform training using cross-entropy as the loss function. That is, the network is trained using the cross-entropy of Equation 3 as the loss function, where Yi is the ground-truth value of the i-th image.

[Equation 3: equation image not reproduced]
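
For reference, a cross-entropy loss over predicted class probabilities can be sketched as below. This shows the standard formulation, assuming one-hot ground-truth labels given as class indices and averaging over the images in a batch; the exact form of Equation 3 is not available in this text and may differ in notation. The function names are hypothetical.

```python
import math
from typing import List


def cross_entropy(probabilities: List[float], true_index: int, eps: float = 1e-12) -> float:
    """Negative log-probability assigned to the ground-truth class of one image."""
    return -math.log(max(probabilities[true_index], eps))


def mean_cross_entropy(
    batch_probabilities: List[List[float]], true_indices: List[int]
) -> float:
    """Average the per-image losses over a batch."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probabilities, true_indices)]
    return sum(losses) / len(losses)


uniform_loss = cross_entropy([0.25, 0.25, 0.25, 0.25], true_index=2)  # equals log(4)
confident_loss = cross_entropy([0.97, 0.01, 0.01, 0.01], true_index=0)
batch_loss = mean_cross_entropy([[0.7, 0.3], [0.2, 0.8]], [0, 1])
```

The loss is small when the model places high probability on the correct expression and grows without bound as that probability approaches zero, which is why the sketch clamps the probability at `eps`.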

Meanwhile, ViT requires a very large amount of training data, which makes pre-training important. The learning unit 110 of the present invention may therefore construct the facial expression recognition model 200 by combining a ResNet50 pre-trained on ImageNet-1k with a vision transformer obtained by further pre-training a pre-trained ViT-B/16 on ImageNet-21k. This pre-training contributes substantially to improved performance.

In addition, as shown in FIG. 5, the expression classification unit 230 may classify facial expressions into seven classes: anger, contempt, disgust, fear, happy, sadness, and surprise.

FIG. 8 is a diagram showing the flow of a vision transformer-based facial expression recognition method according to an embodiment of the present invention. As shown in FIG. 8, the method is a facial expression recognition method in which each step is performed by a computer, and may be implemented to include a step S110 of constructing the facial expression recognition model 200 by combining a CNN and a ViT, and a step S120 of recognizing a facial expression in an input image using the facial expression recognition model 200.

The present invention relates to a vision transformer-based facial expression recognition method and may be implemented as software recorded on hardware that includes a memory and a processor. For example, the method may be stored and implemented on a personal computer, a notebook computer, a server computer, a PDA, a smartphone, a tablet PC, an in-vehicle embedded computer, and the like. In the following, the entity performing each step may be omitted for convenience of description.

In step S110, the learning unit 110 may construct the facial expression recognition model 200 by training, on a training dataset of facial expressions, a model that includes: the feature extraction unit 210 that extracts a feature map from a face image using a CNN; the patch construction unit 220 that divides the feature map into patches to form embedding patches; and the expression classification unit 230 that classifies the facial expression by inputting the embedding patches to the vision transformer.

In step S120, the recognition unit 120 may recognize a facial expression in the input image using the facial expression recognition model 200. FIG. 9 is a diagram showing the flow of the process of recognizing a facial expression in an input image using the vision transformer-based facial expression recognition method according to an embodiment of the present invention. As shown in FIG. 9, step S120 may include: a step S121 in which the feature extraction unit 210 extracts a feature map from the input image using a CNN; a step S122 in which the patch construction unit 220 divides the feature map into patches to form embedding patches; and a step S123 in which the expression classification unit 230 classifies the facial expression by inputting the embedding patches to the vision transformer.
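
The order of sub-steps S121 to S123 can be sketched as a simple function composition. Only the control flow is taken from the description; the three stand-in functions are deliberately trivial, hypothetical placeholders for the trained ResNet50 backbone, the patch construction, and the ViT classifier, and say nothing about how those components actually behave.

```python
from typing import Callable, List, Sequence

Image = Sequence[Sequence[float]]
FeatureMap = List[List[float]]
Patches = List[List[float]]


def recognize_expression(
    image: Image,
    extract_features: Callable[[Image], FeatureMap],
    build_patches: Callable[[FeatureMap], Patches],
    classify: Callable[[Patches], str],
) -> str:
    """Run the three recognition sub-steps in order."""
    feature_map = extract_features(image)  # S121: CNN feature extraction
    patches = build_patches(feature_map)   # S122: embedding patch construction
    return classify(patches)               # S123: ViT-based classification


# Hypothetical stand-ins, used only to make the sketch executable.
def toy_extract(image: Image) -> FeatureMap:
    return [list(row) for row in image]


def toy_patches(feature_map: FeatureMap) -> Patches:
    return [row[:] for row in feature_map]  # one "patch" per row


def toy_classify(patches: Patches) -> str:
    values = [v for patch in patches for v in patch]
    return "happy" if sum(values) / len(values) > 0.5 else "sadness"


result = recognize_expression([[0.9, 0.8], [0.7, 0.6]], toy_extract, toy_patches, toy_classify)
```

Passing the stages in as callables keeps the recognition flow independent of any particular backbone or classifier implementation.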

The details of each step have been fully described above in connection with the vision transformer-based facial expression recognition device 100 according to an embodiment of the present invention, so a detailed description is omitted here.

Experiment

In this experiment, CK+ (the Extended Cohn-Kanade Dataset) was used from among the various FER datasets. CK+ provides 981 facial expression images, which were resized to 448×448 for training. ResNet50 was used as the backbone network. The batch size was set to 16, the number of steps to 10,000, and the learning rate to 3e-2.
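The reported settings can be collected into a small configuration object to see what they imply about the length of training. This is only a sketch: the class name `TrainConfig` and its fields are hypothetical, and the epoch estimate assumes that all 981 images are used for training, which the text does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters reported for the CK+ experiment."""
    dataset_size: int = 981      # images provided by CK+
    image_size: int = 448        # images resized to 448 x 448
    batch_size: int = 16
    total_steps: int = 10_000
    learning_rate: float = 3e-2

    @property
    def samples_seen(self) -> int:
        """Total images processed over the whole run."""
        return self.batch_size * self.total_steps

    @property
    def steps_per_epoch(self) -> int:
        """Steps for one pass, assuming every image is used for training."""
        return math.ceil(self.dataset_size / self.batch_size)

    @property
    def approx_epochs(self) -> float:
        """Approximate number of passes over the dataset."""
        return self.samples_seen / self.dataset_size


config = TrainConfig()
print(config.samples_seen)             # 160000
print(config.steps_per_epoch)          # 62
print(round(config.approx_epochs, 1))  # 163.1
```

If a held-out test split is used, the training set is smaller and the number of passes over it is correspondingly larger.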

FIG. 10 compares the vision transformer-based facial expression recognition device 100 and method according to an embodiment of the present invention with recent FER methods. As shown in FIG. 10, when the proposed device and method and other state-of-the-art (SOTA) facial expression recognition methods were evaluated on the CK+ dataset, the proposed method improved performance by about 0.18% over FDRL, the best-performing of the SOTA methods. In addition, when facial expressions were recognized with ViT alone, without ResNet50, performance was 0.97% lower than that of the present invention. These results indicate that ViT is effective for image classification or recognition based on global features, but is less effective for classification problems that require detecting changes in local features, such as facial expressions. In the present invention, however, adding ResNet50 as a preprocessing stage makes it possible to obtain features that emphasize the regions that change most with facial expression, and the performance of ViT is thereby improved.

As described above, according to the vision transformer-based facial expression recognition device 100 and method proposed in the present invention, fine changes in facial expression are extracted as a feature map using a CNN, and the extracted feature map is input to a vision transformer to recognize the facial expression. In this way, the vision transformer can be successfully applied to facial expression recognition, and subtle changes in facial expression can be detected to improve recognition performance.

Meanwhile, the present invention may include a computer-readable medium containing program instructions for performing operations implemented by various communication terminals. For example, the computer-readable medium may include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Such a computer-readable medium may contain program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable medium may be specially designed and configured to implement the present invention, or may be known to and usable by those skilled in computer software. For example, they may include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The present invention described above may be modified or applied in various ways by those of ordinary skill in the art to which it belongs, and the scope of the technical idea of the present invention should be determined by the claims below.

100: facial expression recognition device
110: learning unit
120: recognition unit
200: facial expression recognition model
210: feature extraction unit
220: patch construction unit
230: expression classification unit
S110: Constructing a facial expression recognition model by combining CNN and ViT
S120: Recognizing a facial expression from an input image using a facial expression recognition model
S121: Extracting a feature map from an input image using CNN
S122: Constructing an embedding patch by dividing the feature map into patches
S123: Classifying facial expressions by inputting the embedding patch to the vision transformer

Claims

A vision transformer-based facial expression recognition device (100), comprising:
a learning unit (110) that constructs a facial expression recognition model (200) by combining a CNN and a Vision Transformer (ViT); and
a recognition unit (120) that recognizes a facial expression in an input image using the facial expression recognition model (200),
wherein the learning unit (110) constructs the facial expression recognition model (200) by training, on a training dataset of facial expressions, a model comprising:
a feature extraction unit (210) that extracts a feature map from a face image using a CNN;
a patch construction unit (220) that divides the feature map into patches to form embedding patches; and
a facial expression classification unit (230) that classifies the facial expression by inputting the embedding patches into a vision transformer.

The device of claim 1, wherein the feature extraction unit (210) extracts the feature map, which represents changes in facial expression, from the face image using the CNN-based ResNet50.

The device of claim 2, wherein the learning unit (110) constructs the facial expression recognition model (200) using a vision transformer that is additionally pre-trained by combining ResNet50 pre-trained on ImageNet-1k with ViT-B/16 pre-trained on ImageNet-21k.

The device of claim 3, wherein the learning unit (110) performs training using cross-entropy as the loss function.

The device of claim 1, wherein the vision transformer consists of a ViT encoder that includes multi-head attention and a multi-layer perceptron (MLP), and computes the correlation between the embedding patches.

The device of claim 5, wherein the ViT encoder applies Layer-Norm before each of the multi-head attention and the multi-layer perceptron.

The device of claim 5, wherein the facial expression classification unit (230) classifies the facial expression by applying the output of the ViT encoder to an MLP and a Softmax function.

The device of claim 1, wherein the facial expression classification unit (230) classifies facial expressions into seven categories including anger, contempt, disgust, fear, happy, sadness, and surprise.

A vision transformer-based facial expression recognition method in which each step is performed by a computer, comprising:
(1) constructing a facial expression recognition model (200) by combining a CNN and a Vision Transformer (ViT); and
(2) recognizing a facial expression in an input image using the facial expression recognition model (200),
wherein, in step (1), the facial expression recognition model (200) is constructed by training, on a training dataset of facial expressions, a model comprising:
a feature extraction unit (210) that extracts a feature map from a face image using a CNN;
a patch construction unit (220) that divides the feature map into patches to form embedding patches; and
a facial expression classification unit (230) that classifies the facial expression by inputting the embedding patches into a vision transformer.

The method of claim 9, wherein the feature extraction unit (210) extracts the feature map, which represents changes in facial expression, from the face image using the CNN-based ResNet50.

The method of claim 10, wherein, in step (1), the facial expression recognition model (200) is constructed using a vision transformer that is additionally pre-trained by combining ResNet50 pre-trained on ImageNet-1k with ViT-B/16 pre-trained on ImageNet-21k.

The method of claim 11, wherein, in step (1), training is performed using cross-entropy as the loss function.

The method of claim 9, wherein the vision transformer consists of a ViT encoder that includes multi-head attention and a multi-layer perceptron (MLP), and computes the correlation between the embedding patches.

The method of claim 13, wherein the ViT encoder applies Layer-Norm before each of the multi-head attention and the multi-layer perceptron.

The method of claim 13, wherein the facial expression classification unit (230) classifies the facial expression by applying the output of the ViT encoder to an MLP and a Softmax function.

The method of claim 9, wherein the facial expression classification unit (230) classifies facial expressions into seven categories including anger, contempt, disgust, fear, happy, sadness, and surprise.