KR102090171B1

KR102090171B1 - Video-based human emotion recognition using semi-supervised learning and multimodal networks

Info

Publication number: KR102090171B1
Application number: KR1020180043342A
Authority: KR
Inventors: 송병철; 김대하; 최동윤
Original assignee: 인하대학교 산학협력단
Priority date: 2018-04-13
Filing date: 2018-04-13
Publication date: 2020-03-17
Also published as: KR20190119863A

Abstract

Disclosed is a video-based human emotion recognition technique using semi-supervised learning and a plurality of multimodal networks. A video-based human emotion recognition method according to one embodiment may include: inputting at least one signal among image data, facial feature point data, and voice data present in a video into a deep learning network configured on the basis of semi-supervised learning and a plurality of multimodal networks for human emotion recognition; and recognizing the emotion of a person in the video by adaptively fusing the respective probability information obtained by analyzing the at least one signal input to the deep learning network.

Description

VIDEO-BASED HUMAN EMOTION RECOGNITION USING SEMI-SUPERVISED LEARNING AND MULTIMODAL NETWORKS

The following description relates to a technique for performing emotion recognition on video data using a deep learning network.

The field of human emotion recognition is advancing rapidly, and the use of deep learning to capture the facial expressions of diverse people has made it possible to identify emotions more efficiently. Moreover, analyzing an image sequence reveals a person's mood, which cannot be obtained from a single image, and tracking how the expression changes over time enables more effective emotion recognition. For these reasons, research is moving from single-image emotion recognition toward video-based (image-sequence-based) emotion recognition.

Challenges in the field of emotion recognition have also been held actively in recent years. FIG. 1 illustrates the Convolutional 3D Hybrid Network. This algorithm, proposed by Iqiyi, a Chinese video streaming company, analyzed difficult video clips efficiently. The network in FIG. 1 analyzes the emotion of a person in a video using image-sequence-based, single-image-based, and voice-signal-based algorithms together. As shown in FIG. 1, the person's face is extracted from the video and preprocessed (for example, by histogram equalization), and then a CNN-RNN and a Convolutional 3D network are applied.

The CNN-RNN network is essentially a single-image-based network. A pretrained VGG16 (Visual Geometry Group 16) deep learning network was fine-tuned, after which an LSTM network was trained. A Convolutional 3D network was then used to take image sequence information into account. Together with the network for voice signal analysis, a total of three networks were trained, and an ensemble was formed from the information produced by each network. However, facial feature points were not used, and the voice signal was analyzed only with a simple SVM with a linear kernel, so the approach is clearly limited in accounting for the background surrounding the video.

FIG. 2 illustrates the Parallel CNN Network. This algorithm, proposed by Microsoft, concentrates on extracting as many image features as possible from single-image information alone. Because it uses neither image sequence information, nor the voice in the video, nor facial feature point information, it can hardly be called a multimodal network, and its performance is limited accordingly. It also relies on very heavy deep learning networks. In this network, emotion recognition is performed in two steps.

In the first step of FIG. 2, a total of three deep learning networks (for example, VGG13, VGG16, and ResNet91) are trained in parallel, the features they produce are normalized, and the feature vectors are concatenated. The concatenated network is then trained with a softmax classification function. In the second step, the sufficiently trained network is fine-tuned, and the feature vector immediately before the softmax (2304 dimensions in total) is extracted. Statistical encoding (STAT encoding) is then performed over the feature vectors of all frames in a video sequence, yielding a 9216-dimensional video feature vector, which is passed to a Support Vector Machine (SVM) for the final emotion recognition. Achieving high performance from single images alone is very encouraging, but the approach has the drawback that recognition performance drops markedly on video clips in which the person's expression changes very little.
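The step from 2304-dimensional frame features to a 9216-dimensional video feature can be sketched as follows. The source gives only the two dimensions; the choice of four statistics per dimension (mean, variance, minimum, maximum) is an assumption that happens to match the factor of four, and the function name `stat_encode` is hypothetical.

```python
from statistics import fmean, pvariance

def stat_encode(frame_features):
    """Collapse per-frame feature vectors into one video-level vector.

    frame_features: list of equal-length lists, one per frame.
    Returns the concatenation [means, variances, minima, maxima].
    The four statistics are an assumption, not taken from the source.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    columns = list(zip(*frame_features))  # one tuple per feature dimension
    means = [fmean(c) for c in columns]
    variances = [pvariance(c) for c in columns]
    minima = [min(c) for c in columns]
    maxima = [max(c) for c in columns]
    return means + variances + minima + maxima

# Five frames with 2304 hypothetical feature values each.
frames = [[float(f + d) for d in range(2304)] for f in range(5)]
video_vector = stat_encode(frames)
print(len(video_vector))  # 4 * 2304 = 9216
```

Whatever the clip length, the encoded vector has a fixed size, which is what allows a single SVM to classify videos with different frame counts.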

A video-based human emotion recognition method and system that are efficient and capable of analyzing even subtle facial expressions can be provided.

A video-based human emotion recognition method and system using semi-supervised learning and a plurality of multimodal networks can be provided.

A video-based human emotion recognition method may include: inputting at least one signal among image data, facial feature point data, and voice data present in a video into a deep learning network configured on the basis of semi-supervised learning and a plurality of multimodal networks for human emotion recognition; and recognizing the emotion of a person in the video by adaptively fusing the respective probability information obtained by analyzing the at least one signal input to the deep learning network.

The inputting into the deep learning network may include configuring an image-based network, a facial-feature-point-based network, and a video-voice-signal-based network in order to analyze the emotion of a person in the video from the image data, facial feature point data, and voice data present in the video.

The inputting into the deep learning network may include obtaining image features from the image data present in the video using at least one deep learning network among S3DAE, C3DA, and Parallel CNN, and classifying the person's emotion by applying an SVM or Softmax to the obtained image features.

The inputting into the deep learning network may include obtaining a two-dimensional feature vector based on changes in the relative distances of the respective feature points of the person's face across consecutive frames of the video, and classifying the person's emotion by inputting the obtained two-dimensional feature vector into a CNN-LSTM network.

The inputting into the deep learning network may include analyzing the person's emotion from the atmosphere and background sound in the video by applying at least one voice-based network among NN, CNN, and LSTM to the voice signal in the video.

The recognizing of the emotion of the person in the video may include applying weights, based on a preset criterion, to the respective probability information obtained by analyzing the at least one signal input to the deep learning network.

The recognizing of the emotion of the person in the video may include classifying the emotion of the person in the video into seven emotions, namely anger, disgust, fear, happiness, sadness, surprise, and neutral, and deriving the classified emotion as a quantitative value.

A video-based human emotion recognition system may include: an input unit that inputs at least one signal among image data, facial feature point data, and voice data present in a video into a deep learning network configured on the basis of semi-supervised learning and a plurality of multimodal networks for human emotion recognition; and a recognition unit that recognizes the emotion of a person in the video by adaptively fusing the respective probability information obtained by analyzing the at least one signal input to the deep learning network.

The present invention can efficiently identify the emotion of a person in a video clip by making full use of semi-supervised learning and the multimodal information present in the video. Specifically, image-based, facial-feature-point-based, and video-voice-signal-based networks are configured so that changes in the person's expression can be analyzed. When facial information is difficult to obtain because the face is occluded or the lighting is dark, facial feature points are used as auxiliary information to distinguish the emotion. When the expression changes little or the emotion cannot be read from the expression, the video's voice signal is used to analyze the background sound and the voices of surrounding people, and thereby to determine the emotion of the person in the video.

In addition, the present invention can implement robust emotion recognition by identifying a person's emotion from the images of the video data, the person's facial information, and the voice signal in the video.

In addition, the present invention can identify emotions in a video more accurately by adaptively fusing the information from seven networks.

FIGS. 1 and 2 are diagrams illustrating conventional human emotion recognition techniques.
FIG. 3 is a diagram illustrating the overall operation of a human emotion recognition system according to an embodiment.
FIG. 4 is a diagram illustrating detailed network information of the human emotion recognition system according to an embodiment.
FIG. 5 is a block diagram illustrating the configuration of the human emotion recognition system according to an embodiment.
FIG. 6 is a diagram illustrating the human emotion recognition method of the human emotion recognition system according to an embodiment.
FIGS. 7 to 12 are examples illustrating the image-based networks of the human emotion recognition system according to an embodiment.
FIG. 13 is an example illustrating the facial-feature-point-based network of the human emotion recognition system according to an embodiment.
FIG. 14 is an example illustrating the voice-based networks of the human emotion recognition system according to an embodiment.
FIG. 15 is a diagram illustrating a confusion matrix over seven labels in the human emotion recognition system according to an embodiment.
FIG. 16 is a diagram showing the framework of the emotion recognition API in the human emotion recognition system according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings.

FIG. 3 is a diagram illustrating the overall operation of a human emotion recognition system according to an embodiment.

The human emotion recognition system can perform more effective emotion recognition by applying a variety of deep learning networks to video data. To analyze the emotion of a person in a video from the image data, facial feature points, and voice data present in the video, the system may configure an image-based network, a facial-feature-point-based network, and a voice-based network. The system may use at least one signal among the image data, the facial feature point data of the images, and the voice data as input to the deep learning network.

The human emotion recognition system may input at least one signal among the image data, facial feature point data, and voice data present in the video frames into a deep learning network configured on the basis of semi-supervised learning and a plurality of multimodal networks for human emotion recognition. Specifically, the system may input the image data, the facial feature points, and the voice data into separate deep learning networks. For example, the system may input images into the image-based deep learning network, facial feature points into the facial-feature-point-based network, and voice data into the video-voice-based network.

For the image-based deep learning network, the human emotion recognition system may configure a deep learning network that takes time-series information into account on the basis of a continuous image sequence. The system may also analyze changes in emotion by extending the facial feature points from a one-dimensional space to a two-dimensional space. To control the noise and bandwidth of the voice data, the system may extract fine-grained one-dimensional signals using the open-source libraries Librosa and openSMILE. Based on the extracted one-dimensional signals, the system may configure networks using a Neural Network (NN), a Convolutional Neural Network (CNN), and a Long Short-Term Memory (LSTM), respectively. Finally, the system may determine the emotion of a person in a video by adaptively fusing the probability information from each of the networks built on the three signals: image data, facial feature point data, and voice data.

The human emotion recognition system can efficiently analyze changes in a person's facial expression through the image-based networks. If facial information is difficult to obtain from the images in the video because the face is occluded or the lighting is dark, facial feature points can be used as auxiliary information to distinguish the person's emotion. If the person's expression changes little, or the emotion cannot be read from the expression, the video's voice data can be used to analyze the background sound and the voices of surrounding people and thereby determine the emotion of the person in the video.

A total of three image-based networks are used. First, emotion classification can be performed with a Convolutional 3D network, a representative CNN based on time-series information. Second, emotion classification can be performed with a network that combines a semi-supervised learning network with the Convolutional 3D network; this is expected to have a regularizing effect on the network and thereby yield higher emotion recognition performance. Third, a Parallel network can be configured in which a single-image-based CNN, rather than a time-series-based one, extracts image features and a Support Vector Machine (SVM) then performs emotion classification. Of the seven proposed networks, this is the only one that uses an SVM rather than a Softmax function as its classifier.

The facial-feature-point-based network may obtain features from a total of 64 feature points in the facial information of a person in the video. Unlike conventional one-dimensional features, the facial feature point information is extended into a two-dimensional feature vector and used as input to the deep learning network. The resulting two-dimensional facial feature point features are then fed to a CNN-LSTM network to classify the person's emotion.

The voice data in a video can be used to analyze the emotion in the video. Here the focus is on capturing surrounding background sound information rather than the person's facial expression. For video clips with little change in expression, where the image-based networks cannot obtain the person's emotion information, using the voice-based networks makes it possible to identify the emotion in the video more effectively. Three networks (NN, CNN, and LSTM) may be used for the voice data.

The human emotion recognition system can analyze the emotion of a person in a video by efficiently analyzing multimodal information, rather than unimodal information, with a variety of networks. In addition, the information from all seven networks can be fused adaptively, so that the probabilities of networks that strongly affect emotion recognition performance are weighted more heavily and those of networks with less effect are weighted less. The proposed deep learning network thus classifies a person's emotion by making efficient use of the information in the video.

FIG. 4 illustrates detailed network information of the human emotion recognition system. The system may configure an image-based network (410), a facial-feature-point-based network (420), and a video-voice-based network (430). For the image-based network (410), the system may configure a deep learning network that takes time-series information into account on the basis of a continuous image sequence. FIG. 7 shows 2D convolution and 3D convolution; before the image-based networks are described, the network on which the present invention is based is described first. Convolutional 3D (C3D) does not stop at the conventional convolution over two-dimensional information but also convolves along the depth of the data, that is, it performs convolution over three dimensions in total. This makes it possible to build a convolutional network that takes time-series information into account. The system may use the image-based networks described with reference to FIGS. 8 to 12.
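The difference between 2D and 3D convolution can be made concrete with a minimal, dependency-free sketch. It assumes a single-channel clip indexed as [frame][row][column], stride 1, and no padding; the function name `conv3d_valid` and the toy kernel are hypothetical and are not taken from the source.

```python
def conv3d_valid(volume, kernel):
    """Valid (no padding, stride 1) 3D cross-correlation, as used in CNN layers.

    volume: nested list indexed [t][y][x] (frames, height, width).
    kernel: nested list indexed [dt][dy][dx].
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The kernel also slides along the frame axis, which is
                # what lets the filter respond to change over time.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

# Four frames of 3x3 pixels; every pixel in frame t has the value t.
clip = [[[float(t)] * 3 for _ in range(3)] for t in range(4)]
# Temporal-difference kernel: depth 2 along time, 1x1 spatially.
kernel = [[[-1.0]], [[1.0]]]
result = conv3d_valid(clip, kernel)
print(len(result), len(result[0]), len(result[0][0]))  # 3 3 3
print(result[0][0][0])  # 1.0, the brightness increase between frames 0 and 1
```

A 2D convolution applied frame by frame could not produce this output, because each of its results depends on one frame only.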

FIG. 8 shows the Global Average Pooling of NIN. A Convolutional 3D network can be used as the image-based network. The image-based network boldly removes the fully connected layers, which hold a large number of parameters in the Convolutional 3D structure and are a cause of overfitting. The Global Average Pooling of Network In Network (NIN) is applied instead, followed by Softmax as the classification function.
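A minimal sketch of this classification head, assuming the last convolution produces one feature map per emotion class (a common Network In Network arrangement, not stated in the source) and that each map has been flattened to a list. The activation values are invented for illustration.

```python
import math

def global_average_pool(feature_maps):
    """Reduce each channel to the mean of its activations.

    feature_maps: list of channels, each a flat list of activations
    (a T x H x W volume flattened for brevity).
    """
    return [sum(channel) / len(channel) for channel in feature_maps]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Seven channels, one per emotion class, four activations each (hypothetical).
maps = [[0.1, 0.2, 0.1, 0.0],
        [0.0, 0.0, 0.1, 0.1],
        [0.3, 0.2, 0.1, 0.2],
        [2.0, 1.5, 1.8, 1.9],
        [0.4, 0.3, 0.5, 0.4],
        [0.2, 0.1, 0.2, 0.1],
        [0.5, 0.6, 0.4, 0.5]]
pooled = global_average_pool(maps)  # seven values, no trainable parameters
probs = softmax(pooled)
print(probs.index(max(probs)))      # index of the most likely class
```

Because the pooling step has no weights, it cannot overfit, which is the motivation the text gives for dropping the fully connected layers.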

FIG. 9 shows an image-based network in which an auxiliary path is added to Convolutional 3D. Because Convolutional 3D also convolves along the depth dimension, it can learn from frame-to-frame information, but because the network is large it is also highly prone to the vanishing gradient problem. To address this, FIG. 9 shows an example in which the auxiliary path first used in Google's 2015 paper 'Going deeper with convolutions' is added to Convolutional 3D. Attaching an auxiliary path in the middle of the network partially mitigates the vanishing gradient problem during training.

Referring to FIG. 10, the human emotion recognition system may configure Convolutional 3D with auxiliary network (C3DA) as an image-based network. FIG. 10 shows the overall framework of C3DA. Compared with the existing C3D, C3DA has two distinguishing features. First, the auxiliary path helps the training gradient flow smoothly, which enables better optimization of the network. Second, Global Pooling is used to mitigate overfitting.

Referring to FIG. 11, the human emotion recognition system may configure S3DAE as an image-based network. FIG. 11 shows the overall framework of S3DAE. S3DAE adds an autoencoder to the existing Convolutional 3D to form a network based on the semi-supervised learning concept. The autoencoder is included to aid the training of the convolution layers labeled 1110 in FIG. 11 of the existing Convolutional 3D. Binary cross-entropy may be used as the loss function of the autoencoder.
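One way to read the semi-supervised idea is sketched below: every clip contributes a binary cross-entropy reconstruction loss, and only labeled clips add a classification loss. The source names binary cross-entropy for the autoencoder but does not say how the losses are combined, so the additive form, the `recon_weight` parameter, and the function names are all assumptions.

```python
import math

def binary_cross_entropy(target, reconstruction, eps=1e-7):
    """Mean binary cross-entropy between pixel targets in [0, 1] and a reconstruction."""
    total = 0.0
    for t, r in zip(target, reconstruction):
        r = min(max(r, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * math.log(r) + (1.0 - t) * math.log(1.0 - r)
    return total / len(target)

def categorical_cross_entropy(probs, label, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -math.log(max(probs[label], eps))

def s3dae_style_loss(pixels, reconstruction, probs=None, label=None, recon_weight=1.0):
    """Reconstruction loss for every clip, plus classification loss when a label exists."""
    loss = recon_weight * binary_cross_entropy(pixels, reconstruction)
    if label is not None:
        loss += categorical_cross_entropy(probs, label)
    return loss

pixels = [1.0, 0.0, 1.0, 0.0]   # hypothetical normalized input pixels
recon = [0.9, 0.2, 0.8, 0.1]    # hypothetical decoder output
class_probs = [0.1, 0.1, 0.1, 0.4, 0.1, 0.1, 0.1]

unlabeled = s3dae_style_loss(pixels, recon)
labeled = s3dae_style_loss(pixels, recon, probs=class_probs, label=3)
print(unlabeled < labeled)  # True: the labeled clip adds a classification term
```

Under this reading, unlabeled clips still shape the shared convolution layers through the reconstruction term.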

FIG. 12 shows the basic module of Wide ResNet. As an image-based network, the Parallel network may use the better-performing Wide Residual Network in place of ResNet-91.

FIG. 13 illustrates the process of converting facial feature point information into a two-dimensional feature vector. The facial feature point information can be obtained as a two-dimensional feature vector through the facial-feature-point-based network. Specifically, the human emotion recognition system may obtain facial feature points by combining the landmark information of the face across consecutive frames. Unlike conventional one-dimensional feature vectors, the system may obtain, through the facial-feature-point-based network, a two-dimensional feature vector based on changes in the relative distances of the respective feature points across consecutive frames. By using the two-dimensional feature vector instead of a one-dimensional facial feature point vector, the system can analyze changes in the expression of each part of the face with higher accuracy. The two-dimensional feature vector may be obtained through Equation 1 below.

Equation 1:

The human emotion recognition system may perform emotion classification in an existing CNN-LSTM network using the two-dimensional feature vector obtained through Equation 1.
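Equation 1 itself is not reproduced in this text, so the following is only one plausible reading of "changes in the relative distances of the feature points across consecutive frames", not the patent's formula. It assumes the feature for a pair of consecutive frames is the matrix of changes in pairwise Euclidean distance between feature points; the function names are hypothetical.

```python
import math

def pairwise_distances(landmarks):
    """N x N matrix of Euclidean distances between (x, y) feature points."""
    return [[math.dist(p, q) for q in landmarks] for p in landmarks]

def distance_change_map(prev_landmarks, curr_landmarks):
    """Element-wise change in pairwise distances between two consecutive frames."""
    prev_d = pairwise_distances(prev_landmarks)
    curr_d = pairwise_distances(curr_landmarks)
    return [[c - p for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev_d, curr_d)]

# Two feature points for brevity; 64 points would give a 64 x 64 map.
prev_frame = [(0.0, 0.0), (3.0, 4.0)]
curr_frame = [(0.0, 0.0), (6.0, 8.0)]
change = distance_change_map(prev_frame, curr_frame)
print(change)  # [[0.0, 5.0], [5.0, 0.0]]
```

A map of this kind is two-dimensional per frame pair, so a sequence of such maps has the image-like shape that a CNN-LSTM expects.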

FIG. 14 illustrates the voice-based networks. For video clips showing fear or sadness, for example, a person's emotion information can be obtained from the surrounding atmosphere and background sound rather than from the person's facial expression. Accordingly, to obtain a variety of voice feature vectors, the voice data of the video can be extracted using the openSMILE toolkit and the Librosa Python library. As voice-based networks, a simple Deep Neural Network (DNN) built from fully connected layers, a 1D CNN using 1D convolution, and a Long Short-Term Memory (LSTM), which is widely used for time-series analysis, can be used.

FIG. 15 illustrates a confusion matrix over the seven labels in the human emotion recognition system. The system may recognize the emotion of a person in a video by adaptively fusing the respective probability information obtained by analyzing the at least one signal input to the deep learning network. The system may apply a network weight to each piece of probability information according to how strongly it affects emotion recognition. The system may adaptively fuse the emotion probabilities of the seven networks through a kind of ensemble technique.

Here, W denotes the weight and S denotes the score. The weight of each network may be determined based on the per-emotion probability information of that network. The final emotion analysis may be performed through a weighted sum of the emotion scores and the weights of the respective networks.
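The weighted fusion described above can be sketched as follows. This is only an illustrative sketch: the description does not give the exact rule for choosing the weights, so `confidence_weights` (each network weighted by its peak probability) is an assumption, as are the function names and the use of three example networks instead of seven.

```python
# Illustrative sketch of score-level fusion; not the patent's exact formula.
EMOTIONS = ["anger", "disgust", "fear", "happiness",
            "sadness", "surprise", "neutral"]


def confidence_weights(scores):
    """Assumed weighting rule: weight each network by its highest
    emotion probability, then normalize the weights to sum to 1."""
    peaks = [max(s) for s in scores]
    total = sum(peaks)
    return [p / total for p in peaks]


def fuse_scores(scores, weights):
    """Weighted sum of per-network scores S with weights W,
    renormalized so the fused values sum to 1."""
    if len(scores) != len(weights):
        raise ValueError("one weight is required per network")
    fused = [0.0] * len(EMOTIONS)
    for s, w in zip(scores, weights):
        for k, p in enumerate(s):
            fused[k] += w * p
    total = sum(fused)
    return [v / total for v in fused]


def top_emotion(fused):
    """Return the label with the highest fused score."""
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best]


if __name__ == "__main__":
    # Hypothetical per-network probabilities (each row sums to 1).
    scores = [
        [0.05, 0.00, 0.05, 0.80, 0.05, 0.00, 0.05],  # image-based network
        [0.10, 0.05, 0.10, 0.50, 0.10, 0.05, 0.10],  # landmark-based network
        [0.10, 0.10, 0.20, 0.20, 0.30, 0.05, 0.05],  # voice-based network
    ]
    fused = fuse_scores(scores, confidence_weights(scores))
    for name, value in zip(EMOTIONS, fused):
        print(f"{name}: {value:.1%}")
    print("predicted:", top_emotion(fused))
```

Because the fused vector is renormalized, it can be reported directly as percentages per emotion.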

The human emotion recognition system may classify the emotion of a person in the video into seven emotions: anger, disgust, fear, happiness, sadness, surprise, and neutral. Specifically, the system may derive a quantitative value for each classified emotion. For example, the system may produce a quantitative output such as happiness 92%, sadness 5%, and anger 3%.

The human emotion recognition system supports end-to-end training based on deep learning networks. Compared with existing networks, the deep learning network proposed for the system makes better use of audio information and can thereby improve recognition accuracy for video clips expressing sadness and fear. In addition, it can analyze the emotion of a person in a video by making full use of multi-modal information.

FIG. 16 is a diagram illustrating the framework of an emotion recognition API in a human emotion recognition system according to an embodiment.

The human emotion recognition system may provide a real-time API of a lightweight emotion recognition algorithm. Based on this algorithm, an emotion recognition API with near-real-time performance may be implemented. The overall API framework may use multithreading and Secure Shell (SSH) communication to carry out the near-real-time emotion recognition process.

The proposed API may provide three functions. First, a face may be detected by preprocessing the input data stream. Second, identification information (ID) of the person may be determined based on the detected face. Third, the emotion shown on the person's face may be recognized in consideration of the surrounding environment. For example, the result 'happiness' may be output.
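The three API functions and the multithreaded processing can be sketched roughly as below. This is a minimal sketch under stated assumptions: `run_pipeline` and its three callable parameters are hypothetical names, the stages are passed in as plain functions rather than real detectors or models, and the SSH communication mentioned in the description is left out.

```python
# Minimal sketch of a three-stage, threaded recognition pipeline.
# All names are hypothetical; the stages are supplied by the caller.
import queue
import threading

_STOP = object()  # sentinel that tells the worker to finish


def run_pipeline(frames, detect_face, identify, recognize_emotion):
    """Feed frames to a worker thread and collect (person_id, emotion)
    pairs in arrival order. Frames without a detected face are skipped."""
    inbox = queue.Queue()
    results = []

    def worker():
        while True:
            frame = inbox.get()
            if frame is _STOP:
                break
            face = detect_face(frame)                 # function 1: preprocessing and face detection
            if face is None:
                continue
            person_id = identify(face)                # function 2: person identification
            emotion = recognize_emotion(face, frame)  # function 3: emotion recognition
            results.append((person_id, emotion))

    thread = threading.Thread(target=worker)
    thread.start()
    for frame in frames:
        inbox.put(frame)
    inbox.put(_STOP)
    thread.join()
    return results


if __name__ == "__main__":
    # Toy stand-ins for the three stages, for demonstration only.
    frames = [
        {"face": "alice", "mood": "happiness"},
        {"face": None},
        {"face": "bob", "mood": "sadness"},
    ]
    print(run_pipeline(
        frames,
        detect_face=lambda frame: frame["face"],
        identify=lambda face: "id-" + face,
        recognize_emotion=lambda face, frame: frame["mood"],
    ))
```

A single worker and a FIFO queue keep results in frame order; a real deployment might use one thread per stage to overlap capture and inference.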

FIG. 5 is a block diagram illustrating the configuration of a human emotion recognition system according to an embodiment, and FIG. 6 is a diagram illustrating the emotion recognition method of the human emotion recognition system according to an embodiment.

The human emotion recognition system 100 may include an input unit 510 and a recognition unit 520. These components may be representations of different functions performed by a processor according to control instructions provided by program code stored in the human emotion recognition system 100. The components may control the human emotion recognition system 100 to perform steps 610 to 620 included in the human emotion recognition method of FIG. 6. The components may be implemented to execute instructions according to the code of an operating system included in a memory and the code of at least one program.

The processor of the human emotion recognition system 100 may load into the memory the program code stored in a program file for the human emotion recognition method. For example, when the program is executed in the human emotion recognition system 100, the processor may control the system to load the program code from the program file into the memory under the control of the operating system. The processor, and each of the input unit 510 and the recognition unit 520 included in the processor, may be different functional representations of the processor that execute the instructions of the corresponding portions of the loaded program code in order to perform the subsequent steps 610 to 620.

In step 610, the input unit 510 may input at least one signal among image data, facial feature point data, or voice data present in the video into a deep learning network configured based on semi-supervised learning for human emotion recognition and a plurality of multi-modal networks. The input unit 510 may construct an image-based network, a facial-feature-point-based network, and a video-audio-signal-based network to analyze the emotion of a person present in the video from the image data, facial feature point data, and voice data present in the video. The input unit 510 may obtain image features from the image data present in the video by using at least one deep learning network among S3DAE, C3DA, or Parallel CNN, and may classify the emotion of the person by applying an SVM or Softmax to the obtained image features. The input unit 510 may obtain a two-dimensional feature vector based on changes in the relative distances between feature points of the person's face across consecutive frames of the video, and may classify the emotion of the person by inputting the obtained two-dimensional feature vector into a CNN-LSTM network. The input unit 510 may analyze the emotion of the person from the atmosphere and background sound in the video by applying at least one voice-based network among NN, CNN, and LSTM to the audio signal in the video.
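One possible reading of the two-dimensional landmark feature can be sketched as below. The description does not specify how the relative distances are arranged, so the layout used here (rows are frame transitions, columns are landmark pairs) and the function names are assumptions made for illustration; the CNN-LSTM that would consume the matrix is not shown.

```python
# Illustrative sketch: a 2D (time x landmark-pair) feature matrix built from
# changes in pairwise landmark distances. The layout is an assumption.
import math
from itertools import combinations


def pairwise_distances(landmarks):
    """Euclidean distance for every unordered pair of (x, y) landmarks."""
    return [math.dist(p, q) for p, q in combinations(landmarks, 2)]


def distance_change_features(frames):
    """frames: list of frames, each a list of (x, y) landmark coordinates,
    with the same number of landmarks in every frame.
    Returns a list of rows, one per consecutive frame pair, where each
    entry is the change in one pairwise distance between the two frames."""
    per_frame = [pairwise_distances(f) for f in frames]
    return [
        [cur - prev for prev, cur in zip(before, after)]
        for before, after in zip(per_frame, per_frame[1:])
    ]


if __name__ == "__main__":
    # Two toy frames with three landmarks each; the face doubles in size.
    frames = [
        [(0, 0), (3, 4), (0, 4)],
        [(0, 0), (6, 8), (0, 8)],
    ]
    for row in distance_change_features(frames):
        print(row)
```

With T frames and N landmarks, the result has T-1 rows and N(N-1)/2 columns, which is the kind of fixed-width sequence a CNN-LSTM can take as input.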

In step 620, the recognition unit 520 may recognize the emotion of the person in the video by adaptively fusing the respective probability information obtained by analyzing the at least one signal input to the deep learning network. The recognition unit 520 may apply weights to the respective probability information based on a preset criterion. The recognition unit 520 may classify the emotion of the person in the video into seven emotions, namely anger, disgust, fear, happiness, sadness, surprise, and neutral, and may derive a quantitative value for each classified emotion.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, a single processing device is sometimes described, but a person having ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known to and usable by those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person having ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims below.

Claims

A video-based human emotion recognition method, comprising:
inputting at least one signal among image data, facial feature point data, or voice data present in a video into a deep learning network configured based on semi-supervised learning for human emotion recognition and a plurality of multi-modal networks; and
recognizing an emotion of a person in the video by adaptively fusing respective probability information obtained by analyzing the at least one signal input to the deep learning network.

The method of claim 1,
wherein the inputting into the deep learning network comprises:
constructing an image-based network, a facial-feature-point-based network, and a video-audio-signal-based network to analyze the emotion of a person present in the video from the image data, the facial feature point data, and the voice data present in the video.

The method of claim 1,
wherein the inputting into the deep learning network comprises:
obtaining image features from the image data present in the video by using at least one deep learning network among S3DAE, C3DA, or Parallel CNN, and classifying the emotion of the person by applying an SVM or Softmax to the obtained image features.

The method of claim 1,
wherein the inputting into the deep learning network comprises:
obtaining a two-dimensional feature vector based on changes in the relative distances between feature points of the person's face across consecutive frames of the video, and classifying the emotion of the person by inputting the obtained two-dimensional feature vector into a CNN-LSTM network.

The method of claim 1,
wherein the inputting into the deep learning network comprises:
analyzing the emotion of the person from the atmosphere and background sound in the video by applying at least one voice-based network among NN, CNN, and LSTM to the audio signal in the video.

The method of claim 1,
wherein the recognizing of the emotion of the person in the video comprises:
applying weights, based on a preset criterion, to the respective probability information obtained by analyzing the at least one signal input to the deep learning network.

The method of claim 1,
wherein the recognizing of the emotion of the person in the video comprises:
classifying the emotion of the person in the video into seven emotions including anger, disgust, fear, happiness, sadness, surprise, and neutral, and deriving a quantitative value for each classified emotion.

A video-based human emotion recognition system, comprising:
an input unit configured to input at least one signal among image data, facial feature point data, or voice data present in a video into a deep learning network configured based on semi-supervised learning for human emotion recognition and a plurality of multi-modal networks; and
a recognition unit configured to recognize an emotion of a person in the video by adaptively fusing respective probability information obtained by analyzing the at least one signal input to the deep learning network.