KR20200085696A - Method of processing video for determining emotion of a person

Info

Publication number: KR20200085696A
Application number: KR1020200081613A
Authority: KR
Inventors: 유대훈; 이영복
Original assignee: 주식회사 제네시스랩
Priority date: 2018-01-02
Filing date: 2020-07-02
Publication date: 2020-07-15
Also published as: KR102290186B1

Abstract

The present invention relates to an emotion recognition method for processing an image to determine the emotional state of a person. The emotion recognition method for processing an image to determine the emotional state of a person comprises the following steps of: providing an image and a voice representing the appearance of a person, wherein the image includes a first image part, a second image part, and a third image part; processing the first image part to determine the emotional state of the person in the first image part, wherein the face and at least one hand of the person are shown and the at least one hand does not overlap any part of the face of the person in the first image part; and processing the second image part to determine the emotional state of the person, wherein the face and at least one hand of the person are shown and the at least one hand overlaps the face of the person in the second image part.

Description

Emotion recognition method that processes video to determine a person's emotional state {METHOD OF PROCESSING VIDEO FOR DETERMINING EMOTION OF A PERSON}

Embodiments of the present invention relate to an emotion recognition method that processes an image to determine a person's emotional state.

Conventional techniques detect occlusion and treat it as an error. However, the act of covering the mouth with a hand is itself important information from which the intensity of an emotional state can be estimated. With a static image alone, occlusion can leave too little information for recognition.

In addition, when emotion is recognized from facial expressions, speech by the subject leads to incorrect recognition results. Mouth shape is a very important cue in expression-based emotion recognition, but while a person speaks the mouth shape changes continually and can resemble the shapes associated with surprise, anger, or laughter, which causes misrecognition.

As such, the prior art offers almost no remedy for this problem when emotion is recognized from facial expressions alone, and multimodal approaches try to minimize such noise by combining facial expression with voice information. This patent tracks the face or mouth shape to determine whether the subject is currently speaking and, if so, minimizes the mouth shape information and increases the weight of the voice feature information so that an accurate emotion recognition result can be derived.

Embodiments of the present invention aim to provide a multimodal emotion recognition device, method, and storage medium that perform more accurate emotion recognition by using temporal information together with hand movement and identification information, mouth shape information, voice information, and partial facial expression information.

According to one aspect of an embodiment of the present invention, an emotion recognition method for processing an image to determine a person's emotional state comprises: providing an image and a voice representing the appearance of a person, wherein the image includes a first image part, a second image part immediately following the first image part, and a third image part immediately following the second image part; processing the first image part to determine the emotional state of the person in the first image part, wherein the face and at least one hand of the person are shown in the first image part and the at least one hand does not overlap any part of the person's face; and processing the second image part to determine the emotional state of the person in the second image part, wherein the face and at least one hand of the person are shown in the second image part and the at least one hand overlaps the person's face.

Processing the first image part comprises: processing at least one frame of the first image part to determine whether the at least one hand covers the person's face; locating a first facial element of the person in the at least one frame of the first image part; with the first facial element located, obtaining first facial feature data of the first image part based on the shape of the first facial element shown in the at least one frame of the first image part; processing audio data of the first image part to obtain voice feature data based on characteristics of the person's voice in the first image part; and determining the emotional state of the person for the first image part based on a plurality of data including the first facial feature data and the voice feature data of the first image part.

Processing the second image part comprises: processing at least one frame of the second image part to determine whether the person's face is covered by the at least one hand, wherein it is determined that the at least one hand covers the person's face in the second image part; locating the first facial element of the person in the at least one frame of the second image part; with the first facial element located, obtaining first facial feature data of the second image part based on the shape of the first facial element shown in the at least one frame of the second image part; processing audio data of the second image part to obtain voice feature data based on characteristics of the person's voice in the second image part; and determining the emotional state of the person for the second image part based on a plurality of data including the first facial feature data of the second image part, the voice feature data of the second image part, and additional data indicating the position at which the at least one hand covers part of the person's face.

In addition, in determining the emotional state of the person for the second image part based on the plurality of data, when part of the person's face is covered by the at least one hand in the second image part, the voice feature data of the second image part may be given more weight than the first facial feature data of the second image part.

In addition, in determining the emotional state of the person for the second image part based on the plurality of data, when no part of the person's face is covered by the at least one hand in the first image part but part of the person's face is covered by the at least one hand in the second image part, the voice feature data of the second image part may be given more weight than the voice feature data of the first image part.
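
The weighting rule described in the two preceding paragraphs can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how the weights are applied, so the sketch assumes a simple weighted sum of per-modality emotion scores, and the function names and the weight values (0.5/0.5 and 0.2/0.8) are hypothetical.

```python
from typing import Dict, Tuple


def modality_weights(face_occluded: bool) -> Tuple[float, float]:
    """Return (face_weight, voice_weight).

    The numeric values are illustrative assumptions, not values from the patent.
    """
    if face_occluded:
        # A hand covers part of the face: rely more on the voice features.
        return 0.2, 0.8
    return 0.5, 0.5


def fuse_scores(face_scores: Dict[str, float],
                voice_scores: Dict[str, float],
                face_occluded: bool) -> Dict[str, float]:
    """Combine per-emotion scores from both modalities with a weighted sum."""
    w_face, w_voice = modality_weights(face_occluded)
    return {label: w_face * face_scores[label] + w_voice * voice_scores[label]
            for label in face_scores}


# Hypothetical scores for one image part.
face = {"happy": 0.9, "sad": 0.1}
voice = {"happy": 0.2, "sad": 0.8}
unoccluded = fuse_scores(face, voice, face_occluded=False)
occluded = fuse_scores(face, voice, face_occluded=True)
```

With these assumed values, the voice weight under occlusion exceeds both the face weight for the same image part and the voice weight used for an unoccluded image part, which mirrors the two conditions stated above.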

In addition, processing the first image part may further comprise: locating a second facial element of the person in the at least one frame of the first image part; and, with the second facial element located, obtaining second facial feature data of the first image part based on the shape of the second facial element shown in the at least one frame of the first image part, wherein the emotional state of the person for the first image part is determined based on a plurality of data including the first facial feature data of the first image part, the second facial feature data of the first image part, and the voice feature data of the first image part. Processing the second image part may further comprise: locating the second facial element of the person in the at least one frame of the second image part and determining whether the second facial element is covered by the at least one hand; and obtaining second facial feature data of the second image part based on the second facial feature data of the first image part and a predetermined weight for occlusion of the second facial element, wherein the emotional state of the person for the second image part is determined based on the first facial feature data of the second image part, the second facial feature data of the second image part, the voice feature data of the second image part, and additional data indicating the position of the at least one hand covering part of the person's face.

In addition, the method may further comprise processing the third image part to determine the emotional state of the person for the third image part, wherein the face and the at least one hand of the person are shown in the third image part with no part of the person's face covered by the at least one hand. Processing the third image part may comprise: processing at least one frame of the third image part to determine whether the at least one hand covers the person's face; locating the first facial element of the person in the at least one frame of the third image part; with the first facial element located, obtaining first facial feature data of the third image part based on the shape of the first facial element shown in the at least one frame of the third image part; processing audio data of the third image part to obtain voice feature data of the third image part based on characteristics of the person's voice in the third image part; and determining the emotional state of the person for the third image part based on a plurality of data including the first facial feature data and the voice feature data of the third image part.

According to another aspect of an embodiment of the present invention, a computer-readable storage medium stores predetermined instructions that, when executed by a computer, perform an emotion recognition method for processing an image to determine a person's emotional state, the method comprising: providing an image and a voice representing the appearance of a person, wherein the image includes a first image part, a second image part immediately following the first image part, and a third image part immediately following the second image part; processing the first image part to determine the emotional state of the person in the first image part, wherein the face and at least one hand of the person are shown in the first image part and the at least one hand does not overlap any part of the person's face; and processing the second image part to determine the emotional state of the person in the second image part, wherein the face and at least one hand of the person are shown in the second image part and the at least one hand overlaps the person's face.

Processing the first image part comprises: processing at least one frame of the first image part to determine whether the at least one hand covers the person's face; locating a first facial element of the person in the at least one frame of the first image part; with the first facial element located, obtaining first facial feature data of the first image part based on the shape of the first facial element shown in the at least one frame of the first image part; processing audio data of the first image part to obtain voice feature data based on characteristics of the person's voice in the first image part; and determining the emotional state of the person for the first image part based on a plurality of data including the first facial feature data and the voice feature data of the first image part.

Processing the second image part comprises: processing at least one frame of the second image part to determine whether the person's face is covered by the at least one hand, wherein it is determined that the at least one hand covers the person's face in the second image part; locating the first facial element of the person in the at least one frame of the second image part; with the first facial element located, obtaining first facial feature data of the second image part based on the shape of the first facial element shown in the at least one frame of the second image part; processing audio data of the second image part to obtain voice feature data based on characteristics of the person's voice in the second image part; and determining the emotional state of the person for the second image part based on a plurality of data including the first facial feature data of the second image part, the voice feature data of the second image part, and additional data indicating the position at which the at least one hand covers part of the person's face.

According to the embodiments of the present invention described above, the emotion recognition method that processes an image to determine a person's emotional state can accurately identify the emotional state both when the person is talking and when the facial expression is occluded by an object such as a hand.

FIG. 1 is a diagram schematically showing the configuration of a multimodal emotion recognition device according to an embodiment of the present invention.
FIG. 2 is a diagram schematically showing the configuration of the data preprocessing unit of the multimodal emotion recognition device of FIG. 1.
FIG. 3 is a diagram schematically showing the configuration of the preliminary inference unit of the multimodal emotion recognition device of FIG. 1.
FIG. 4 is a diagram schematically showing the configuration of the main inference unit of the multimodal emotion recognition device of FIG. 1.
FIG. 5 is a flowchart showing a multimodal emotion recognition method performed by the multimodal emotion recognition device of FIG. 1.
FIG. 6 is a flowchart showing in detail the data preprocessing step of the multimodal emotion recognition method of FIG. 5.
FIG. 7 is a flowchart showing in detail the preliminary inference step of the multimodal emotion recognition method of FIG. 5.
FIG. 8 is a flowchart showing in detail the main inference step of the multimodal emotion recognition method of FIG. 5.
FIG. 9 is an exemplary diagram showing the face recognition process in the multimodal emotion recognition device of FIG. 1 depending on whether the situation changes.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice them.

The present invention may be implemented in many different forms and is not limited to the embodiments described herein. To describe the invention clearly, parts unrelated to the description are omitted from the drawings, and the same or similar elements bear the same reference numerals throughout the specification. In addition, the size and thickness of each component in the drawings are shown arbitrarily for convenience of description, so the invention is not necessarily limited to what is illustrated.

In the present invention, "on" means located above or below a target member and does not necessarily mean located on the upper side with respect to the direction of gravity. In addition, throughout the specification, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. In the description, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

The present invention derives more accurate emotion recognition results by applying artificial intelligence that takes facial expression, speaking state, hands, and voice into account, based on video and voice data of the subject.

FIG. 1 is a diagram schematically showing the configuration of a multimodal emotion recognition device according to an embodiment of the present invention.

Referring to FIG. 1, the multimodal emotion recognition device (10) may include a data input unit (100), a data preprocessing unit (200), a preliminary inference unit (300), a main inference unit (400), and an output unit (500).

The data input unit (100) may receive image data (DV) and voice data (DS) of the user.

The data input unit (100) may include an image input unit (110) that receives the image data (DV) used to recognize the user's emotion and a voice input unit (120) that receives the user's voice data (DS).

In addition, the data preprocessing unit (200) may include a voice preprocessing unit (220) that generates voice feature data (DF₂) from the voice data (DS) and an image preprocessing unit (210) that generates one or more pieces of facial feature data (DF₁) from the image data (DV).

Here, the facial feature data (DF₁) may include at least one of an image, position information, size information, face ratio information, and depth information, and the voice feature data (DF₂) may include information that characterizes the voice, such as intonation, pitch information, vocal intensity, and speech rate.
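
As a rough illustration of the two feature records listed above, the sketch below models DF₁ and DF₂ as Python dataclasses. The class names, field names, types, and units are assumptions chosen for readability; the source only enumerates the kinds of information each record may hold.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FacialFeatureData:
    """Hypothetical container for DF1; the text requires at least one field."""
    image: Optional[List[List[int]]] = None      # cropped grayscale pixels
    position: Optional[Tuple[int, int]] = None   # top-left corner (x, y)
    size: Optional[Tuple[int, int]] = None       # (width, height)
    face_ratio: Optional[float] = None           # e.g. width / height
    depth: Optional[float] = None                # depth information

    def is_valid(self) -> bool:
        """True when at least one of the listed kinds of information is present."""
        fields = (self.image, self.position, self.size,
                  self.face_ratio, self.depth)
        return any(value is not None for value in fields)


@dataclass
class VoiceFeatureData:
    """Hypothetical container for DF2; units are assumptions."""
    intonation: Optional[float] = None
    pitch_hz: Optional[float] = None
    intensity_db: Optional[float] = None
    speech_rate: Optional[float] = None          # e.g. syllables per second


mouth_feature = FacialFeatureData(position=(120, 210), size=(64, 32))
voice_feature = VoiceFeatureData(pitch_hz=180.0, speech_rate=4.2)
```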

The image preprocessing unit (210) performs image preprocessing to extract the user's facial feature data (DF₁) from the image data (DV).

The image preprocessing may convert the image data (DV) into a form suitable for use with a learning model, for example through full or partial face recognition, noise removal, and extraction of the user's facial features and images.

The voice preprocessing unit (220) performs voice preprocessing to extract the user's voice feature data (DF₂) from the voice data (DS).

The voice preprocessing may convert the voice data (DS) into a form suitable for use with a learning model, for example through removal of external sound, noise removal, and extraction of the user's voice features.

Based on the image data (DV), the preliminary inference unit (300) may generate situation determination data (P) indicating whether the user's situation changes over time.

Here, the situation determination data (P) may include conversation determination data (P₁) indicating whether the user is in a conversation state, or overlap determination data (P₂) indicating whether a tracking target region (B), which is a part of the entire image area of the image data (DV), overlaps a different region, the recognition target region (A).

Specifically, the preliminary inference unit (300) may generate position inference data (DM₁) for inferring the position of the tracking target region (B) based on the image data (DV), and may generate, based on the facial feature data (DF₁) and the position inference data (DM₁), the overlap determination data (P₂) indicating whether the tracking target region (B) and the recognition target region (A) overlap.
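
One simple way to picture the overlap determination (P₂) is an axis-aligned bounding-box test between the recognition target region (A), here the face, and the tracking target region (B), here a hand. The sketch below is an assumption for illustration only: the source does not state that the regions are rectangles, and the `Box` type and function names are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned region: top-left corner plus width and height, in pixels."""
    x: int
    y: int
    w: int
    h: int


def regions_overlap(a: Box, b: Box) -> bool:
    """True when the two rectangles share a non-empty area."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)


def covered_fraction(face: Box, hand: Box) -> float:
    """Fraction of the face rectangle covered by the hand rectangle."""
    overlap_w = min(face.x + face.w, hand.x + hand.w) - max(face.x, hand.x)
    overlap_h = min(face.y + face.h, hand.y + hand.h) - max(face.y, hand.y)
    if overlap_w <= 0 or overlap_h <= 0:
        return 0.0
    return (overlap_w * overlap_h) / (face.w * face.h)


face_region = Box(100, 100, 200, 200)   # recognition target region (A)
hand_away = Box(400, 100, 80, 80)       # tracking target region (B), no overlap
hand_on_face = Box(250, 250, 100, 100)  # tracking target region (B), overlapping
```

The covered fraction is not mentioned in the source; it is included only to show how such a test could also yield the position or extent of the occlusion.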

In addition, the preliminary inference unit (300) may generate, based on the facial feature data (DF₁), the conversation determination data (P₁) that indicates whether the user is in a conversation state.
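
The source does not describe how the conversation state is decided from the facial feature data. As one hedged possibility, the sketch below assumes a per-frame mouth-opening ratio (mouth height divided by mouth width) and flags speech when that ratio fluctuates strongly over a short window. The function name, the input representation, and the threshold of 0.05 are all hypothetical.

```python
from statistics import pstdev
from typing import Sequence


def is_speaking(mouth_open_ratios: Sequence[float],
                threshold: float = 0.05) -> bool:
    """Guess the conversation state from mouth-shape variation over frames.

    A mouth that opens and closes repeatedly produces a large standard
    deviation; a held expression produces a small one.
    """
    if len(mouth_open_ratios) < 2:
        return False  # not enough frames to measure variation
    return pstdev(mouth_open_ratios) > threshold


talking_window = [0.10, 0.40, 0.15, 0.45, 0.10]   # mouth shape keeps changing
smiling_window = [0.30, 0.31, 0.30, 0.29]         # mouth shape held steady
```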

The main inference unit (400) may generate at least one sub-feature map (FM) based on the voice feature data (DF₂) or the facial feature data (DF₁), and may infer the user's emotional state based on the sub-feature map (FM) and the situation determination data (P).

The emotional state may include information on the user's emotion, such as happiness, anger, fear, disgust, sadness, or surprise.

The output unit (500) may output the emotional state inferred by the main inference unit (400).

Here, the output unit (500) may output the result in various forms using an activation function such as a sigmoid function, a step function, a softmax function, or a ReLU (Rectified Linear Unit).
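
For reference, the four activation functions named above have the following standard definitions. The sketch uses plain Python; the example logits and the way the label list is paired with them are assumptions for illustration and are not taken from the source.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Map a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def step(x: float) -> float:
    """Binary decision: 1 for non-negative input, 0 otherwise."""
    return 1.0 if x >= 0 else 0.0


def relu(x: float) -> float:
    """Rectified Linear Unit: pass positive values, clamp negatives to 0."""
    return max(0.0, x)


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


EMOTIONS = ["happiness", "anger", "fear", "disgust", "sadness", "surprise"]
example_logits = [2.0, 0.5, 0.1, 0.1, 0.3, 1.0]   # hypothetical model output
probabilities = softmax(example_logits)
predicted = EMOTIONS[probabilities.index(max(probabilities))]
```

Softmax suits a single-label result over the six emotions, while a per-emotion sigmoid would allow several emotions to be reported at once; the source leaves this choice open.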

FIG. 2 is a diagram schematically showing the configuration of the data preprocessing unit of the multimodal emotion recognition device of FIG. 1.

Referring to FIG. 2, the data preprocessing unit (200) may include an image preprocessing unit (210) and a voice preprocessing unit (220).

The image preprocessing unit (210) may include a face detector (211), an image preprocessing module (212), a landmark detection module (213), a position adjustment module (214), and a facial element extraction module (215).

The face detector (211) may detect, within the entire area of the image data (DV), a recognition target region (A), which is the region corresponding to the user's face.

The image preprocessing module (212) may correct the recognition target region (A).

Specifically, the image preprocessing module (212) may correct image brightness and blur and remove noise from the image data (DV).

The landmark detection module (213) may extract facial element position information (AL) of the recognition target region (A).

Specifically, it may identify the position information of key facial elements in the recognition target region (A), such as the face, eyes, mouth, nose, and forehead, so that face recognition can be performed.

The position adjustment module (214) may adjust the position based on the facial element position information (AL) of the recognition target region (A).

Specifically, the position adjustment module (214) may align the image horizontally or vertically with reference to the facial element position information (AL) extracted by the landmark detection module (213).
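
A common way to perform this kind of alignment is to measure the tilt of the line joining the two eye landmarks and rotate the image by the opposite angle. The sketch below assumes that approach; the source only says the image is aligned horizontally or vertically using the landmark positions, and the function names are hypothetical. Only the geometry is shown, not the pixel resampling.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def roll_angle_degrees(left_eye: Point, right_eye: Point) -> float:
    """Angle of the eye line relative to the horizontal axis, in degrees."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(point: Point, center: Point, angle_degrees: float) -> Point:
    """Rotate a point about a center by the given angle."""
    theta = math.radians(angle_degrees)
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    return (center[0] + math.cos(theta) * dx - math.sin(theta) * dy,
            center[1] + math.sin(theta) * dx + math.cos(theta) * dy)


left_eye, right_eye = (100.0, 100.0), (200.0, 150.0)   # hypothetical landmarks
tilt = roll_angle_degrees(left_eye, right_eye)
# Rotating by the negative tilt brings both eyes onto the same horizontal line.
aligned_right_eye = rotate_point(right_eye, left_eye, -tilt)
```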

The face element extraction module 215 may set a sub recognition target area AA that is located within the recognition target area A and is smaller than the recognition target area A, and may generate facial feature data DF₁ of the sub recognition target area AA.

The sub recognition target area AA may be a single area or a plurality of areas in which at least one facial element, such as the face, eyes, mouth, nose, or forehead, has been identified.

For example, when the eyes, nose, and mouth are extracted from the recognition target area A together with their facial element location information AL, the face element extraction module 215 may set an eye recognition area A₁, a nose recognition area A₂, and a mouth recognition area A₃ as sub recognition target areas AA, and may generate at least one piece of facial feature data DF₁ for the sub recognition target areas AA thus set.

Also, when no sub recognition target area AA is set, the face element extraction module 215 may generate the facial feature data DF₁ based on the recognition target area A.
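The setting of sub recognition target areas, including the fallback to the whole recognition target area A, might be sketched as follows. This is an illustration under stated assumptions: areas are axis-aligned boxes (x0, y0, x1, y1), each sub area is a fixed-size square around a landmark, and the names `clip_box`, `sub_regions`, and `half_size` are hypothetical.

```python
def clip_box(box, bounds):
    """Clip a box (x0, y0, x1, y1) so that it lies inside `bounds`."""
    x0, y0, x1, y1 = box
    bx0, by0, bx1, by1 = bounds
    return (max(x0, bx0), max(y0, by0), min(x1, bx1), min(y1, by1))

def sub_regions(region_a, landmarks, half_size=10):
    """Return one sub area per landmark, or the whole area A if none can be set.

    region_a  : box standing in for the recognition target area A
    landmarks : dict of name -> (x, y), standing in for AL (assumed format)
    half_size : half the side length of each square sub area (assumed value)
    """
    regions = {}
    for name, (x, y) in landmarks.items():
        box = clip_box((x - half_size, y - half_size,
                        x + half_size, y + half_size), region_a)
        # Keep only boxes that still have a positive area after clipping.
        if box[0] < box[2] and box[1] < box[3]:
            regions[name] = box
    # Fallback: no sub area was set, so features are taken from A itself.
    return regions or {"face": region_a}
```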

The voice preprocessing unit 220 may include a voice correction module 221 and a voice feature data extraction module 222.

The voice correction module 221 may correct the voice data DS.

In detail, the voice correction module 221 may generate corrected voice data by applying various correction methods to the voice data DS, such as removing noise and external sounds contained in it, adjusting the volume, and correcting the frequency.

The voice feature data extraction module 222 may extract features of the voice data DS that has passed through the voice correction module 221 to generate voice feature data DF₂.

In detail, the voice feature data extraction module 222 may generate the user's voice feature data DF₂ through one or more voice data, frequency, and spectrum analysis modules, such as MFCC (Mel-frequency cepstral coefficients), eGeMAPS (extended Geneva Minimalistic Acoustic Parameter Set), and Logbank.

In this case, the voice feature data extraction module 222 may use either the corrected voice data or the voice data DS.
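The two-stage voice path (optional correction, then feature extraction) can be sketched in a few lines. The sketch below is not MFCC or eGeMAPS: as a deliberately simple stand-in it uses peak normalization as the "volume adjustment" and frame-wise log energy as the feature. All function names, the frame size, and the `correct` flag are assumptions made for illustration.

```python
import math

def normalize_volume(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    return [s * target_peak / peak for s in samples]

def frame_log_energy(samples, frame_size, eps=1e-10):
    """Log of the mean squared amplitude of each non-overlapping frame."""
    features = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / frame_size
        features.append(math.log(energy + eps))  # eps avoids log(0) on silence
    return features

def voice_features(samples, frame_size=4, correct=True):
    """Extract features from corrected data, or from the raw data if correct=False."""
    data = normalize_volume(samples) if correct else list(samples)
    return frame_log_energy(data, frame_size)
```

The `correct` flag mirrors the statement that feature extraction may run on either the corrected voice data or the original voice data DS.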

FIG. 3 is a diagram schematically showing the configuration of the preliminary reasoning unit of the multi-modal emotion recognition device of FIG. 1.

Referring to FIG. 3, the preliminary reasoning unit 300 may include a hand detection inference module 310, a conversation state inference module 320, and a face overlap inspection module 330.

The conversation state inference module 320 may use the first learning model LM₁ to generate conversation determination data P₁ based on the facial feature data DF₁.

In detail, the conversation state inference module 320 may take all or part of the user's facial feature data DF₁ and, using the first learning model LM₁, which can determine whether the user is in a conversation state, generate the conversation determination data P₁ indicating whether the user is judged to be in conversation.

The facial feature data DF₁ includes mouth image data DV₂, which is the image data DV for the portion of the recognition target area A corresponding to the user's mouth, and the first learning model LM₁ may be used to generate, from the mouth image data DV₂, the conversation determination data P₁ on whether the user is in a conversation state.

The first learning model LM₁ may be at least one of an artificial intelligence model, a machine learning method, and a deep learning method capable of inferring temporal or spatial features, such as LSTM (Long Short-Term Memory), RNN (Recurrent Neural Network), DNN (Deep Neural Network), and CNN (Convolutional Neural Network).

The hand detection inference module 310 may detect hand image data DV₁ for the tracking target area B from the image data DV, and may use the second learning model LM₂ to generate location inference data DM₁ based on the hand image data DV₁.

In this case, the second learning model LM₂ is at least one of an artificial intelligence model, a machine learning method, and a deep learning method capable of inferring temporal or spatial features, such as LSTM (Long Short-Term Memory), RNN (Recurrent Neural Network), DNN (Deep Neural Network), and CNN (Convolutional Neural Network), through which the location inference data DM₁ for the hand may be generated.

Also, the hand detection inference module 310 may generate a location inference feature map FM₁ for the location inference data DM₁, and the user's emotional state may be inferred based on the sub feature map FM, the situation determination data P, and the location inference feature map FM₁.

In this case, the location inference feature map FM₁ may include feature information about the hand, that is, meaningful information about hand movement such as the hand gesture and the hand position.

The face overlap inspection module 330 may determine, based on the facial feature data DF₁ and the location inference data DM₁, whether the recognition target area A and the tracking target area B overlap, and may generate overlap determination data P₂ according to the result of the determination.

In detail, the overlap determination data P₂ may include one or more parameters that, depending on whether the recognition target area A and the tracking target area B overlap, determine the importance and use of the facial feature data DF₁ corresponding to the recognition target area A and of the voice feature data DF₂.
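The overlap test between a facial area and the hand's tracking target area can be illustrated with axis-aligned boxes. The sketch assumes boxes of the form (x0, y0, x1, y1) and a binary parameter per facial area (1.0 for "use", 0.0 for "do not use"); the specification only says that P₂ carries importance and use parameters, so the binary encoding and the function names are assumptions.

```python
def boxes_overlap(a, b):
    """True if two boxes (x0, y0, x1, y1) share a region of positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def overlap_determination(face_regions, hand_box):
    """Return a per-area use parameter, a hypothetical encoding of P2.

    face_regions : dict of name -> box for the facial (sub) areas
    hand_box     : box standing in for the tracking target area B
    An area covered by the hand gets 0.0, an uncovered area gets 1.0.
    """
    return {name: 0.0 if boxes_overlap(box, hand_box) else 1.0
            for name, box in face_regions.items()}
```

A graded value between 0 and 1, for example based on the fraction of the area that is covered, would be an equally plausible reading of "importance".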

FIG. 4 is a diagram schematically showing the configuration of the main reasoning unit of the multi-modal emotion recognition device of FIG. 1.

Referring to FIG. 4, the main reasoning unit 400 may include a plurality of sub feature map generation units 410 (411, 412, 413, 414), a multi-modal feature map generation unit 420, and an emotion recognition reasoning unit 430.

The plurality of sub feature map generation units 410 (411, 412, 413, 414) may use the third learning model LM₃ to generate, from the voice feature data DF₂ and the facial feature data DF₁, a plurality of sub feature maps FM for the voice feature data DF₂ and the facial feature data DF₁.

In detail, the third learning model LM₃ may be at least one of an artificial intelligence model, a machine learning method, and a deep learning method capable of inferring at least one spatial feature, such as DNN (Deep Neural Network) and CNN (Convolutional Neural Network), and may be used to generate a plurality of sub feature maps FM that condense the features of the voice feature data DF₂ and the facial feature data DF₁.

The multi-modal feature map generation unit 420 may generate the multi-modal feature map M from the plurality of sub feature maps FM with reference to the situation determination data P.

The situation determination data P has a situation determination value PV preset according to the user's situation, and the multi-modal feature map generation unit 420 may generate the multi-modal feature map M by applying the situation determination value PV to at least one of the plurality of sub feature maps FM.

In detail, the situation determination value PV may be a parameter indicating the importance of each sub feature map FM and whether it is used.

Through an operation between the situation determination data P and the sub feature maps FM, sub feature maps FM to which the situation determination value PV of the situation determination data P has been applied may be generated, and the plurality of sub feature maps FM may be integrated to generate the multi-modal feature map M.

For example, when the user's eyes are covered, the situation determination value for the eyes is output as 0, so that multiplying it by the sub feature map FM for the eyes yields 0; the main reasoning unit 400 can then generate the multi-modal feature map M from the sub feature maps other than the one for the eyes.

Also, the location inference feature map FM₁ may be generated by the hand detection inference module 310, and a multi-modal feature map M for inferring the user's emotional state may be generated based on the sub feature map FM, the situation determination data P, and the location inference feature map FM₁.

The multi-modal feature map M may be generated by merging at least one of the sub feature maps FM and the location inference feature map FM₁ using Concat, Merge, a deep network, or the like.
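The weighting and merging described above can be shown with plain lists. In this minimal sketch each sub feature map is a flat list of floats, the situation determination value PV is a scalar per sub feature map, and merging is simple concatenation (the "Concat" option). The function names, the dictionary layout, and the default value of 1.0 for areas without a PV are assumptions for illustration.

```python
def apply_situation_values(sub_maps, situation_values):
    """Multiply each sub feature map by its situation determination value.

    sub_maps         : dict of name -> list of floats (stand-in for FM)
    situation_values : dict of name -> float (stand-in for PV); areas that
                       are not listed are assumed to be fully used (1.0)
    """
    return {name: [situation_values.get(name, 1.0) * v for v in fmap]
            for name, fmap in sub_maps.items()}

def build_multimodal_map(sub_maps, situation_values, location_map=None):
    """Concatenate the weighted sub feature maps, then the hand feature map."""
    weighted = apply_situation_values(sub_maps, situation_values)
    merged = []
    for name in sorted(weighted):  # fixed order keeps the layout stable
        merged.extend(weighted[name])
    if location_map is not None:   # stand-in for FM1
        merged.extend(location_map)
    return merged

# Eyes are covered, so their PV is 0 and their features are zeroed out.
m = build_multimodal_map({"eye": [1.0, 2.0], "mouth": [3.0]},
                         {"eye": 0.0}, location_map=[9.0])
print(m)  # [0.0, 0.0, 3.0, 9.0]
```

Zeroing keeps the merged vector at a fixed length, which is convenient when the result is fed to a model with a fixed input size.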

The emotion recognition reasoning unit 430 may use the fourth learning model LM₄ to infer the emotional state based on the multi-modal feature map M.

In this case, the fourth learning model LM₄ may be a temporal learning model such as a recurrent neural network, for example LSTM (Long Short-Term Memory), RNN (Recurrent Neural Network), or GRU (Gated Recurrent Unit), and may be at least one of an artificial intelligence model, a machine learning method, and a deep learning method capable of inferring or analyzing temporal and spatial features.

FIG. 5 is a flowchart showing a multi-modal emotion recognition method performed by the multi-modal emotion recognition device of FIG. 1.

Referring to FIG. 5, a data input step S100 of receiving the user's image data DV and voice data DS is performed.

Next, a data preprocessing step S200 may be performed, which includes a voice preprocessing step of generating voice feature data DF₂ from the voice data DS and an image preprocessing step of generating one or more pieces of facial feature data DF₁ from the image data DV.

In this case, the data preprocessing step S200 may generate the facial feature data DF₁ and the voice feature data DF₂ for use with the learning models.

The learning models may be artificial intelligence, machine learning, or deep learning methods.

Next, a preliminary inference step S300 may be performed, which generates, based on the image data DV, situation determination data P regarding whether the user's situation changes over time.

In this case, the temporal order may relate to whether the user is in a conversation state, and may be data for identifying the characteristics of the movement of a body part.

Also, the situation determination data P may include a parameter indicating the importance or use of one or more pieces of facial feature data DF₁ or voice feature data DF₂, obtained by determining from the image data DV whether an overlap exists and whether the user is in a conversation state.

Also, feature information on a part of the user's body other than the one or more pieces of facial feature data DF₁ generated in the data preprocessing step S200 may be extracted and generated.

Next, a main inference step S400 may be performed, which generates at least one sub feature map FM based on the voice feature data DF₂ or the facial feature data DF₁ and infers the user's emotional state based on the sub feature map FM and the situation determination data P.

In this case, the sub feature map FM, which contains the feature information extracted from the user, is combined with the situation determination data P, which contains parameters for the importance or use of that feature information, so that the sub feature map FM carries the importance or use information, and the user's emotional state may then be inferred.

Next, a result derivation step S500 of outputting the result of the emotional state inference made in the main inference step S400 is performed.

FIG. 6 is a flowchart showing in detail the data preprocessing step of the multi-modal emotion recognition method of FIG. 5.

Referring to FIG. 6, the data preprocessing step S200 includes an image preprocessing step S210 and a voice preprocessing step S220.

In the image preprocessing step S210, a face detection step is performed that detects, within the entire area of the image data DV, the recognition target area A, which is the area corresponding to the user's face.

Next, an image preprocessing step of correcting the recognition target area A is performed.

In detail, in this image preprocessing step, the brightness and blur of the image may be corrected and noise may be removed from the image data DV.

Next, a landmark detection step of extracting the facial element location information AL of the recognition target area A is performed.

In detail, the location information of key facial elements in the recognition target area A, such as the face, eyes, nose, mouth, and forehead, may be identified so that face recognition can be performed.

Next, a position adjustment step of adjusting the position based on the facial element location information AL of the recognition target area A may be performed.

In detail, the image may be aligned horizontally or vertically with reference to the facial element location information AL extracted by the landmark detection module 213.

Next, a face element extraction step may be performed, which sets, based on the facial element location information AL, a sub recognition target area AA that is located within the recognition target area A and is smaller than the recognition target area A, and generates facial feature data DF₁ of the sub recognition target area AA.

In this case, the sub recognition target area AA may be a single area or a plurality of areas in which at least one facial element, such as the entire face, eyes, mouth, nose, or forehead, has been identified.

Also, in the face element extraction step, when no sub recognition target area AA is set, the facial feature data DF₁ may be generated based on the recognition target area A.

The voice preprocessing step S220 includes a voice correction step and a voice feature data extraction step.

First, the voice correction step of correcting the voice data DS is performed.

In detail, in the voice correction step, corrected voice data may be generated by applying various correction methods to the voice data DS, such as removing noise and external sounds contained in it, adjusting the volume, and correcting the frequency.

Then, the voice feature data extraction step is performed, which extracts features of the voice data DS that has undergone the voice correction step and generates voice feature data DF₂.

In detail, the user's voice feature data DF₂ may be generated through one or more voice data, frequency, and spectrum analysis modules, such as MFCC (Mel-frequency cepstral coefficients), eGeMAPS (extended Geneva Minimalistic Acoustic Parameter Set), and Logbank.

In this case, the voice feature data extraction step may use the corrected voice data, or may generate the voice feature data DF₂ from the voice data DS without the voice correction step being performed.

These steps are exemplary; at least some of the steps may be performed simultaneously with the preceding or following steps, or in a different order.

FIG. 7 is a flowchart showing in detail the preliminary inference step of the multi-modal emotion recognition method of FIG. 5.

A conversation state inference step S310 of using the first learning model LM₁ to generate conversation determination data P₁ based on the facial feature data DF₁ may be performed.

In the conversation state inference step S310, the first learning model LM₁ may be used to detect whether the user is in a conversation state, taking into account whether the user was conversing in the previous situation and detecting the features and movements of facial elements from the facial feature data DF₁.

In detail, using all or part of the user's facial feature data DF₁, the first learning model LM₁ may determine whether the user is conversing, and the conversation determination data P₁ indicating the result may be generated.

In this case, the facial feature data DF₁ may include mouth image data DV₂ for the portion of the recognition target area A corresponding to the user's mouth.

Also, the first learning model LM₁ may be used to generate, from the mouth image data DV₂, the conversation determination data P₁ on whether the user is in a conversation state.

Next, a hand detection inference step S320 is performed, which detects hand image data DV₁ for the tracking target area B from the image data DV and uses the second learning model LM₂ to generate location inference data DM₁ based on the hand image data DV₁.

In this case, the second learning model LM₂ may make it possible to infer the hand position temporally, in relation to the previous situation. For example, it may be determined whether the hand has temporarily overlapped the face.

Also, in the hand detection inference step S320, a location inference feature map FM₁ for the location inference data DM₁ may be generated, and the user's emotional state may be inferred based on the sub feature map FM, the situation determination data P, and the location inference feature map FM₁.

In detail, the location inference feature map FM₁ may include meaningful information about hand movement, such as features from which the hand gesture can be identified and information on the hand position.

Next, a face overlap inspection step S330 is performed, which determines, based on the facial feature data DF₁ and the location inference data DM₁, whether the recognition target area A and the tracking target area B overlap, and generates overlap determination data P₂ according to the result of the determination.

In detail, the overlap determination data P₂ may include one or more parameters that, depending on whether the recognition target area A and the tracking target area B overlap, determine the importance and use of the facial feature data DF₁ corresponding to the recognition target area A and of the voice feature data DF₂.

FIG. 8 is a flowchart showing in detail the main inference step of the multi-modal emotion recognition method of FIG. 5.

Referring to FIG. 8, the main inference step S400 includes a sub feature map generation step S410, a multi-modal feature map generation step S420, and an emotion recognition inference step S430.

First, the sub feature map generation step S410 is performed, which uses the third learning model LM₃ to generate, from the voice feature data DF₂ and the facial feature data DF₁, a plurality of sub feature maps FM for the voice feature data DF₂ and the facial feature data DF₁.

Next, the multi-modal feature map generation step S420 is performed, in which the multi-modal feature map M is generated from the plurality of sub feature maps FM with reference to the situation determination data P.

In this case, the situation determination data P has a situation determination value PV preset according to the user's situation, and in the multi-modal feature map generation step S420 the multi-modal feature map M may be generated by applying the situation determination value PV to at least one of the plurality of sub feature maps FM.

Also, in the multi-modal feature map generation step S420, the location inference feature map FM₁ may be generated by the hand detection inference module 310, and a multi-modal feature map M for inferring the user's emotional state may be generated based on the sub feature map FM, the situation determination data P, and the location inference feature map FM₁.

Next, an emotion recognition inference step S430 is performed, which uses the fourth learning model LM₄ to infer the emotional state based on the multi-modal feature map M.

FIG. 9 is an exemplary diagram showing the face recognition process of the multi-modal emotion recognition device of FIG. 1 according to whether the situation changes.

Referring to FIG. 9, (Step A) shows a situation in which the user is holding a hand against the face, but the hand does not cover the mouth or nose.

The user's image data DV is input through the image input unit 110, and the user's voice data DS is input through the voice input unit 120.

Thereafter, the image preprocessing unit 210 generates preprocessed facial feature data DF₁, and the voice preprocessing unit 220 generates preprocessed voice feature data DF₂. Based on the facial element location information AL of the user's recognizable eyes, nose, and mouth, the image preprocessing unit 210 sets a recognition target area A that includes the eye recognition area A₁, the nose recognition area A₂, and the mouth recognition area A₃, and transmits the recognition target area A to the preliminary reasoning unit 300.

Thereafter, the preliminary reasoning unit 300 generates hand image data DV₁ for the tracking target area B₁ detected from the image data DV.

In this case, the preliminary reasoning unit 300 generates location inference data DM₁, which captures the hand movement, from the hand image data DV₁, and generates overlap determination data P₂ by determining whether the tracking target area B₁ based on the location inference data DM₁ overlaps the recognition target area A.

Here, the overlap determination data P₂ may include parameters indicating that the eye recognition area A₁, the nose recognition area A₂, and the mouth recognition area A₃ are to be used.

Also, the conversation state inference module 320 determines whether the user is in a conversation state from the mouth recognition area A₃ based on the mouth image data DV₂, and generates the conversation determination data P₁.

Thereafter, the sub feature map generation unit 410 uses the third learning model LM₃ to generate a plurality of sub feature maps FM from the facial feature data DF₁ corresponding to the eyes, nose, and mouth.

Thereafter, the multi-modal feature map generation unit 420 integrates the plurality of sub feature maps FM with the location inference feature map FM₁ corresponding to the hand to generate the multi-modal feature map M.

Thereafter, the emotional state may be inferred through the fourth learning model LM₄ in consideration of the user's previous behavior, and presented as the emotion recognition result.

(Step B) Step B shows the motion that continues from Step A.

For example, Step B may be assumed to be video captured continuously after Step A at 30 FPS.

A단계와 마찬가지로, 영상 입력부(110)를 통해 사용자의 영상 데이터(DV)가 입력되고, 음성 입력부(120)를 통해 사용자의 음성 데이터(DS)가 입력된다. As in step A, the user's image data DV is input through the image input unit 110 and the user's voice data DS is input through the audio input unit 120.

이 후, 음성 전처리부(220)를 통해 음성 전처리가 된 음성 특징 데이터(DF₂)를 생성하고, 영상 전처리부(210)는 얼굴 특징 데이터(DF₁) 및 얼굴 요소 위치 정보(AL)를 생성하고, 얼굴 요소 위치 정보(AL)를 기반으로 눈 인식 영역(A₁), 코 인식 영역(A₂), 입 인식 영역(A₃)을 포함하는 인식 대상 영역(A)을 설정하고, 인식 대상 영역(A)을 예비 추론부(300)로 송신한다.Thereafter, the voice pre-processing unit 220 generates voice pre-processed voice feature data DF ₂ , and the image pre-processor 210 generates face feature data DF ₁ and face element location information AL. Then, based on the face element location information AL, the recognition target area A including the eye recognition area A ₁ , the nose recognition area A ₂ , and the mouth recognition area A ₃ is set, and the recognition target The area A is transmitted to the preliminary reasoning unit 300.

At this time, the size of the recognition target area A may change according to the user's motion.

Compared with step A, step B shows the size of the recognition target area A changing with the motion.

Thereafter, the preliminary reasoning unit 300 may generate location inference data DM₁ based on the hand image data DV₁ and thereby track the movement of the hand from step A to step B.

The preliminary reasoning unit 300 generates overlap determination data P₂ by determining whether the tracking target area B₂, which is based on the location inference data DM₁, overlaps the recognition target area A.
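
The overlap determination between a tracking target area and a recognition target area can be sketched as a rectangle intersection test. The example assumes both areas are axis-aligned boxes given as (x_min, y_min, x_max, y_max). The source does not state the geometry, and the function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes (0.0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def boxes_overlap(a: Box, b: Box) -> bool:
    """True when the boxes share a region of non-zero area."""
    return overlap_area(a, b) > 0.0


# Hypothetical areas in pixels: a mouth area and two hand positions.
mouth_area = (40, 78, 60, 88)
hand_far = (100, 100, 140, 150)   # hand away from the face
hand_near = (50, 70, 90, 120)     # hand partly over the mouth

overlaps_far = boxes_overlap(mouth_area, hand_far)
overlaps_near = boxes_overlap(mouth_area, hand_near)
```

Boxes that only touch along an edge are treated as not overlapping, which is a design choice of this sketch rather than something the source prescribes.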

In addition, the preliminary reasoning unit 300 determines whether the user is in a conversation state and generates conversation determination data P₁.

At this time, the preliminary reasoning unit 300 may use the first learning model LM₁ to determine the conversation state, taking into account whether the user who is the target of emotion recognition has continued to converse in the preceding situation, including step (A).

For example, if the user was inferred not to be in a conversation state in step A, then, based on that result, the preliminary reasoning unit 300 may use the first learning model LM₁ to determine that the user is not in a conversation state in step B, even if the mouth shape observed in the mouth recognition area A₃ temporarily resembles the mouth shape of a conversation state. That is, the preliminary reasoning unit 300 may infer the conversation state in the next scene, step B, based on the conversation state determination result of step A.
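
The effect described above, where a brief talking-like mouth shape does not overturn an earlier "not in conversation" result, can be imitated with a simple hysteresis counter. This is a stand-in for illustration only: the source uses a learned model LM₁, not a counter, and the class name and the `frames_to_switch` parameter are hypothetical.

```python
class ConversationStateFilter:
    """Keep the previous conversation state until enough frames disagree.

    A hand-written substitute for the learned model LM1: the state flips
    only after `frames_to_switch` consecutive contrary observations.
    """

    def __init__(self, frames_to_switch: int = 5, initial_state: bool = False) -> None:
        self.frames_to_switch = frames_to_switch
        self.state = initial_state
        self._disagreeing_frames = 0

    def update(self, mouth_looks_talking: bool) -> bool:
        """Feed one per-frame observation and return the filtered state."""
        if mouth_looks_talking == self.state:
            self._disagreeing_frames = 0
        else:
            self._disagreeing_frames += 1
            if self._disagreeing_frames >= self.frames_to_switch:
                self.state = mouth_looks_talking
                self._disagreeing_frames = 0
        return self.state


# One isolated talking-like frame is ignored; three in a row flip the state.
state_filter = ConversationStateFilter(frames_to_switch=3)
observations = [False, True, False, True, True, True]
states = [state_filter.update(obs) for obs in observations]
```

A learned temporal model can weigh much richer context than this counter, but the sketch shows why the result of step A carries over into step B.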

Thereafter, the main reasoning unit 400 generates a plurality of sub-feature maps FM from the received facial feature data DF₁ and voice feature data DF₂ using the third learning model LM₃, and generates a multi-modal feature map M by integrating the plurality of sub-feature maps FM with the location inference feature map FM₁ corresponding to the hand.

Thereafter, the main reasoning unit 400 infers the emotion through the fourth learning model LM₄ in consideration of the user's previous behavior (step (A)), and the inference may be presented as the emotion recognition result.

(Step (C)) Step C shows the user covering the mouth with a hand after step B.

The image pre-processing unit 210 sets a recognition target area A that includes the eye recognition area A₁, based on the face element location information AL of the user's eyes, which remain recognizable, and transmits the recognition target area A to the preliminary reasoning unit 300.

Thereafter, the preliminary reasoning unit 300 generates hand image data DV₁ for the tracking target area B₃ detected from the image data DV. At this time, it generates, from the hand image data DV₁, location inference data DM₁ that captures the movement of the hand, and generates overlap determination data P₂ by determining whether the tracking target area B₃, which is based on the location inference data DM₁, overlaps the recognition target area A.

Here, the overlap determination data P₂ may include a parameter indicating whether to use the facial feature data DF₁ based on the eye recognition area A₁, or a weight to be applied to the facial feature data DF₁.

In addition, the preliminary reasoning unit 300 may recognize that the nose recognition area A₂ or the mouth recognition area A₃, which belonged to the recognition target area A in steps (A) and (B), overlaps the tracking target area B₃, which is the area of the user's hand position, and may include in the overlap determination data P₂ a parameter indicating that the overlapped area is excluded from emotion recognition inference or is given lower importance.
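
One way to picture the parameters carried in P₂ is a per-area weight that shrinks as the hand covers more of the area and drops to zero once the area is mostly hidden. The sketch below is an assumption made for illustration: the source only says that P₂ may hold a use/exclude flag or a weight, and the formula, the `drop_threshold` value, and the function names are hypothetical.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes (0.0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def region_weights(
    recognition_areas: Dict[str, Box], hand_box: Box, drop_threshold: float = 0.5
) -> Dict[str, float]:
    """Weight per face area: 1.0 when unobstructed, 0.0 when excluded."""
    weights: Dict[str, float] = {}
    for name, box in recognition_areas.items():
        area = (box[2] - box[0]) * (box[3] - box[1])
        covered = overlap_area(box, hand_box) / area if area > 0 else 1.0
        # Exclude the area once the covered fraction reaches the threshold.
        weights[name] = 0.0 if covered >= drop_threshold else 1.0 - covered
    return weights


# Hypothetical areas in pixels; the hand hides the mouth and clips the nose.
areas = {"eye": (30, 38, 70, 46), "nose": (45, 50, 55, 70), "mouth": (40, 78, 60, 88)}
hand = (35, 65, 75, 100)
weights = region_weights(areas, hand)
```

In this example the eye area keeps full weight, the nose area is down-weighted, and the fully covered mouth area is excluded, which mirrors the situation in step C.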

In addition, the preliminary reasoning unit 300 may include in the conversation determination data P₁ a value indicating whether to use the voice feature data DF₂, taking into account both that the mouth image data DV₂ corresponding to the mouth recognition area A₃ is not recognized and the result of the previous determination of whether the user was in a conversation state.
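
The decision on whether to use the voice feature data when the mouth cannot be seen might look like the following sketch. It assumes a simple fallback rule: when the mouth is hidden, the previous conversation state is carried over, and voice features are used only while the user is judged to be in conversation. The rule, the `ConversationJudgment` container, and the function name are hypothetical, and the sketch leaves out the learned model that the source uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConversationJudgment:
    """Hypothetical container standing in for the conversation determination data P1."""

    in_conversation: bool
    use_voice_features: bool


def judge_conversation(
    mouth_visible: bool,
    mouth_looks_talking: Optional[bool],
    previously_in_conversation: bool,
) -> ConversationJudgment:
    """Decide the conversation state and whether to use voice features."""
    if mouth_visible and mouth_looks_talking is not None:
        in_conversation = mouth_looks_talking
    else:
        # No usable mouth image data: fall back to the earlier result.
        in_conversation = previously_in_conversation
    return ConversationJudgment(in_conversation, use_voice_features=in_conversation)


# Mouth covered by the hand while the user was already talking.
p1 = judge_conversation(
    mouth_visible=False, mouth_looks_talking=None, previously_in_conversation=True
)
```

With this rule, covering the mouth in the middle of a conversation does not cause the voice features to be discarded.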

Here, the result of the previous conversation state determination is inferred through a temporal learning model. The temporal learning model may be a recurrent neural network such as an LSTM (Long Short-Term Memory), an RNN (Recurrent Neural Network), or a GRU (Gated Recurrent Unit).
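
To make the idea of a temporal learning model concrete, the sketch below implements a single GRU update for a scalar input and a scalar hidden state. It follows one common convention, h' = (1 - z)·h + z·h̃ (some libraries swap the roles of z and 1 - z). The parameter names are hypothetical and the weights are untrained placeholders, so this is not the model used in the source.

```python
import math
from typing import Dict


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x: float, h: float, p: Dict[str, float]) -> float:
    """One GRU update with scalar input x and scalar hidden state h."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])  # reset gate
    h_candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])
    # Blend the old state with the candidate state.
    return (1.0 - z) * h + z * h_candidate


# Placeholder parameters: with all zeros, both gates sit at 0.5 and the
# candidate state is 0, so each step halves the hidden state.
zero_params = {k: 0.0 for k in ("wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh")}
h_next = gru_step(x=1.0, h=0.8, p=zero_params)
```

The hidden state h is what lets the model carry the conversation state of earlier frames into the current frame.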

Subsequently, the sub-feature map generation unit 410 generates a plurality of sub-feature maps FM from the facial feature data DF₁ of the area corresponding to the eyes, using the third learning model LM₃.

Thereafter, the emotion recognition reasoning unit 430 infers the emotion through the fourth learning model LM₄ in consideration of the user's previous behavior, and the inference may be presented as the emotion recognition result.

Although the embodiments have been described with reference to limited embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or the components of the described system, structure, device, circuit, and the like are combined in a different form from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the claims that follow.

The system or device described above may be implemented with hardware components, software components, and/or a combination of hardware and software components. For example, the systems, devices, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an ALU (Arithmetic Logic Unit), a digital signal processor, a microcomputer, an FPA (Field Programmable Array), a PLU (Programmable Logic Unit), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

10: multi-modal emotion recognition device
100: data input unit
110: image input unit
120: voice input unit
200: data pre-processing unit
210: image pre-processing unit
211: face detection module
212: image pre-processing module
213: landmark detection module
214: position adjustment module
215: face element extraction module
220: voice pre-processing unit
221: voice correction module
222: voice feature data extraction module
300: preliminary reasoning unit
310: conversation state inference module
320: hand detection inference module
330: face overlap check module
400: main reasoning unit
411: first sub-feature map generation unit
412: second sub-feature map generation unit
413: third sub-feature map generation unit
414: n-th sub-feature map generation unit
420: multi-modal feature map generation unit
430: emotion recognition reasoning unit
500: output unit
S100: data input step
S200: data pre-processing step
S210: image pre-processing step
S220: voice pre-processing step
S300: preliminary reasoning step
S310: conversation state inference step
S320: hand detection inference step
S330: face overlap check step
S400: main reasoning step
S410: sub-feature map generation step
S420: multi-modal feature map generation step
S430: emotion recognition inference step
S500: result derivation step
A₁: eye recognition area
A₂: nose recognition area
A₃: mouth recognition area
B₁, B₂, B₃: tracking target areas

Claims

An emotion recognition method of processing an image to determine the emotional state of a person, the method comprising:
providing an image and audio representing the appearance of a person, wherein the image includes a first image portion, a second image portion immediately following the first image portion, and a third image portion immediately following the second image portion;
processing the first image portion to determine the emotional state of the person in the first image portion, wherein the face and at least one hand of the person are shown in the first image portion and the at least one hand does not overlap any part of the face of the person; and
processing the second image portion to determine the emotional state of the person in the second image portion, wherein the face and at least one hand of the person are shown in the second image portion and the at least one hand overlaps the face of the person,
wherein the processing of the first image portion comprises:
processing at least one frame of the first image portion to determine whether the at least one hand covers the face of the person;
locating a first face element of the person in the at least one frame of the first image portion;
acquiring first facial feature data of the first image portion based on the shape of the first face element shown in the at least one frame of the first image portion, with the first face element located;
processing audio data of the first image portion to acquire voice feature data based on characteristics of the person's voice in the first image portion; and
determining the emotional state of the person for the first image portion based on a plurality of data including the first facial feature data and the voice feature data of the first image portion, and
wherein the processing of the second image portion comprises:
processing at least one frame of the second image portion to determine whether the at least one hand covers the face of the person in the second image portion;
locating the first face element of the person in the at least one frame of the second image portion;
acquiring first facial feature data of the second image portion based on the shape of the first face element shown in the at least one frame of the second image portion, with the first face element located;
processing audio data of the second image portion to acquire voice feature data based on characteristics of the person's voice in the second image portion; and
determining the emotional state of the person for the second image portion based on a plurality of data including the first facial feature data of the second image portion, the voice feature data of the second image portion, and additional data indicating the location at which the at least one hand covers a part of the face of the person.

The method of claim 1,
wherein the determining of the emotional state of the person for the second image portion based on the plurality of data comprises,
when a part of the face of the person is covered by the at least one hand in the second image portion, placing more weight on the voice feature data of the second image portion than on the first facial feature data of the second image portion.

The method of claim 1,
wherein the determining of the emotional state of the person for the second image portion based on the plurality of data comprises,
when no part of the face of the person is covered by the at least one hand in the first image portion and a part of the face of the person is covered by the at least one hand in the second image portion, placing more weight on the voice feature data of the second image portion than on the voice feature data of the first image portion.

The method of claim 1,
wherein the processing of the first image portion further comprises:
locating a second face element of the person in the at least one frame of the first image portion; and
acquiring second facial feature data of the first image portion based on the shape of the second face element shown in the at least one frame of the first image portion, with the second face element located,
wherein the emotional state of the person for the first image portion is determined based on a plurality of data including the first facial feature data of the first image portion, the second facial feature data of the first image portion, and the voice feature data of the first image portion,
wherein the processing of the second image portion further comprises:
locating the second face element of the person in the at least one frame of the second image portion, and determining whether the second face element is covered by the at least one hand; and
acquiring second facial feature data of the second image portion based on a predetermined weight for the occlusion of the second face element and on the second face element of the first image portion, and
wherein the emotional state of the person for the second image portion is determined based on the first facial feature data of the second image portion, the second facial feature data of the second image portion, the voice feature data of the second image portion, and the additional data indicating the location at which the at least one hand covers a part of the face of the person.

The method of claim 1,
further comprising processing the third image portion to determine the emotional state of the person for the third image portion, wherein the face and at least one hand of the person are shown in the third image portion and no part of the face of the person is covered by the at least one hand,
wherein the processing of the third image portion comprises:
processing at least one frame of the third image portion to determine whether the at least one hand covers the face of the person;
locating the first face element of the person in the at least one frame of the third image portion;
acquiring first facial feature data of the third image portion based on the shape of the first face element shown in the at least one frame of the third image portion, with the first face element located;
processing audio data of the third image portion to acquire voice feature data of the third image portion based on characteristics of the person's voice in the third image portion; and
determining the emotional state of the person for the third image portion based on a plurality of data including the first facial feature data and the voice feature data of the third image portion.

A computer readable storage medium storing instructions that, when executed by a computer, perform the method of claim 1.