KR102428916B1

KR102428916B1 - Method And Apparatus for Classifying Emotion Based on Multi-Modal

Info

Publication number: KR102428916B1
Application number: KR1020200157046A
Authority: KR
Inventors: 노희열; 이선행; 이유중; 장동영; 윤찬녕; 신영하
Original assignee: 오로라월드 주식회사
Priority date: 2020-11-20
Filing date: 2020-11-20
Publication date: 2022-08-03
Also published as: KR20220069677A; WO2022108057A1

Abstract

The present disclosure provides an apparatus and method for classifying emotions.
According to one aspect of the present disclosure, an apparatus and method are provided that verify an emotion classification predicted from an image against an emotion classification measured from voice, and update an emotion classification model based on the verification accuracy.

Description

{Method And Apparatus for Classifying Emotion Based on Multi-Modal}

The present invention relates to a multi-modal emotion classification apparatus and method.

The content described in this section merely provides background information for the present embodiment and does not constitute prior art.

In order to accurately recognize human emotions from a human face model, various biometric data are used, such as voice, facial expression, brain waves, electromyography, skin resistance, body temperature, and electrocardiogram. Despite the use of such varied data, accurate emotion recognition has remained difficult. To address this difficulty, multi-modal classification systems have been proposed that perform emotion classification by combining information received from various sensory channels such as sight, hearing, touch, taste, and smell.

Korean Patent Application Publication No. 10-2017-0094836 (2017.08.22)

A main object of one aspect of the present disclosure is to provide an apparatus and method that verify an emotion classification predicted from an image against an emotion classification measured from voice, and update an emotion classification model based on the verification accuracy.

According to one aspect of the present disclosure, there is provided a multi-modal emotion classification apparatus comprising: a classification prediction unit that, for a real image, a thermal image, and voice data corresponding to the same emotion classification, recognizes the facial expression of an object included in the real image to extract expression information and predicts the emotion classification of that object using a clustering-based emotion classification model, and extracts temperature information of an object included in the thermal image and predicts the emotion classification of that object using the emotion classification model; a classification measurement unit that measures the emotion classification indicated by the voice data using a band-pass filter that filters a frequency band according to the emotion classification of an object; and an update unit that verifies each of the emotion classification predicted from the real image and the emotion classification predicted from the thermal image against the emotion classification measured from the voice data, and updates the emotion classification model based on the verification result.

According to another aspect of the present disclosure, there is provided a multi-modal emotion classification method comprising, for a real image, a thermal image, and voice data corresponding to the same emotion classification: recognizing the facial expression of an object included in the real image to extract expression information, and predicting the emotion classification of that object from a clustering-based emotion classification model; extracting temperature information of an object included in the thermal image, and predicting the emotion classification of that object from the emotion classification model; measuring the emotion classification indicated by the voice data using a band-pass filter that filters a frequency band according to the emotion classification of an object; and verifying each of the emotion classification predicted from the real image and the emotion classification predicted from the thermal image against the emotion classification measured from the voice data, and updating the emotion classification model based on the verification result.

According to another aspect of the present disclosure, there is provided a multi-modal emotion classification apparatus in which, in the emotion classification apparatus described above, the update unit updates the emotion classification model by updating the weight corresponding to each value, based on the verification accuracy or verification reliability of each of the emotion classification predicted for the first value and the emotion classification predicted for the second value.

According to yet another aspect of the present disclosure, there is provided a multi-modal emotion classification apparatus in which, when the verification accuracy is below a preset accuracy or the verification reliability is below a preset reliability, the update unit changes the weights of the emotion classification model, and the classification prediction unit re-predicts the emotion classification from the expression information and the temperature information using the changed emotion classification model.

According to one aspect of the present disclosure, an emotion classification predicted from a real image (normal image) and a thermal image is verified against an emotion classification measured from voice, and the emotion classification model is updated based on the verification accuracy. This has the effect of providing a multi-modal emotion classification apparatus and method based on real image, thermal image, and voice data.

According to another aspect of the present disclosure, the verification accuracy is calculated from the emotion classification obtained for each modality, and the weight corresponding to each modality of the emotion classification model employed by the emotion classification apparatus and method is changed accordingly. This has the effect of improving the accuracy and/or reliability of emotion classification.

FIG. 1 is a block diagram illustrating an emotion classification apparatus according to an embodiment of the present disclosure.
FIG. 2 is a conceptual diagram illustrating a method of predicting an emotion classification from an image according to an embodiment of the present disclosure.
FIG. 3 is a conceptual diagram illustrating a method of measuring an emotion classification from voice data according to an embodiment of the present disclosure.
FIG. 4 is a flowchart illustrating an emotion classification method according to an embodiment of the present disclosure.
FIG. 5 is an exemplary diagram of an emotion classification result calculated by the emotion classification apparatus according to an embodiment of the present disclosure.

Hereinafter, some embodiments of the present disclosure are described in detail with reference to exemplary drawings. In assigning reference numerals to the components of each drawing, note that the same components are given the same reference numerals wherever possible, even when they appear in different drawings. In addition, in describing the present disclosure, a detailed description of a related known configuration or function is omitted where it is judged that it could obscure the gist of the present disclosure.

In addition, terms such as first and second may be used in describing the components of the present disclosure. These terms serve only to distinguish one component from another, and they do not limit the nature, sequence, or order of the components. Throughout the specification, when a part is said to 'include' or 'comprise' a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as '... unit' and 'module' used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

The detailed description set forth below together with the accompanying drawings is intended to describe exemplary embodiments of the present disclosure and is not intended to represent the only embodiments in which the present disclosure may be practiced.

In the present disclosure, an image includes both a still image and a video that contain an object, for example a human face.

FIG. 1 is a block diagram illustrating an emotion classification apparatus according to an embodiment of the present disclosure.

The emotion classification apparatus 100 according to an embodiment of the present disclosure includes all or some of a classification prediction unit 102, a classification measurement unit 104, and an update unit 106. The emotion classification apparatus 100 shown in FIG. 1 is according to one embodiment of the present disclosure; not all components shown in FIG. 1 are essential, and in other embodiments some components may be added, changed, or deleted. For example, in another embodiment the emotion classification apparatus may further include a display unit (not shown) that displays the predicted emotion classification result after verification.

FIG. 1 shows the emotion classification apparatus 100 as a device only for convenience of description. In another embodiment, the emotion classification apparatus may be implemented as a software module or a processor that performs the functions of the respective components 102 to 106.

For a real image (normal image), a thermal image, and voice data corresponding to the same emotion classification, the classification prediction unit 102 recognizes the facial expression of an object included in the real image to extract expression information, extracts temperature information of an object included in the thermal image, and predicts the emotion classification of each object from a clustering-based emotion classification model. Specifically, the classification prediction unit 102 uses the emotion classification model to predict an emotion classification for each of a first value, which is a numerical representation of the expression information extracted from the real image, and a second value, which is the temperature information extracted from the thermal image. Here, the emotion classification model may predict the emotion classification by predicting, for one or more emotion classifications, the probability of belonging to each of them. The emotion classifications may be happy, sad, scary, angry, surprised, and so on, and may further include neutral, an unclassified state that belongs to none of the others, but they are not limited to these. For example, any classification that can ordinarily be used in an emotion classification model may be adopted, such as touching, hopeful, adorable, satisfied, confused, hateful, or envious.
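The text does not say how the clustering-based model turns a value into per-class probabilities. As a minimal sketch, assuming each emotion classification is represented by a cluster centroid and that probabilities come from a softmax over negative Euclidean distances, the prediction could look like the following. The function name, the centroid values, and the `temperature` parameter are all hypothetical.

```python
import math


def predict_emotion_probabilities(value, centroids, temperature=1.0):
    """Soft-assign a feature vector to emotion clusters.

    Assumption: each emotion is one centroid; the patent does not specify this.
    Returns a dict mapping emotion label -> probability.
    """
    # Closer centroids get higher (less negative) scores.
    scores = {
        label: -math.dist(value, centroid) / temperature
        for label, centroid in centroids.items()
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - max_score) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical two-dimensional centroids, for illustration only.
centroids = {"happy": [0.9, 0.1], "sad": [0.1, 0.8], "neutral": [0.5, 0.5]}
probs = predict_emotion_probabilities([0.85, 0.15], centroids)
best = max(probs, key=probs.get)
```

The same function could be applied separately to the first value (expression) and the second value (temperature), which matches the per-value prediction described above.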

The classification prediction unit 102 may extract the face position of the object from the real image, extract information (e.g., position, area) on face landmarks (e.g., eyes, nose, mouth, cheekbones, forehead) based on the extracted face position, and obtain the first value from a pre-trained expression value extraction model. The face landmark information may be obtained by converting the real image to gray scale and extracting the boundary lines, that is, the edges, of the image. The first value may be obtained by converting the extracted edge values into a two-dimensional matrix, converting the two-dimensional matrix into a three-dimensional matrix using the camera focus value, and inputting the three-dimensional matrix into the pre-trained expression value extraction model.

When the real image and the thermal image capture a common object, the classification prediction unit 102 sets one or more regions of interest (ROI) based on the face position of the object extracted from the real image, and extracts the temperature of each region of interest as the second value. The second value may be obtained by weighting the face position with reference to the face position recognized by the visible-light camera to obtain the thermal image corresponding to the face, and then extracting, from that thermal image of the face, the temperature of the regions of interest corresponding to the face landmarks.

The classification measurement unit 104 measures the emotion classification from voice data using a band-pass filter that filters a frequency band according to the emotional state. Here, the band-pass filter may be generated based on the average frequency for each emotion classification. As another example, the band-pass filter may be generated by analyzing the features common to two graphs (e.g., the graph shape in the same region): a first graph obtained by applying a fast Fourier transform (FFT) to the voice data in terms of amplitude and frequency, and a second graph obtained by applying a short-time Fourier transform (STFT) to the same voice data in terms of time and amplitude.

The update unit 106 verifies the emotion classification predicted from images, including the real image and the thermal image, against the emotion classification measured from the voice data, and updates the emotion classification model based on the verification accuracy. Specifically, this update may be performed by verifying the emotion classification predicted by the classification prediction unit 102 against the emotion classification measured by the classification measurement unit 104, and updating the weight corresponding to each modality of the emotion classification model based on the verification accuracy. As another example, the update may be performed by updating the weight corresponding to each value of the emotion classification model based on the verification accuracy of each of the emotion classification predicted for the first value and the emotion classification predicted for the second value described above.
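The update rule itself is not given in the text. The following is a minimal sketch of one possible rule, assuming each modality has a single scalar weight that is moved toward that modality's verification accuracy and then renormalized so the weights sum to 1. The function name, the `learning_rate` parameter, and the sample accuracies are hypothetical.

```python
def update_modality_weights(weights, accuracies, learning_rate=0.5):
    """Blend each modality weight with its verification accuracy, then renormalize.

    Assumption: one scalar weight per modality; the patent does not fix the rule.
    """
    blended = {
        modality: (1.0 - learning_rate) * weight + learning_rate * accuracies[modality]
        for modality, weight in weights.items()
    }
    total = sum(blended.values())
    if total == 0:
        raise ValueError("all blended weights are zero; cannot renormalize")
    return {modality: value / total for modality, value in blended.items()}


# Hypothetical starting weights and verification accuracies.
weights = {"real_image": 0.5, "thermal_image": 0.5}
accuracies = {"real_image": 0.9, "thermal_image": 0.6}
new_weights = update_modality_weights(weights, accuracies)
```

With these inputs the real-image modality, which agreed with the voice measurement more often, ends up with the larger weight.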

When the verification accuracy is below a preset accuracy or the verification reliability is below a preset reliability, it is preferable that the classification prediction unit 102 re-predict the emotion classification and that the update unit 106 verify the re-predicted emotion classification and update the emotion classification model.

FIG. 2 is a conceptual diagram illustrating a method of predicting an emotion classification from an image according to an embodiment of the present disclosure.

When the emotion classification apparatus acquires a real image from a camera, a visible light sensor, or the like, it recognizes the position of the face in the real image and extracts features from the face. The position of the face may be recognized by signal and/or data processing of the real image using all or some of, for example, normalization, binarization, outline extraction, and a recognition classifier (e.g., Haar, HOG). The apparatus then converts to gray scale either the original real image or the real image that was processed to track the face position, and extracts boundary lines from the converted image to obtain face landmark information for the eyes, nose, mouth, forehead, cheekbones, and so on. Next, the emotion classification apparatus converts the face landmark information into a two-dimensional matrix, and converts that two-dimensional matrix into a three-dimensional matrix using the focus value of the camera or visible light sensor that acquired the real image. The emotion classification apparatus extracts a facial expression value from the three-dimensional matrix using a pre-trained expression value extraction model, and uses the emotion classification model to predict the probability of belonging to each emotion classification that the model can classify.
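The gray-scale conversion and edge extraction steps can be sketched as follows. This is only an illustration under assumptions: it uses ITU-R BT.601 luminance weights and a simple central-difference gradient with a hypothetical threshold, whereas the text does not name a specific edge detector. The later conversion to a three-dimensional matrix using the focus value is not specified in enough detail to sketch and is left out.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) pixels to luminance using BT.601 weights."""
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in rgb_image
    ]


def edge_matrix(gray, threshold=50.0):
    """Return a 2-D 0/1 matrix marking pixels with a strong intensity gradient.

    Border pixels are left at 0 because the central difference needs neighbours.
    """
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]  # horizontal gradient
            gy = gray[y + 1][x] - gray[y - 1][x]  # vertical gradient
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                edges[y][x] = 1
    return edges


# Synthetic 5x5 image: two dark columns on the left, three bright on the right.
image = [[(0, 0, 0)] * 2 + [(255, 255, 255)] * 3 for _ in range(5)]
edges = edge_matrix(to_grayscale(image))
```

The resulting matrix marks the columns on either side of the dark-to-bright boundary, which is the kind of two-dimensional edge matrix the paragraph above feeds into later steps.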

To implement an emotion classification model based on multiple modalities, the emotion classification apparatus may acquire a thermal image from a thermal imaging camera or the like and obtain, from that thermal image, temperature information for regions of interest, for example regions such as the eyes and mouth that can be expected to influence the emotion classification. To obtain the temperature information of the regions of interest, the emotion classification apparatus recognizes the region of the whole thermal image where the face is located, for example by weighting image positions with reference to the face position information recognized by a camera or visible light sensor. The emotion classification apparatus sets regions of interest (ROI) using the recognized face region, extracts the temperature of the regions corresponding to the eyes, nose, mouth, cheekbones, forehead, and so on, and uses the emotion classification model to predict the probability of belonging to each emotion classification that the model can classify.
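A minimal sketch of the region-of-interest temperature extraction follows. It assumes the real image and the thermal image are pixel-aligned, so that a face box found in the real image can be reused directly on the thermal image, and it assumes each ROI is given as fractions of the face box. Both assumptions, the function name, and the sample temperatures are hypothetical.

```python
def extract_roi_temperatures(thermal, face_box, relative_rois):
    """Return the mean temperature of each ROI inside the face box.

    thermal: 2-D list of temperatures (one value per pixel).
    face_box: (top, left, bottom, right) in pixels, taken from the real image.
    relative_rois: label -> (top, left, bottom, right) as fractions of the face box.
    """
    face_top, face_left, face_bottom, face_right = face_box
    height = face_bottom - face_top
    width = face_right - face_left
    temperatures = {}
    for label, (top, left, bottom, right) in relative_rois.items():
        y0 = face_top + round(top * height)
        y1 = face_top + round(bottom * height)
        x0 = face_left + round(left * width)
        x1 = face_left + round(right * width)
        pixels = [thermal[y][x] for y in range(y0, y1) for x in range(x0, x1)]
        if not pixels:
            raise ValueError(f"ROI '{label}' covers no pixels")
        temperatures[label] = sum(pixels) / len(pixels)
    return temperatures


# Synthetic 4x4 thermal image in degrees Celsius: cooler upper half, warmer lower half.
thermal = [[34.0] * 4, [34.0] * 4, [36.0] * 4, [36.0] * 4]
rois = {"eyes": (0.0, 0.0, 0.5, 1.0), "mouth": (0.5, 0.0, 1.0, 1.0)}
temps = extract_roi_temperatures(thermal, (0, 0, 4, 4), rois)
```

The resulting dictionary plays the role of the second value that is passed to the emotion classification model.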

FIG. 3 is a conceptual diagram illustrating a method of measuring an emotion classification from voice data according to an embodiment of the present disclosure.

The emotion classification apparatus acquires voice information corresponding to various emotion classifications from a microphone or the like, and processes this voice information to generate a band-pass filter. The apparatus obtains voice data by converting the voice information into digital information in a way that preserves its frequency-domain characteristics. It then suppresses noise in the frequency domain of the voice data using a noise removal filter (e.g., a Gaussian filter), and extracts and quantifies pitch, word segments (word clauses), speech rate, and so on from either the original or the noise-suppressed voice data. A fast Fourier transform (FFT) and a short-time Fourier transform (STFT) are each applied to the voice data, the graphs produced by these transforms are analyzed for features in the region common to both, the average frequency for each emotional state is calculated, and the band-pass filter is generated using the calculated average frequency.
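One way to read the filter generation step is sketched below. It assumes the "average frequency" for an emotion is the mean of the dominant frequencies of that emotion's training clips, and that the pass band is a fixed half-width around that mean. Those two choices, the helper names, and the synthetic tones are hypothetical. A naive discrete Fourier transform is used so the example needs only the standard library; a real implementation would use an FFT routine and would also incorporate the STFT comparison, which is omitted here.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC bin of a naive DFT."""
    n = len(samples)
    best_bin, best_magnitude = 1, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * cmath.pi * k * t / n) for t, s in enumerate(samples))
        if abs(coeff) > best_magnitude:
            best_bin, best_magnitude = k, abs(coeff)
    return best_bin * sample_rate / n


def build_band_filters(labelled_clips, sample_rate, half_width_hz):
    """Map each emotion label to a (low, high) pass band around its mean frequency."""
    per_label = {}
    for label, samples in labelled_clips:
        per_label.setdefault(label, []).append(dominant_frequency(samples, sample_rate))
    bands = {}
    for label, freqs in per_label.items():
        mean_freq = sum(freqs) / len(freqs)
        bands[label] = (mean_freq - half_width_hz, mean_freq + half_width_hz)
    return bands


def tone(freq_hz, sample_rate=64, n=64):
    """Synthetic pure tone standing in for a recorded voice clip."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


clips = [("sad", tone(4)), ("sad", tone(6)), ("happy", tone(12))]
filters = build_band_filters(clips, sample_rate=64, half_width_hz=2.0)
```

Here the two "sad" clips average to 5 Hz, giving a band of 3 to 7 Hz, while the single "happy" clip gives a band of 10 to 14 Hz. Real speech frequencies would of course be far higher; the low values only keep the naive transform small.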

The emotion classification apparatus applies a Fourier transform to the voice data to be used for verification, applies the band-pass filter to the transformed voice data, and measures the probability value of each emotion classification that the band-pass filter can classify.
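The measurement step can be sketched as follows, assuming the probability of each emotion is taken to be the share of spectral energy that falls inside that emotion's pass band. This normalization is an assumption, as are the function names and the band values. As before, a naive DFT stands in for an FFT so the sketch is self-contained.

```python
import cmath
import math


def power_spectrum(samples, sample_rate):
    """Return (frequency_hz, power) pairs for the non-negative DFT bins."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * cmath.pi * k * t / n) for t, s in enumerate(samples))
        spectrum.append((k * sample_rate / n, abs(coeff) ** 2))
    return spectrum


def measure_emotion(samples, sample_rate, bands):
    """Estimate per-emotion probabilities as the share of energy in each pass band."""
    spectrum = power_spectrum(samples, sample_rate)
    energy = {
        label: sum(power for freq, power in spectrum if low <= freq <= high)
        for label, (low, high) in bands.items()
    }
    total = sum(energy.values())
    if total == 0:
        return {label: 0.0 for label in bands}
    return {label: e / total for label, e in energy.items()}


# Hypothetical pass bands and a synthetic 12 Hz tone as the voice sample.
bands = {"sad": (3.0, 7.0), "happy": (10.0, 14.0)}
sample_rate = 64
voice = [math.sin(2 * math.pi * 12 * t / sample_rate) for t in range(64)]
measured = measure_emotion(voice, sample_rate, bands)
```

Because the tone lies inside the "happy" band, nearly all of the measured probability is assigned to that classification.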

FIG. 4 is a flowchart illustrating an emotion classification method according to an embodiment of the present disclosure.

The emotion classification apparatus acquires a real image and a thermal image corresponding to the same emotion classification and uses them to predict the emotion classification (S400). The emotion classification model is used for this prediction.

In addition, in parallel with step S400, or before or after it, the emotion classification apparatus acquires voice data corresponding to the same emotion classification as the real image and the thermal image, and applies the band-pass filter to the acquired voice data to measure the emotion classification (S402). This measurement may be performed by calculating the probability value for each emotion classification.

The emotion classification apparatus verifies the emotion classification predicted from images, such as the real image and the thermal image, using the emotion classification measured from voice, such as the voice data (S404). The verification is performed by determining whether the predicted emotion classification and the measured emotion classification match within an error range (S406). It is preferable that this verification also take preset verification conditions (e.g., reliability, accuracy) into account.

If it is determined in step S406 that the predicted emotion classification and the measured emotion classification match within the error range, the predicted emotion classification is output as the emotion classification result and the procedure ends.

If it is determined that the predicted emotion classification and the measured emotion classification do not match, the emotion classification apparatus changes the weight corresponding to each modality of the emotion classification model (S408). It then uses the changed emotion classification model to re-predict the emotion classification from the real image and the thermal image acquired in step S400.
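The loop formed by steps S400 to S408 can be sketched as below. The sketch makes several assumptions not stated in the text: the model combines the two image modalities as a weighted sum of their probability vectors, "match within an error range" means every per-class difference is within a tolerance, and the weight change shifts a fixed step toward whichever modality is closer to the voice measurement. All names and numbers are hypothetical, and a round limit is added so the loop always terminates.

```python
def fuse(real_probs, thermal_probs, weights):
    """Weighted combination of the two image-based predictions (assumed model form)."""
    return {
        label: weights["real"] * real_probs[label] + weights["thermal"] * thermal_probs[label]
        for label in real_probs
    }


def matches(predicted, measured, tolerance):
    """S406: do prediction and measurement agree within the error range?"""
    return all(abs(predicted[label] - measured[label]) <= tolerance for label in predicted)


def classify_with_verification(real_probs, thermal_probs, measured, weights,
                               tolerance=0.1, step=0.1, max_rounds=10):
    for _ in range(max_rounds):
        predicted = fuse(real_probs, thermal_probs, weights)  # S400 (re-)prediction
        if matches(predicted, measured, tolerance):           # S404 / S406
            return predicted, weights
        # S408: move weight toward the modality that is closer to the voice measurement.
        err_real = sum(abs(real_probs[l] - measured[l]) for l in measured)
        err_thermal = sum(abs(thermal_probs[l] - measured[l]) for l in measured)
        delta = step if err_real < err_thermal else -step
        real_weight = min(1.0, max(0.0, weights["real"] + delta))
        weights = {"real": real_weight, "thermal": 1.0 - real_weight}
    return None, weights  # no agreement reached within max_rounds


real_probs = {"happy": 0.8, "sad": 0.2}
thermal_probs = {"happy": 0.4, "sad": 0.6}
measured = {"happy": 0.75, "sad": 0.25}
result, final_weights = classify_with_verification(
    real_probs, thermal_probs, measured, {"real": 0.5, "thermal": 0.5}
)
```

In this run the first two predictions miss the tolerance, the real-image weight rises from 0.5 to 0.7, and the third prediction is accepted.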

Although FIG. 4 describes the steps as being executed sequentially, this merely illustrates the technical idea of an embodiment of the present disclosure. In other words, a person of ordinary skill in the art to which an embodiment of the present disclosure pertains could apply various modifications and variations, such as changing the order shown in FIG. 4 or executing one or more of the steps in parallel, without departing from the essential characteristics of the embodiment. FIG. 4 is therefore not limited to a time-series order.

FIG. 5 is an exemplary diagram of an emotion classification result calculated by the emotion classification apparatus according to an embodiment of the present disclosure.

When verification shows that the emotion classification model satisfies a preset accuracy and/or a preset reliability, the emotion classification apparatus may output the emotion classification result of the emotion classification model as shown in FIG. 5.

It is preferable that the emotion classification result be produced in real time. For example, as shown in FIG. 5, real image, thermal image, and voice data are acquired from the object in real time, an expression value and a temperature are extracted from the real image and the thermal image respectively, and the emotion classification is predicted from the emotion classification model. The emotion classification apparatus then applies a Fourier transform to the voice data and passes it through the band-pass filter to measure the emotion classification corresponding to the voice data. The predicted emotion classification is verified against the measured emotion classification, and if the verification result is found valid, the percentage for each emotion classification is output as data values, a graph, or the like, as shown in FIG. 5. This output may be limited to those emotion classification results that were found valid. In addition, when the emotion classification corresponding to 'unclassified' is at or above a preset percentage, the verification result may be judged invalid and the output may be skipped.
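The output rule in the paragraph above can be sketched as a small function: probabilities are converted to percentages, and the result is withheld when the unclassified share reaches a preset limit. The function name, the use of `neutral` as the unclassified label, and the 50 percent limit are hypothetical choices for illustration.

```python
def format_result(probabilities, unclassified_label="neutral", max_unclassified_percent=50.0):
    """Return per-emotion percentages, or None when the result is judged invalid."""
    percents = {label: round(100.0 * p, 1) for label, p in probabilities.items()}
    if percents.get(unclassified_label, 0.0) >= max_unclassified_percent:
        return None  # too much of the result is unclassified; skip the output
    return percents


shown = format_result({"happy": 0.7, "sad": 0.2, "neutral": 0.1})
skipped = format_result({"happy": 0.3, "neutral": 0.7})
```

The first call yields a percentage per emotion, while the second returns `None` because the unclassified share exceeds the limit.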

Various implementations of the apparatuses, units, processes, steps, and the like described herein may be realized in digital electronic circuits, integrated circuits, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation as one or more computer programs executable on a programmable system. The programmable system includes at least one programmable processor (which may be a special-purpose or general-purpose processor) coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. Computer programs (also known as programs, software, software applications, or code) contain instructions for a programmable processor and are stored on a "computer-readable recording medium".

The computer-readable recording medium includes all types of recording devices in which data readable by a computer system is stored. Such a computer-readable recording medium may be a non-volatile or non-transitory medium such as a ROM, CD-ROM, magnetic tape, floppy disk, memory card, hard disk, magneto-optical disk, or storage device, and may further include a transitory medium such as a data transmission medium. In addition, the computer-readable recording medium may be distributed over network-connected computer systems so that computer-readable code is stored and executed in a distributed manner.

Various implementations of the systems and techniques described herein may be implemented by a programmable computer. Here, the computer includes a programmable processor, a data storage system (including volatile memory, non-volatile memory, another type of storage system, or a combination thereof), and at least one communication interface. For example, the programmable computer may be one of a server, a network appliance, a set-top box, an embedded device, a computer expansion module, a personal computer, a laptop, a personal digital assistant (PDA), a cloud computing system, or a mobile device.

The above description merely illustrates the technical idea of the present embodiment, and a person of ordinary skill in the art to which the present embodiment belongs may make various modifications and variations without departing from its essential characteristics. Accordingly, the present embodiments are intended to explain, not to limit, the technical idea of the present embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiment should be interpreted according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present embodiment.

100: emotion classification apparatus
102: classification prediction unit
104: classification measurement unit
106: update unit

Claims

1. A multi-modal-based emotion classification apparatus comprising:
a classification prediction unit that, for a real image, a thermal image, and voice data corresponding to the same emotion classification, recognizes a facial expression of an object included in the real image to extract expression information and predicts one or more emotion classifications of the object included in the real image using a clustering-based emotion classification model, and extracts temperature information of an object included in the thermal image and predicts one or more emotion classifications of the object included in the thermal image using the emotion classification model;
a classification measurement unit that measures, from the voice data, one or more emotion classifications indicated by the voice data using a band-pass filter that filters a frequency band according to the emotion classification of an object; and
an update unit that, based on the one or more emotion classifications indicated by the voice data as measured by the classification measurement unit, verifies for each emotion classification the one or more emotion classifications predicted by the classification prediction unit based on the real image and the one or more emotion classifications predicted based on the thermal image, respectively, and updates a weight of the emotion classification model based on the verification result.

2. The apparatus of claim 1,
wherein the classification prediction unit predicts, using the emotion classification model, an emotion classification for each of a first value obtained by quantifying the expression information and a second value that is the temperature information.

3. The apparatus of claim 2,
wherein the classification prediction unit extracts a face position of the object included in the real image, extracts facial landmark information from the real image, and calculates the first value from the facial landmark information using a pre-trained expression value extraction model.

4. The apparatus of claim 3,
wherein the facial landmark information is extracted by converting the real image to grayscale and extracting edges of the image.

5. The apparatus of claim 4,
wherein the first value is calculated by converting values of the extracted edges into a two-dimensional matrix, converting the two-dimensional matrix into a three-dimensional matrix using a camera focus value, and inputting the three-dimensional matrix to the pre-trained expression value extraction model.

6. The apparatus of claim 2,
wherein the real image and the thermal image are images of a common object, and
the classification prediction unit extracts a face position of the common object from the real image to set one or more regions of interest (ROIs), and extracts temperature information for each region of interest as the second value.

7. The apparatus of claim 1,
wherein the emotion classification prediction using the emotion classification model is performed by predicting, for the one or more emotion classifications, a probability corresponding to each emotion classification.

8. The apparatus of claim 1,
wherein the band-pass filter is generated based on an average value of each frequency corresponding to the one or more emotion classifications.

9. The apparatus of claim 1,
wherein the band-pass filter is generated based on a feature of a region common to a first graph, which represents a result of a Fast Fourier Transform (FFT) of one or more pieces of voice data in terms of amplitude and frequency, and a second graph, which represents a result of a Short-Time Fourier Transform (STFT) of the voice data in terms of time and amplitude.

10. The apparatus of claim 2,
wherein the update unit updates the emotion classification model by updating a weight corresponding to each value based on a verification accuracy or a verification reliability of each of the emotion classification predicted for the first value and the emotion classification predicted for the second value.

11. The apparatus of claim 10,
wherein, when the verification accuracy is less than a preset accuracy or the verification reliability is less than a preset reliability, the update unit changes the weight of the emotion classification model, and
the classification prediction unit re-predicts the emotion classification based on the expression information and the temperature information using the changed emotion classification model.

12. A multi-modal-based emotion classification method comprising, for a real image, a thermal image, and voice data corresponding to the same emotion classification:
recognizing a facial expression of an object included in the real image to extract expression information, and predicting one or more emotion classifications of the object included in the real image using a clustering-based emotion classification model;
extracting temperature information of an object included in the thermal image, and predicting one or more emotion classifications of the object included in the thermal image using the emotion classification model;
measuring, from the voice data, one or more emotion classifications indicated by the voice data using a band-pass filter that filters a frequency band according to the emotion classification of an object; and
verifying, based on the one or more emotion classifications indicated by the voice data, the one or more emotion classifications predicted based on the real image and the one or more emotion classifications predicted based on the thermal image, respectively, and updating a weight of the emotion classification model based on the verification result.