KR20210076528A - Method and apparatus for recognizing emotion

Info

Publication number: KR20210076528A
Application number: KR1020190167843A
Authority: KR
Inventors: 나인섭; 이신우
Original assignee: 조선대학교산학협력단
Priority date: 2019-12-16
Filing date: 2019-12-16
Publication date: 2021-06-24
Also published as: KR102305613B1

Abstract

Disclosed are a method and device for recognizing an emotion by executing an artificial intelligence (AI) algorithm and/or a machine learning algorithm in a connected 5G environment for the Internet of Things. The emotion recognition method according to an embodiment of the present invention comprises: a step of obtaining an image containing a face; a step of preprocessing the image; a step of outputting an emotion shown in the image by applying a plurality of deep neural network models trained in advance to estimate the emotion shown in the face; and a step of performing weighted majority voting on output values from the deep neural network models to determine the emotion shown by the face.

Description

Emotion recognition method and device {METHOD AND APPARATUS FOR RECOGNIZING EMOTION}

The present invention relates to an emotion recognition method and apparatus that recognize emotion by executing an on-board artificial intelligence (AI) algorithm and/or machine learning algorithm.

Computers have become an important part of everyday human life and provide convenience in many forms. The closeness of, and interaction between, computers and humans is expected to keep increasing. For interaction between humans and computers to feel natural, a computer must judge the user's intent comprehensively and respond accordingly. Emotion is the most important element expressing a person's state of mind, so recognizing the user's emotion is important for maximizing user satisfaction.

Emotion recognition was formerly a difficult problem to solve. Now, however, the earlier problems of artificial neural networks have been resolved, and advances in hardware have made it possible to put into practice methodologies that were previously treated only in theory. In addition, with the advent of ImageNet, the high-quality data needed for deep learning became easy to obtain, and research on natural image processing became more active.

The convolutional neural network (CNN) used in emotion recognition was originally developed for image processing. CNNs are widely used in image processing because of two advantages: a CNN does not need to recognize the whole image at once but only local parts of it, and it keeps the same kernel weights wherever the same feature appears. Both properties make image processing more efficient.

However, because emotion recognition must detect many different changes in the facial muscles, processing emotion with a single CNN may be limiting. There is therefore a need to perform accurate emotion recognition through a combined evaluation of models trained with several CNNs rather than one.

The background art described above is technical information that the inventors possessed in order to derive the present invention or acquired in the course of deriving it, and it is not necessarily known art disclosed to the general public before the filing of the present invention.

One object of the present disclosure is to enable emotion prediction that reflects weighted majority voting by executing an on-board artificial intelligence (AI) algorithm and/or machine learning algorithm.

One object of the present disclosure is to enable emotion recognition through more accurate emotion classification by using a multi-step feature-based multiple deep learning technique.

One object of the present disclosure is to predict emotion by analyzing the facial expression in an image that includes a face in detail, using a plurality of CNN-based multi-layer models.

One object of the present disclosure is to perform emotion prediction with a plurality of CNN-based multi-layer models and thereby improve accuracy over emotion prediction based on facial expressions extracted by a single CNN.

The objects of the embodiments of the present disclosure are not limited to those mentioned above. Other objects and advantages of the present invention that are not mentioned can be understood from the following description and will be understood more clearly from the embodiments of the present invention. It will also be appreciated that the objects and advantages of the present invention can be realized by the means indicated in the claims and combinations thereof.

An emotion recognition method according to an embodiment of the present disclosure may include a step of executing an on-board artificial intelligence (AI) algorithm and/or machine learning algorithm so that emotion can be recognized through detailed facial expression analysis.

Specifically, an emotion recognition method according to an embodiment of the present disclosure may include: obtaining an image that includes a face; preprocessing the image; outputting the emotion shown in the image by applying a plurality of deep neural network models trained in advance to estimate the emotion shown on a face; and determining the emotion expressed by the face by performing weighted majority voting on the output values from the plurality of deep neural network models.
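The final voting step can be illustrated with a minimal Python sketch. This part of the description does not state how the per-model weights are chosen, so the weights below are hypothetical (for example, each model's validation accuracy), as are the function name `weighted_majority_vote` and the emotion labels.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Return the label whose supporting models carry the largest total weight.

    predictions: one predicted emotion label per model.
    weights: one non-negative weight per model (hypothetical values here).
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # max() returns the first key with the largest total; dicts keep
    # insertion order, so a tie goes to the label that was seen first.
    return max(totals, key=totals.get)


# Three hypothetical models: two say "sad", but the single "happy" vote
# comes from the model with the largest weight.
votes = ["happy", "sad", "sad"]
model_weights = [0.9, 0.4, 0.3]
winner = weighted_majority_vote(votes, model_weights)
unweighted_winner = weighted_majority_vote(votes, [1.0, 1.0, 1.0])
```

With equal weights the function reduces to plain majority voting, which is why the two calls above can disagree.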

With the emotion recognition method according to an embodiment of the present disclosure, the facial expression in an image that includes a face is analyzed in detail using multiple deep neural network models based on multi-step feature extraction, and the final emotion is predicted by applying weighted majority voting, so that emotion recognition can be performed more accurately.

In addition, other methods and other systems for implementing the present invention, and a computer-readable recording medium storing a computer program for executing the method, may be further provided.

Aspects, features, and advantages other than those described above will become apparent from the following drawings, claims, and detailed description of the invention.

According to an embodiment of the present disclosure, emotion recognition performance can be improved by analyzing facial expressions in detail using a multi-step feature-based multiple deep learning technique.

In addition, multi-step feature extraction is performed on the basis of a plurality of deep neural network models with different network structures, and the final emotion is predicted by applying weighted majority voting to the multi-step feature extraction results, so that emotion recognition can be performed more accurately.

In addition, performing emotion prediction with a plurality of CNN-based multi-layer models can improve accuracy over emotion prediction based on facial expressions extracted by a single CNN.

In addition, implementing emotion recognition based on facial feature point detection with deep learning allows the technique to be applied to a variety of face-based image services.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a block diagram schematically illustrating an emotion recognition apparatus according to an embodiment of the present disclosure.
FIG. 2 is a block diagram schematically illustrating a processing unit according to an embodiment of the present disclosure.
FIG. 3 is an exemplary diagram of input images according to an embodiment of the present disclosure.
FIG. 4 is an exemplary diagram schematically illustrating a multi-step feature extraction process according to an embodiment of the present disclosure.
FIG. 5 is an exemplary diagram for explaining a multi-step feature extraction process according to an embodiment of the present disclosure.
FIG. 6 is a flowchart illustrating an emotion recognition method according to an embodiment of the present disclosure.

The advantages and features of the present invention, and the methods for achieving them, will become clear with reference to the embodiments described in detail together with the accompanying drawings. The present invention is not, however, limited to the embodiments presented below. It may be implemented in a variety of different forms and should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope. The embodiments presented below are provided so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the invention pertains is fully informed of its scope. In describing the present invention, a detailed description of related known technology is omitted where it is judged that it could obscure the gist of the invention.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "include" or "have" are intended to specify that a feature, number, step, operation, component, part, or combination thereof described in the specification is present, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof. Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

Hereinafter, embodiments according to the present invention are described in detail with reference to the accompanying drawings. In the description, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

First, the present embodiment relates to a method of recognizing emotion in a face image on the basis of a plurality of deep neural network models trained in advance. In particular, in the present embodiment, weighted majority voting may be performed on the step-by-step feature point analysis results obtained with the plurality of deep neural network models, enabling accurate emotion recognition. In the present embodiment, the deep neural network model may be a convolutional neural network (CNN).

The network structure of a CNN may include convolutional layers and pooling layers. Convolution is an operation for analyzing how a system responds when an input is applied to it. In image processing, convolution is used mainly for filter operations and can be used to implement filters that extract specific features from an image. For example, when a window or mask of size 3×3 or larger is applied repeatedly across the entire image, a suitable result is obtained depending on the coefficient (weight) values of the mask. The input and output data of a convolutional layer may be called feature maps.
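As a minimal illustration of sliding a 3×3 mask over an image, the following pure-Python sketch computes a "valid" (no padding) feature map. As in most CNN libraries, the mask is applied without flipping, which is strictly a cross-correlation. The function name `conv2d_valid`, the toy image, and the vertical-edge mask are illustrative assumptions, not values from the patent.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of rows) without padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            # Weighted sum of the window whose top-left corner is (i, j).
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output


# Toy 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked mask that responds to left-to-right brightness increases.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = conv2d_valid(image, vertical_edge)
```

In a trained CNN the mask coefficients are learned rather than hand-picked as they are here.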

A pooling layer performs an operation that reduces the spatial size in the vertical and horizontal directions, and it can be used to help prevent overfitting. Pooling may include max pooling, average pooling, and the like. Max pooling takes the maximum value within the target region, and average pooling computes the mean of the target region.
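The difference between the two pooling operations can be seen in a short sketch. It assumes non-overlapping square windows (stride equal to the window size); the helper names `pool2d` and `mean` and the sample feature map are hypothetical.

```python
def pool2d(feature_map, size, reduce_fn):
    """Apply `reduce_fn` to each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


fm = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 1, 7, 8],
]
max_pooled = pool2d(fm, 2, max)   # keeps the strongest response per window
avg_pooled = pool2d(fm, 2, mean)  # keeps the average response per window
```

Both calls halve the height and width of the 4×4 input, which is the spatial reduction the paragraph above describes.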

In other words, a CNN can be divided into a feature extraction part, in which convolution layers and max-pooling layers are stacked repeatedly, and a classification part, which consists of fully connected layers (layers connected to every neuron of the previous layer) with softmax applied to the final output layer.
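The softmax at the end of the classification part turns the raw scores of the last fully connected layer into probabilities over the emotion classes. The sketch below assumes three hypothetical class labels and made-up scores; the patent text here does not list the classes.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


EMOTIONS = ["angry", "happy", "sad"]  # hypothetical label set
logits = [1.0, 3.0, 0.5]              # made-up output of the last layer
probs = softmax(logits)
predicted = EMOTIONS[probs.index(max(probs))]
```

The class with the highest probability is taken as that model's output, which is the value each model would contribute to the voting step.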

In a CNN, the filter coefficients of a convolution are not fixed. Their values are determined through training. In other words, the coefficients of the final convolution kernel can differ depending on the task to be handled by the CNN algorithm. Even for the same task, they can differ depending on the training data used and on the values of the hyperparameters that are set. The coefficient values may be determined by gradient-based back-propagation.
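To make the idea of learned coefficients concrete, the sketch below fits a 3-tap one-dimensional kernel by plain gradient descent on a squared-error loss. For a single linear layer, this gradient is exactly what back-propagation would compute. Everything here (the signal, the "true" kernel, the learning rate, and the step count) is an illustrative assumption rather than anything specified in the patent.

```python
def conv1d_valid(signal, kernel):
    """1-D sliding weighted sum without padding."""
    k = len(kernel)
    return [
        sum(signal[i + d] * kernel[d] for d in range(k))
        for i in range(len(signal) - k + 1)
    ]


def train_kernel(signal, target, kernel, lr=0.01, steps=500):
    """Gradient descent on loss = 0.5 * sum((output - target)^2)."""
    kernel = list(kernel)
    k = len(kernel)
    for _ in range(steps):
        output = conv1d_valid(signal, kernel)
        errors = [o - t for o, t in zip(output, target)]
        # d(loss)/d(kernel[d]) = sum_i errors[i] * signal[i + d]
        grads = [
            sum(errors[i] * signal[i + d] for i in range(len(errors)))
            for d in range(k)
        ]
        kernel = [w - lr * g for w, g in zip(kernel, grads)]
    return kernel


signal = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0, 0.5, -2.0]
true_kernel = [1.0, 0.0, -1.0]            # the filter we pretend not to know
target = conv1d_valid(signal, true_kernel)
learned = train_kernel(signal, target, [0.0, 0.0, 0.0])
```

Starting from all-zero coefficients, the kernel converges toward the filter that produced the targets. Different training data or a different learning rate would lead to a different trajectory, in line with the paragraph above.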

A CNN performs neural network operations that exploit the properties of convolution, and it may therefore have the property of local connectivity. That is, a CNN can make use of local information in a manner similar to a receptive field. Here, a receptive field reflects the fact that an external stimulus affects only a specific region rather than the whole, and it may mean the region of the input image that influences one pixel of the final output. A CNN can extract the correlation among spatially adjacent signals by applying nonlinear filters, and applying several such filters makes it possible to extract a variety of local features. As the image size is reduced through subsampling and filter operations on local features are applied repeatedly, global features are gradually obtained. In addition, because a CNN applies a filter with the same coefficients repeatedly over the whole image, it can drastically reduce the number of parameters and obtain invariance to changes in topology. The network structure of a CNN is not limited to the description above.
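Two of the claims above can be checked with simple arithmetic: weight sharing keeps the parameter count small, and stacking convolution and pooling layers enlarges the receptive field. The sketch below assumes a hypothetical 48×48 grayscale input and a hypothetical stack of 3×3 convolutions and 2×2 poolings; neither figure comes from the patent.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus one bias per output channel, shared across positions."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels


def dense_params(in_features, out_features):
    """Weights plus one bias per output unit, no sharing."""
    return (in_features + 1) * out_features


def receptive_field(layers):
    """Input-pixel span seen by one output unit.

    layers: list of (kernel_size, stride) tuples, first layer first.
    """
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each layer widens the view
        jump *= stride                  # strides spread later layers apart
    return rf


# Hypothetical 48x48 single-channel input mapped to 32 feature maps.
shared = conv_params(3, 1, 32)
# A fully connected layer producing the same 48*48*32 outputs.
unshared = dense_params(48 * 48, 48 * 48 * 32)
# conv3x3 -> pool2x2 -> conv3x3 -> pool2x2
rf = receptive_field([(3, 1), (2, 2), (3, 1), (2, 2)])
```

Under these assumptions the convolution needs 320 parameters against roughly 170 million for the dense layer, and one output unit after the four layers sees a 10×10 patch of the input.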

FIG. 1 is a block diagram schematically illustrating an emotion recognition apparatus according to an embodiment of the present disclosure.

Referring to FIG. 1, the emotion recognition apparatus may include a communication interface 100, a camera 200, a sensing unit 300, a processor 400, a memory 500, a display unit 600, and a processing unit 700.

The communication interface 100 may be a communication means that receives an image including a face. The communication interface 100 may receive an image from the camera 200, and may also receive an image from a server (not shown) or through a separate input means. The communication interface 100 may transmit the received image to the processor 400.

The communication interface 100 may work with a network (not shown) to provide the signals exchanged between the emotion recognition apparatus and/or the server in the form of packet data. The communication interface 100 may also transmit an information request signal from the emotion recognition apparatus to the server, or transmit an information request signal from the server to the emotion recognition apparatus. In addition, the communication interface 100 may receive a response signal processed by the server and pass it to the emotion recognition apparatus, or receive a response signal processed by the emotion recognition apparatus and transmit it to the server.

The communication interface 100 may also be a device that includes the hardware and software needed to transmit and receive signals, such as control signals or data signals, over a wired or wireless connection with other network devices.

In addition, the communication interface 100 may support various forms of intelligent communication between things (Internet of Things (IoT), Internet of Everything (IoE), Internet of Small Things (IoST), and so on), and may support machine-to-machine (M2M) communication, vehicle-to-everything (V2X) communication, device-to-device (D2D) communication, and the like.

Meanwhile, the server may be a database server that provides the big data needed to apply various AI algorithms and the data needed to operate the emotion recognition apparatus. The functions performed by the server may vary with the processing capability of the emotion recognition apparatus.

When the server is an AI server, it may include a server that performs AI processing and a server that performs computation on big data. The server may also include a web server or application server that lets a user use the emotion recognition system through an emotion recognition system application installed on a user terminal (not shown) or through a web browser.

Here, after connecting to the emotion recognition system application or the emotion recognition system site, the user terminal may be provided with a service for operating or controlling the emotion recognition system through an authentication process. In the present embodiment, a user terminal that has completed the authentication process may operate and control the emotion recognition system. In the present embodiment, the user terminal may be a desktop computer, smartphone, notebook computer, tablet PC, smart TV, mobile phone, personal digital assistant (PDA), laptop, media player, micro server, global positioning system (GPS) device, e-book reader, digital broadcast terminal, navigation device, kiosk, MP3 player, digital camera, home appliance, or other mobile or non-mobile computing device operated by the user, but is not limited to these. The user terminal may also be a wearable terminal with communication and data processing functions, such as a watch, glasses, a hair band, or a ring. The user terminal is not limited to the above, and any terminal capable of web browsing may be used without limitation.

Meanwhile, the server may be connected to AI devices through the network and may assist with at least part of the AI processing of the connected AI devices. The AI devices may include, for example, not only user terminals but also robots, autonomous vehicles, XR devices, and home appliances, and the emotion recognition system environment may be configured through these AI devices. In this case, the server may train an artificial neural network according to a machine learning algorithm on behalf of an AI device, and may store the trained model itself or transmit it to the AI device.

Here, artificial intelligence (AI) is a field of computer engineering and information technology that studies how to enable computers to perform the thinking, learning, and self-improvement that human intelligence can perform, and it may mean enabling computers to imitate intelligent human behavior.

Artificial intelligence does not exist in isolation. It is related, directly and indirectly, to many other fields of computer science. In recent years in particular, there have been very active attempts to introduce AI elements into various fields of information technology and use them to solve problems in those fields.

Machine learning is a branch of artificial intelligence and may include the field of study that gives computers the ability to learn without being explicitly programmed. Specifically, machine learning can be described as the technology of studying and building systems, and the algorithms for them, that learn from empirical data, make predictions, and improve their own performance. Rather than executing strictly defined static program instructions, machine learning algorithms may build a specific model in order to derive predictions or decisions from input data.

Meanwhile, the network may serve to connect the emotion recognition apparatus and the server. Such a network may encompass, for example, wired networks such as local area networks (LANs), wide area networks (WANs), metropolitan area networks (MANs), and integrated service digital networks (ISDNs), and wireless networks such as wireless LANs, CDMA, Bluetooth, and satellite communication, but the scope of the present invention is not limited to these. The network may also transmit and receive information using short-range communication and/or long-range communication. Here, short-range communication may include Bluetooth, radio frequency identification (RFID), infrared data association (IrDA), ultra-wideband (UWB), ZigBee, and wireless fidelity (Wi-Fi) technologies, and long-range communication may include code division multiple access (CDMA), frequency division multiple access (FDMA), time division multiple access (TDMA), orthogonal frequency division multiple access (OFDMA), and single carrier frequency division multiple access (SC-FDMA) technologies.

The network may also include connections among network elements such as hubs, bridges, routers, switches, and gateways. The network may include one or more connected networks, for example a multi-network environment, including public networks such as the Internet and private networks such as a secure corporate private network. Access to the network may be provided through one or more wired or wireless access networks. Furthermore, the network may support an Internet of Things (IoT) network, in which distributed components such as things exchange and process information, and/or 5G communication.

The camera 200 photographs the surrounding environment and, in the present embodiment in particular, may capture an image that includes a face. The camera 200 is a means for capturing images, and an image captured by the camera 200 may be transmitted to the processor 400. The camera 200 may include at least one optical lens, an image sensor (for example, a CMOS image sensor) comprising a plurality of photodiodes (for example, pixels) on which an image is formed by light passing through the optical lens, and a digital signal processor (DSP) that constructs an image from the signals output by the photodiodes. The digital signal processor may generate still images as well as video composed of frames of still images. An image captured by the camera 200 may be stored in the memory 500.

The sensing unit 300 may include various sensors that sense the surroundings of the emotion recognition device. In the present embodiment, information other than that provided by the camera 200 may be acquired through the sensing unit 300.

In the present embodiment, the sensing unit 300 may include, for example, at least one of a lidar sensor, a weight sensor, an illumination sensor, a touch sensor, an acceleration sensor, a magnetic sensor, a gravity sensor (G-sensor), a gyroscope sensor, a motion sensor, an RGB sensor, an infrared (IR) sensor, a fingerprint recognition sensor (finger scan sensor), an ultrasonic sensor, an optical sensor, a microphone, a battery gauge, an environmental sensor (for example, a barometer, a hygrometer, a thermometer, a radiation sensor, a heat sensor, or a gas sensor), and a chemical sensor (for example, an electronic nose, a healthcare sensor, or a biometric sensor). Meanwhile, in the present embodiment, the emotion recognition device may combine and use information sensed by two or more of these sensors.

The processor 400 may transmit an image received through the camera 200 and/or the communication interface 100 to the processing unit 700. The processor 400 may apply a plurality of pre-trained deep neural network models to extract multi-layer feature points of the image, and may perform weighted majority voting on the multi-layer feature maps based on those feature points to predict the emotion expressed by the face. The processor 400 may then output the emotion prediction result through the display unit 600.

The processor 400 is a kind of central processing unit and may control the overall operation of the emotion recognition device by running control software loaded in the memory 500. The processor 400 may include any type of device capable of processing data. Here, a "processor" may refer to a data processing device embedded in hardware that has a physically structured circuit for performing a function expressed as code or instructions included in a program. Examples of such a data processing device embedded in hardware include a microprocessor, a central processing unit (CPU), a processor core, a multiprocessor, an application-specific integrated circuit (ASIC), and a field programmable gate array (FPGA), but the scope of the present invention is not limited thereto.

In the present embodiment, the processor 400 may perform machine learning, such as deep learning, on emotion recognition so that the emotion recognition device outputs an optimal emotion recognition result, and the memory 500 may store data used for machine learning, result data, and the like.

Deep learning, a type of machine learning, learns from data through multiple stages down to a deep level. Deep learning may refer to a set of machine learning algorithms that extract essential information from a plurality of data items as the stages progress.

The deep learning structure may include an artificial neural network (ANN); for example, it may be composed of a deep neural network (DNN) such as a convolutional neural network (CNN), a recurrent neural network (RNN), or a deep belief network (DBN). The deep learning structure according to the present embodiment may use any of various known structures, including a CNN, an RNN, and a DBN. The RNN is widely used in natural language processing and the like and is effective for processing time-series data that changes over time; it may form an artificial neural network structure by stacking a layer at each time step. The DBN may include a deep learning structure formed by stacking multiple layers of restricted Boltzmann machines (RBMs), a deep learning technique. By repeating RBM training until a certain number of layers is reached, a DBN having that number of layers may be formed. The CNN may include a model that mimics human brain function, built on the assumption that when a person recognizes an object, the brain extracts the basic features of the object, performs complex computations, and recognizes the object based on the result.

Meanwhile, training of the artificial neural network may be performed by adjusting the weights of the connections between nodes (and, if necessary, the bias values) so that a desired output is produced for a given input. The artificial neural network may continuously update the weight values through training, and a method such as back propagation may be used for the training.
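As an illustration only, the weight and bias adjustment described above can be sketched for a single linear node trained by gradient descent. The function name, learning rate, and toy data are assumptions made for this sketch; the source does not specify a training procedure beyond mentioning back propagation.

```python
import numpy as np

def train_single_node(x, y, lr=0.1, epochs=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        err = (w * x + b) - y          # prediction error for every sample
        grad_w = np.mean(err * x)      # gradient of 0.5 * mean(err^2) w.r.t. w
        grad_b = np.mean(err)          # gradient of 0.5 * mean(err^2) w.r.t. b
        w -= lr * grad_w               # adjust the connection weight
        b -= lr * grad_b               # adjust the bias
    return w, b

# Toy data generated from y = 2x + 1 (illustrative values only).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
w, b = train_single_node(x, y)
```

In a multi-layer network the same kind of update would be applied to every weight, with back propagation supplying the gradients layer by layer.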

The memory 500 stores various information required for the operation of the emotion recognition device and may store control software for operating the emotion recognition device; it may include a volatile or non-volatile recording medium.

The memory 500 is connected to one or more processors and may store code that, when executed by a processor, causes the processor to control the emotion recognition device.

Here, the memory 500 may include magnetic storage media or flash storage media, but the scope of the present invention is not limited thereto. The memory 500 may include internal memory and/or external memory, and may include volatile memory such as DRAM, SRAM, or SDRAM; non-volatile memory such as one time programmable ROM (OTPROM), PROM, EPROM, EEPROM, mask ROM, flash ROM, NAND flash memory, or NOR flash memory; a flash drive such as an SSD, a compact flash (CF) card, an SD card, a Micro-SD card, a Mini-SD card, an xD card, or a memory stick; or a storage device such as an HDD.

The display unit 600 is an output means for outputting the emotion recognition result and may include, for example, a display. Under the control of the processor 400, the display may show the emotion recognition process or result of the emotion recognition device. Depending on the embodiment, the display may form a mutual layer structure with a touch pad and be configured as a touch screen. In this case, the display may also serve as an operation unit through which information can be input by a user's touch. To this end, the display may be configured with a touch-sensitive display controller or various other input/output controllers. Such a display may be a display member capable of touch recognition, for example an organic light-emitting diode (OLED) display, a liquid crystal display (LCD), or a light-emitting diode (LED) display.

In addition, the present embodiment may include input means such as a microphone and a display, and may recognize emotion by analyzing the speech characteristics of a speaker whose voice is input through the microphone.

The processing unit 700 may perform learning in conjunction with the processor 400 or may receive a learning result from the processor 400. In the present embodiment, the processing unit 700 may be provided outside the processor 400 as shown in FIG. 1, or may be provided inside the processor 400 and operate like the processor 400. The processing unit 700 may also be implemented in a server. That is, the processing unit 700 may perform processing based on the code stored in the memory 500 so that optimal emotion recognition is carried out. The processing unit 700 is described in detail below with reference to FIG. 2.

FIG. 2 is a block diagram schematically illustrating a processing unit according to an embodiment of the present disclosure. Descriptions that overlap with the description of FIG. 1 are omitted.

Referring to FIG. 2, the processing unit 700 may include a face recognition unit 710, a preprocessing unit 720, a feature point detection unit 730, a multilayer analysis unit 740, and an emotion determination unit 750.

The processing unit 700 may receive an image including a face image and perform emotion recognition. The image may be acquired through the camera 200, or may be received from a user input interface or a separate server.

FIG. 3 is an exemplary diagram of input images according to an embodiment of the present disclosure.

As shown in FIG. 3, in the present embodiment, images of faces with various expressions as well as various images that do not contain a face may be input. Accordingly, the face recognition unit 710 may recognize a face region in the image.

More specifically, the face recognition unit 710 may adjust the number of face regions included in the image. That is, the face recognition unit 710 may extract face candidate regions included in the image and calculate the accuracy of each face candidate region. The face recognition unit 710 may then determine the final face regions based on the accuracy of the face candidate regions, thereby adjusting the number of face regions included in the image.

That is, the face recognition unit 710 may recognize a face region in the image corresponding to the image signal acquired through the camera 200. In doing so, the face recognition unit 710 may first extract candidate regions judged to be faces and then quantify the accuracy of the extracted candidate regions based on a face recognition algorithm.
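The candidate selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `FaceCandidate` type, the `min_score` threshold of 0.8, and the optional `max_faces` cap are hypothetical and do not appear in the source, which only states that final face regions are chosen based on the accuracy of the candidates.

```python
from dataclasses import dataclass

@dataclass
class FaceCandidate:
    box: tuple    # (x, y, width, height) in pixels
    score: float  # quantified accuracy of the candidate, assumed to lie in [0, 1]

def select_final_faces(candidates, min_score=0.8, max_faces=None):
    """Keep candidates whose accuracy reaches min_score, best first."""
    kept = [c for c in candidates if c.score >= min_score]
    kept.sort(key=lambda c: c.score, reverse=True)
    # Optionally cap the number of face regions passed to later stages.
    return kept if max_faces is None else kept[:max_faces]

candidates = [
    FaceCandidate((10, 10, 40, 40), 0.95),
    FaceCandidate((80, 20, 30, 30), 0.40),
    FaceCandidate((60, 70, 35, 35), 0.85),
]
final_faces = select_final_faces(candidates)
```

Here the low-scoring candidate is discarded, so the number of face regions drops from three to two.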

For example, the face recognition unit 710 may detect skin color using an HR ratio, which is less sensitive to illumination changes and can detect a variety of skin colors, segment the image into regions by labeling, and then determine a region of arbitrary size as a candidate face region. The face recognition unit 710 may also detect the eyes and mouth in the candidate face region using color information and geometric position information of the facial features, and thereby detect the final face region. However, the face recognition method is not limited to the above, and various face recognition algorithms may be applied.

Meanwhile, the face recognition unit 710 may also obtain a three-dimensional face profile using two-dimensional feature points of at least one image among a plurality of similar images captured in succession.

The preprocessing unit 720 may preprocess the recognized face region. That is, the preprocessing unit 720 may preprocess the face region by adjusting at least one of the size, position, color, brightness, and orientation of the finally determined face region.

The preprocessing unit 720 may perform preprocessing so that feature points of the image can be detected easily. The preprocessing unit 720 may automatically adjust the size of the image to match the input data size used for feature point detection. For example, the preprocessing unit 720 may extract the portion of the image corresponding to the region of interest (the face region) and make it square. The preprocessing unit 720 may then change the image to a size suitable for feature extraction and computation. The preprocessing unit 720 may use interpolation to reduce or enlarge the image to an appropriate range, and in some cases the output may be identical to the input. The preprocessing unit 720 may also subtract a constant value from the R, G, and B pixel values of the image and then divide the result so that the brightness values of all pixels fall within a preset range. In addition to the preprocessing methods described above, the preprocessing unit 720 may perform preprocessing that changes the values of preset categories in the image to a preset range.
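The square cropping, interpolation-based resizing, and subtract-then-divide normalization described above can be sketched with NumPy. The helper names, the nearest-neighbor interpolation, the 48-pixel target size, and the constants 127.5 are assumptions chosen for illustration; the source does not fix the interpolation method, the target size, or the normalization constants.

```python
import numpy as np

def crop_square(image, box):
    """Crop a square region around a face box (x, y, w, h), kept inside the image."""
    x, y, w, h = box
    img_h, img_w = image.shape[:2]
    side = min(max(w, h), img_h, img_w)
    cx, cy = x + w // 2, y + h // 2
    x0 = min(max(cx - side // 2, 0), img_w - side)
    y0 = min(max(cy - side // 2, 0), img_h - side)
    return image[y0:y0 + side, x0:x0 + side]

def resize_nearest(image, size):
    """Resize to size x size using nearest-neighbor interpolation."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]

def normalize(image, mean=127.5, scale=127.5):
    """Subtract a constant and divide so that 0..255 maps to -1..1."""
    return (image.astype(np.float32) - mean) / scale

frame = np.zeros((100, 120, 3), dtype=np.uint8)   # placeholder RGB frame
face = crop_square(frame, (30, 20, 40, 50))       # 50 x 50 square crop
net_input = normalize(resize_nearest(face, 48))   # 48 x 48 x 3 float array
```

When the requested size equals the crop size, `resize_nearest` returns the pixels unchanged, which corresponds to the case in which the output is identical to the input.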

In particular, in the present embodiment, the preprocessing unit 720 may convert the image into a grayscale image. The preprocessing unit 720 may then convert the grayscale image into a matrix of arbitrary dimensions whose elements take values in an integer domain (for example, 0 to 255).

That is, the preprocessing unit 720 may perform preprocessing for feature point detection so that the emotion in the image can be recognized more accurately. To this end, the preprocessing unit 720 may convert the image into a grayscale image in order to find its "bright values" and "dark values". The preprocessing unit 720 may then convert the grayscale image into a matrix so that an algorithm for feature point detection can be run on it.
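A minimal sketch of the grayscale conversion into an integer matrix follows. The ITU-R BT.601 luma weights used here are a common convention and an assumption of this sketch; the source does not state which conversion formula is used.

```python
import numpy as np

def to_grayscale_matrix(rgb):
    """Convert an H x W x 3 uint8 RGB image to an H x W uint8 matrix (0..255)."""
    weights = np.array([0.299, 0.587, 0.114])   # assumed BT.601 luma weights
    gray = rgb.astype(np.float64) @ weights     # weighted sum over the color axis
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)

rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[0, 0] = (255, 255, 255)   # a bright pixel
rgb[1, 1] = (255, 0, 0)       # a pure red pixel
gray = to_grayscale_matrix(rgb)
```

Bright and dark areas of the face then correspond to large and small entries of the resulting matrix, which is the form a feature point detection algorithm can operate on.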

FIG. 4 is an exemplary diagram schematically illustrating a multi-level feature extraction process according to an embodiment of the present disclosure, and FIG. 5 is an exemplary diagram for explaining the multi-level feature extraction process according to an embodiment of the present disclosure.

Referring to FIGS. 4 and 5, the feature point detection unit 730 may output the emotion shown in the image by applying a plurality of deep neural network models trained in advance to estimate the emotion shown on a face. That is, the feature point detection unit 730 may detect feature points of the face region in the image and detect feature regions for homogeneous regions. The feature point detection unit 730 may then predict the emotion shown on the face based on the feature regions.

A facial landmark is a point marked on a characteristic part of the face, such as the eyes, nose, mouth, or ears. Based on the facial landmarks, the feature point detection unit 730 may detect and track the positions of the eyebrows, eyes, nose, mouth, chin, ears, and the like in the image. The feature point detection unit 730 may also detect the facial outline, the pupils, and the shapes of the eyes, nose, mouth, forehead, cheekbones, and chin. Furthermore, in the present embodiment, a facial expression may be detected by calculating face ratio data including the width-to-height ratio of the face, the eye size, the mouth size, and the positional ratio among the forehead, eyebrows, nose tip, and chin tip, so that emotion can be recognized from the expression.
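As an illustration, face ratio data of the kind listed above could be computed from landmark coordinates as follows. The landmark names, the choice of normalizing by face width and height, and the sample coordinates are all hypothetical; the source does not define a landmark scheme or the exact ratios.

```python
import math

def face_ratios(lm):
    """Compute simple ratio features from a dict of (x, y) landmark coordinates.

    The landmark keys used here are hypothetical names chosen for this sketch.
    """
    def dist(a, b):
        return math.dist(lm[a], lm[b])

    face_w = dist("face_left", "face_right")
    face_h = dist("forehead", "chin")
    return {
        "width_to_height": face_w / face_h,
        "eye_width": dist("eye_outer", "eye_inner") / face_w,
        "mouth_width": dist("mouth_left", "mouth_right") / face_w,
        # Vertical positions relative to the forehead-to-chin span.
        "eyebrow_pos": (lm["eyebrow"][1] - lm["forehead"][1]) / face_h,
        "nose_tip_pos": (lm["nose_tip"][1] - lm["forehead"][1]) / face_h,
    }

landmarks = {
    "face_left": (0, 50), "face_right": (100, 50),
    "forehead": (50, 0), "chin": (50, 200),
    "eye_outer": (20, 60), "eye_inner": (40, 60),
    "mouth_left": (30, 150), "mouth_right": (70, 150),
    "eyebrow": (50, 40), "nose_tip": (50, 100),
}
ratios = face_ratios(landmarks)
```

Because every feature is a ratio, the values do not change when the face appears larger or smaller in the image.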

Here, the plurality of pre-trained deep neural network models may include a first-type neural network model group and a second-type neural network model group. The first-type neural network model group may include a first-type neural network model that analyzes the input image as a whole and outputs an emotion, a first-type low-level neural network model that extracts low-level features of the input image and outputs an emotion, a first-type mid-level neural network model that extracts mid-level features of the input image and outputs an emotion, and a first-type high-level neural network model that extracts high-level features of the input image and outputs an emotion.

Likewise, the second-type neural network model group may include a second-type neural network model that analyzes the input image as a whole and outputs an emotion, a second-type low-level neural network model that extracts low-level features of the input image and outputs an emotion, a second-type mid-level neural network model that extracts mid-level features of the input image and outputs an emotion, and a second-type high-level neural network model that extracts high-level features of the input image and outputs an emotion. Here, the first-type neural network model and the second-type neural network model are the original neural network models and may analyze the entire region of the input image. The other, level-specific neural network models may analyze a region whose size corresponds to their level.

That is, the first-type low-level neural network model may be configured to determine the emotion expressed by the face based on a feature region within a window of a first size on the face of the input image, the first-type mid-level neural network model may be configured to determine the emotion based on a feature region within a window of a second size, and the first-type high-level neural network model may be configured to determine the emotion based on a feature region within a window of a third size.

Similarly, the second-type low-level neural network model may be configured to determine the emotion expressed by the face based on a feature region within a window of the first size on the face of the input image, the second-type mid-level neural network model may be configured to determine the emotion based on a feature region within a window of the second size, and the second-type high-level neural network model may be configured to determine the emotion based on a feature region within a window of the third size. Here, the window of the first size may be smaller than the window of the second size, and the window of the third size may be larger than the window of the second size.

In the present embodiment, the first-type neural network model may be ResNet and the second-type neural network model may be VGGNet, although the models are not limited to these.

That is, in the present embodiment, a plurality of deep neural network models with different network structures are applied, and emotions based on the low level, the mid level, and the high level are extracted from each deep neural network model. Even among CNN-based deep neural network models, different network structures can produce different emotion recognition results for the same image. Therefore, rather than using a single deep neural network model, the present embodiment applies two or more deep neural network models in combination to enable more accurate emotion recognition.

In addition, as shown in FIG. 5, the window of the first size may be sized to contain part of the lines of the facial features, the window of the second size may be sized to contain each individual facial feature, and the window of the third size may be sized to contain the entire face. For example, the first size may cover a region, such as the end of an eyebrow, the corner of an eye, or the corner of the lips, in which emotion can be analyzed from the direction of the lines of the facial features. Thus, for a face in which the ends of the eyebrows, the corners of the eyes, or the corners of the lips droop, an emotion such as sadness may be inferred.
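The three window sizes can be illustrated by cropping nested regions around a point of interest in the preprocessed grayscale matrix. The pixel sizes 8, 16, and 48 and the function names are assumptions for this sketch; the source only requires that the first window be smaller than the second and the third larger than the second.

```python
import numpy as np

def crop_window(gray, center, size):
    """Return a size x size window around center (row, col), kept inside the image."""
    h, w = gray.shape
    size = min(size, h, w)
    r0 = min(max(center[0] - size // 2, 0), h - size)
    c0 = min(max(center[1] - size // 2, 0), w - size)
    return gray[r0:r0 + size, c0:c0 + size]

def multi_level_windows(gray, center, sizes=(8, 16, 48)):
    """Crop low-, mid-, and high-level windows (assumed sizes) around one point."""
    low, mid, high = (crop_window(gray, center, s) for s in sizes)
    return {"low": low, "mid": mid, "high": high}

face = np.arange(48 * 48).reshape(48, 48)        # stand-in for a 48 x 48 face matrix
windows = multi_level_windows(face, center=(10, 40))
```

With a 48 x 48 face matrix, the low-level window covers a small patch such as an eyebrow end, the mid-level window a whole feature, and the high-level window the entire face.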

The multilayer analysis unit 740 may assign weights for weighted majority voting to the output values of the respective neural network models. For example, the multilayer analysis unit 740 may assign a second weight to the output value of the first-type neural network model, a fourth weight to the output value of the first-type low-level neural network model, a third weight to the output value of the first-type mid-level neural network model, and a first weight to the output value of the first-type high-level neural network model.

Likewise, the second weight may be assigned to the output value of the second-type neural network model, the fourth weight to the output value of the second-type low-level neural network model, the third weight to the output value of the second-type mid-level neural network model, and the first weight to the output value of the second-type high-level neural network model.

Here, the first weight is the largest and the fourth weight is the smallest; the second weight is smaller than the first weight and larger than the fourth weight, and the third weight is smaller than the second weight and larger than the fourth weight. In other words, the weights decrease in order from the first to the fourth. These weights are not fixed and may vary depending on the settings.

In other words, because the accuracy of the output values may differ between the original neural network model and each of its levels, the multilayer analysis unit 740 may perform the multilayer analysis level by level and combine the results.

In the present embodiment, different weights may also be set for each of the deep neural network models having different network structures. For example, a higher weight may be given to a deep neural network that has more layers or is capable of higher-level analysis.

The emotion determination unit 750 may determine one of a set of predefined emotions based on the result of the weighted majority voting. The predefined emotions may be classified as anger, happiness, surprise, disgust, sadness, fear, and neutral. In the present embodiment, the determined emotion may then be output through the display unit 600.
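The weighted majority voting over the eight model outputs can be sketched as follows. The numeric weights 4, 3, 2, and 1 are hypothetical values chosen only to respect the ordering first > second > third > fourth, and the sample predictions are invented for illustration; the source does not give concrete weights or a tie-breaking rule.

```python
from collections import defaultdict

EMOTIONS = ("Anger", "Happiness", "Surprise", "Disgust", "Sadness", "Fear", "Neutral")

# Assumed weights: high-level gets the first (largest) weight, the whole-image
# model the second, mid-level the third, and low-level the fourth (smallest).
LEVEL_WEIGHTS = {"high": 4.0, "whole": 3.0, "mid": 2.0, "low": 1.0}

def weighted_majority_vote(predictions, level_weights=LEVEL_WEIGHTS):
    """Pick the emotion with the largest summed weight.

    predictions is an iterable of (model_type, level, emotion) tuples.
    Ties are broken by the order of EMOTIONS (an assumption of this sketch).
    """
    totals = defaultdict(float)
    for _model_type, level, emotion in predictions:
        if emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion: {emotion}")
        totals[emotion] += level_weights[level]
    return max(EMOTIONS, key=lambda e: totals.get(e, 0.0))

predictions = [
    ("ResNet", "whole", "Happiness"), ("ResNet", "low", "Sadness"),
    ("ResNet", "mid", "Sadness"), ("ResNet", "high", "Happiness"),
    ("VGGNet", "whole", "Neutral"), ("VGGNet", "low", "Sadness"),
    ("VGGNet", "mid", "Sadness"), ("VGGNet", "high", "Happiness"),
]
result = weighted_majority_vote(predictions)  # "Happiness" (11 vs. 6 for Sadness)
```

In this example sadness receives four of the eight votes, but happiness wins because its three votes come from the more heavily weighted whole-image and high-level models. A per-architecture weight, as mentioned above, could be added by multiplying each vote by a factor looked up from the model type.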

FIG. 6 is a flowchart illustrating an emotion recognition method according to an embodiment of the present disclosure. Descriptions that overlap with the description of FIGS. 1 to 5 are omitted.

Referring to FIG. 6, in step S100, the processor 400 acquires an image including a face image. The processor 400 may acquire the image through the camera 200, or may receive it from a user input interface or a separate server.

The processor 400 may adjust the number of face regions included in the acquired image. That is, the processor 400 may extract face candidate regions included in the image and calculate the accuracy of each face candidate region. The processor 400 may then determine the final face regions based on the accuracy of the face candidate regions, thereby adjusting the number of face regions included in the image.

That is, the processor 400 may recognize a face region in the image corresponding to the image signal acquired through the camera 200. In doing so, the processor 400 may first extract candidate regions judged to be faces and then quantify the accuracy of the extracted candidate regions based on a face recognition algorithm.

In step S200, the processor 400 preprocesses the image. The preprocessor 720 may preprocess the recognized face region. That is, the processor 400 may preprocess the face region by adjusting at least one of the size, position, color, brightness, and orientation of the finally determined face region.

The processor 400 may perform preprocessing so that feature points of the image can be detected easily. The processor 400 may convert the image into a grayscale image, and may then convert the grayscale image into a matrix of arbitrary dimension whose entries take values in an integer domain (for example, 0 to 255). That is, the processor 400 may perform preprocessing for feature point detection in order to recognize the emotion in the image more accurately. The conversion to grayscale allows the processor 400 to find "bright values" and "dark values" in the image, and the conversion to a matrix allows a feature point detection algorithm to be run on the result.
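
A minimal sketch of this preprocessing step, written in plain Python, is given below for illustration. The function name `to_grayscale_matrix` and the luma weights (0.299, 0.587, 0.114, as in ITU-R BT.601) are assumptions; the disclosure only states that the image is converted to grayscale and then to an integer-valued matrix.

```python
def to_grayscale_matrix(rgb_image):
    """Convert rows of (R, G, B) tuples into a matrix of integers in 0..255."""
    gray = []
    for row in rgb_image:
        gray_row = []
        for r, g, b in row:
            # Assumed luma weights (ITU-R BT.601); the source names no formula.
            value = int(round(0.299 * r + 0.587 * g + 0.114 * b))
            # Clamp so every entry stays inside the integer domain 0..255.
            gray_row.append(max(0, min(255, value)))
        gray.append(gray_row)
    return gray


# Tiny 2x2 example image: white, black, pure red, pure blue.
rgb_image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 0, 255)],
]
gray = to_grayscale_matrix(rgb_image)

# The brightest and darkest entries are the "bright" and "dark" values.
brightest = max(max(row) for row in gray)
darkest = min(min(row) for row in gray)
print(gray, brightest, darkest)
```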

In step S300, the processor 400 applies a plurality of deep neural network models, each of which extracts multi-layer feature points of the image and outputs an emotion. The processor 400 may output the emotion shown in the image by applying a plurality of deep neural network models trained in advance to estimate the emotion shown in a face. That is, the processor 400 may detect feature points of the face region in the image and detect feature regions made up of homogeneous regions. The processor 400 may then predict the emotion shown in the face based on the feature regions.

In addition, the second type of neural network model group may include a second-type neural network model that analyzes the input image as a whole and outputs an emotion, a second-type low-level neural network model that extracts low-level features of the input image and outputs an emotion, a second-type mid-level neural network model that extracts mid-level features of the input image and outputs an emotion, and a second-type high-level neural network model that extracts high-level features of the input image and outputs an emotion. Here, the first-type neural network model and the second-type neural network model are the original neural network models and may analyze the entire region of the input image. The level-specific neural network models may each analyze a region whose size corresponds to their level.

In addition, the window of the first size may be sized to include part of the outline of a facial feature, the window of the second size may be sized to include an individual facial feature, and the window of the third size may be sized to include the entire face.
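
For illustration only, cropping three nested window sizes from a preprocessed face matrix may be sketched as follows. The 8x8 matrix, the concrete sizes 2, 4, and 8, and the `crop_window` helper are hypothetical; the disclosure fixes only the ordering (first size smaller than second, second smaller than third) and what each window is meant to contain.

```python
def crop_window(matrix, top, left, size):
    """Return the size x size sub-matrix whose top-left corner is (top, left)."""
    height, width = len(matrix), len(matrix[0])
    if top < 0 or left < 0 or top + size > height or left + size > width:
        raise ValueError("window does not fit inside the image")
    return [row[left:left + size] for row in matrix[top:top + size]]


# Hypothetical 8x8 grayscale face region with values 0..63.
face = [[r * 8 + c for c in range(8)] for r in range(8)]

# Assumed sizes: first < second < third, with the third covering the whole face.
WINDOW_SIZES = {"low": 2, "mid": 4, "high": 8}

# One window per level, all anchored at the top-left corner for simplicity.
windows = {
    level: crop_window(face, 0, 0, size)
    for level, size in WINDOW_SIZES.items()
}
print({level: (len(w), len(w[0])) for level, w in windows.items()})
```

In practice each level-specific model would see many such windows at different positions; a single position is used here to keep the sketch short.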

In step S400, the processor 400 predicts the emotion by performing weighted majority voting. That is, the processor 400 may assign weighted-majority-vote-based weights to the output value of each neural network model. For example, the processor 400 may assign a weight to each of the output value of the first-type neural network model, the output value of the first-type low-level neural network model, the output value of the first-type mid-level neural network model, and the output value of the first-type high-level neural network model.

Likewise, a weight may be assigned to each of the output value of the second-type neural network model, the output value of the second-type low-level neural network model, the output value of the second-type mid-level neural network model, and the output value of the second-type high-level neural network model.

That is, because accuracy may differ among the output values of the original neural network model and of each level of that model, the processor 400 may perform the multi-layer analysis level by level and combine the results.

The processor 400 may then determine one emotion among the preset emotions based on the result of the weighted majority vote. The preset emotions may be classified into Anger, Happiness, Surprise, Disgust, Sadness, Fear, and Neutral. In the present embodiment, the determined emotion may be output through the display unit 600.
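
By way of example, weighted majority voting over the eight model outputs (four per model type) may be sketched as below. The model names, the example predictions, and the weight values are hypothetical; the disclosure does not state how the weights are obtained, and a per-model validation accuracy is merely one plausible choice. The tie-breaking rule is likewise an arbitrary choice for this sketch.

```python
from collections import defaultdict

EMOTIONS = ("Anger", "Happiness", "Surprise", "Disgust",
            "Sadness", "Fear", "Neutral")


def weighted_majority_vote(predictions, weights):
    """Return the emotion with the largest summed weight, plus the tallies."""
    tally = defaultdict(float)
    for model_name, emotion in predictions.items():
        if emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion: {emotion}")
        # Each model casts one vote, scaled by that model's weight.
        tally[emotion] += weights[model_name]
    # Ties are broken by the order of EMOTIONS (arbitrary for this sketch).
    winner = max(EMOTIONS, key=lambda e: tally.get(e, 0.0))
    return winner, dict(tally)


# Hypothetical outputs of the first-type and second-type model groups.
predictions = {
    "type1_full": "Happiness", "type1_low": "Surprise",
    "type1_mid": "Happiness", "type1_high": "Happiness",
    "type2_full": "Surprise", "type2_low": "Surprise",
    "type2_mid": "Neutral", "type2_high": "Happiness",
}
# Hypothetical weights, for instance each model's validation accuracy.
weights = {
    "type1_full": 0.90, "type1_low": 0.50, "type1_mid": 0.70, "type1_high": 0.80,
    "type2_full": 0.85, "type2_low": 0.50, "type2_mid": 0.60, "type2_high": 0.75,
}

winner, tally = weighted_majority_vote(predictions, weights)
print(winner, tally)
```

Unlike a plain majority vote, a single highly weighted model can outvote several weakly weighted ones, which is the reason for weighting outputs whose accuracy differs by level.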

The embodiments of the present invention described above may be implemented in the form of a computer program executable through various components on a computer, and such a computer program may be recorded on a computer-readable medium. The medium may include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Meanwhile, the computer program may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software. Examples of the computer program include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

In the specification of the present invention (particularly in the claims), the term "the" and similar referential terms may refer to both the singular and the plural. In addition, where a range is described in the present invention, the invention includes inventions to which the individual values within that range are applied (unless stated otherwise), which is equivalent to reciting each individual value of the range in the detailed description.

Unless an order is explicitly stated for the steps constituting the method according to the present invention, or there is a statement to the contrary, the steps may be performed in any suitable order. The present invention is not necessarily limited to the order in which the steps are described. The use of any example or exemplary term (for example, "etc.") in the present invention is merely for describing the invention in detail, and the scope of the invention is not limited by such examples or exemplary terms unless limited by the claims. In addition, those skilled in the art will appreciate that various modifications, combinations, and changes may be made according to design conditions and factors within the scope of the appended claims or their equivalents.

Therefore, the spirit of the present invention should not be limited to the embodiments described above, and not only the claims set forth below but also all scopes equivalent to, or equivalently modified from, those claims fall within the scope of the spirit of the present invention.

100: communication interface
200: camera
300: sensing unit
400: processor
500: memory
600: display unit
700: processing unit
710: face recognition unit
720: preprocessor
730: feature point detection unit
740: multi-layer analysis unit
750: emotion determination unit

Claims

A method for emotion recognition, comprising:
acquiring an image including a face;
preprocessing the image;
outputting an emotion shown in the image by applying a plurality of deep neural network models trained in advance to estimate the emotion shown in the face; and
determining the emotion shown by the face by performing weighted majority voting on output values from the plurality of deep neural network models.

The method of claim 1,
wherein acquiring the image comprises:
extracting a face candidate region included in the image;
calculating an accuracy of the face candidate region; and
determining a final face region based on the accuracy of the face candidate region.

The method of claim 1,
wherein preprocessing the image comprises:
converting the image into a grayscale image; and
converting the grayscale image into a matrix of arbitrary dimension taking values in an integer domain.

The method of claim 1,
wherein the plurality of deep neural network models trained in advance include a first type of neural network model group and a second type of neural network model group,
the first type of neural network model group includes a first-type neural network model that analyzes the input image as a whole and outputs an emotion, a first-type low-level neural network model that extracts low-level features of the input image and outputs an emotion, a first-type mid-level neural network model that extracts mid-level features of the input image and outputs an emotion, and a first-type high-level neural network model that extracts high-level features of the input image and outputs an emotion, and
the second type of neural network model group includes a second-type neural network model that analyzes the input image as a whole and outputs an emotion, a second-type low-level neural network model that extracts low-level features of the input image and outputs an emotion, a second-type mid-level neural network model that extracts mid-level features of the input image and outputs an emotion, and a second-type high-level neural network model that extracts high-level features of the input image and outputs an emotion.

The method of claim 4,
wherein the first-type low-level neural network model is configured to determine the emotion shown by the face based on a feature region within a window of a first size in the face of the input image, the first-type mid-level neural network model is configured to determine the emotion shown by the face based on a feature region within a window of a second size in the face of the input image, and the first-type high-level neural network model is configured to determine the emotion shown by the face based on a feature region within a window of a third size in the face of the input image,
the second-type low-level neural network model is configured to determine the emotion shown by the face based on a feature region within a window of the first size in the face of the input image, the second-type mid-level neural network model is configured to determine the emotion shown by the face based on a feature region within a window of the second size in the face of the input image, and the second-type high-level neural network model is configured to determine the emotion shown by the face based on a feature region within a window of the third size in the face of the input image, and
the window of the first size is smaller than the window of the second size, and the window of the third size is larger than the window of the second size.

The method of claim 4,
wherein the first type of neural network model is ResNet, and
the second type of neural network model is VGGNet.

The method of claim 5,
wherein the window of the first size is sized to include part of the outline of a facial feature,
the window of the second size is sized to include an individual facial feature, and
the window of the third size is sized to include the entire face.

The method of claim 4,
wherein determining the emotion comprises
determining one of preset emotions by assigning weighted-majority-vote-based weights to the output value from each neural network model.

The method of claim 8,
wherein the preset emotions are classified into Anger, Happiness, Surprise, Disgust, Sadness, Fear, and Neutral.

An emotion recognition device, comprising:
a communication interface configured to receive an image including a face;
one or more processors; and
a memory coupled to the one or more processors,
wherein the memory stores code that, when executed by the processor, causes the processor to:
preprocess the image;
output an emotion shown in the image by applying a plurality of deep neural network models trained in advance to estimate the emotion shown in the face; and
determine the emotion shown by the face by performing weighted majority voting on output values from the plurality of deep neural network models.

The device of claim 10,
wherein the memory stores code that, when executed by the processor, causes the processor to:
extract a face candidate region included in the image, calculate an accuracy of the face candidate region, and determine a final face region based on the accuracy of the face candidate region.

The device of claim 10,
wherein the memory stores code that, when executed by the processor, causes the processor to:
preprocess the image by converting the image into a grayscale image and converting the grayscale image into a matrix of arbitrary dimension taking values in an integer domain.

The device of claim 10,
wherein the plurality of deep neural network models trained in advance include a first type of neural network model group and a second type of neural network model group,
the first type of neural network model group includes a first-type neural network model that analyzes the input image as a whole and outputs an emotion, a first-type low-level neural network model that extracts low-level features of the input image and outputs an emotion, a first-type mid-level neural network model that extracts mid-level features of the input image and outputs an emotion, and a first-type high-level neural network model that extracts high-level features of the input image and outputs an emotion, and
the second type of neural network model group includes a second-type neural network model that analyzes the input image as a whole and outputs an emotion, a second-type low-level neural network model that extracts low-level features of the input image and outputs an emotion, a second-type mid-level neural network model that extracts mid-level features of the input image and outputs an emotion, and a second-type high-level neural network model that extracts high-level features of the input image and outputs an emotion.

The device of claim 13,
wherein the first-type low-level neural network model is configured to determine the emotion shown by the face based on a feature region within a window of a first size in the face of the input image, the first-type mid-level neural network model is configured to determine the emotion shown by the face based on a feature region within a window of a second size in the face of the input image, and the first-type high-level neural network model is configured to determine the emotion shown by the face based on a feature region within a window of a third size in the face of the input image,
the second-type low-level neural network model is configured to determine the emotion shown by the face based on a feature region within a window of the first size in the face of the input image, the second-type mid-level neural network model is configured to determine the emotion shown by the face based on a feature region within a window of the second size in the face of the input image, and the second-type high-level neural network model is configured to determine the emotion shown by the face based on a feature region within a window of the third size in the face of the input image, and
the window of the first size is smaller than the window of the second size, and the window of the third size is larger than the window of the second size.

The device of claim 13,
wherein the first type of neural network model is ResNet, and
the second type of neural network model is VGGNet.

The device of claim 14,
wherein the window of the first size is sized to include part of the outline of a facial feature,
the window of the second size is sized to include an individual facial feature, and
the window of the third size is sized to include the entire face.

The device of claim 13,
wherein the memory stores code that, when executed by the processor, causes the processor to:
determine the emotion by determining one of preset emotions by assigning weighted-majority-vote-based weights to the output value from each neural network model.

The device of claim 17,
wherein the preset emotions are classified into Anger, Happiness, Surprise, Disgust, Sadness, Fear, and Neutral.