KR102503885B1

KR102503885B1 - Apparatus and method for predicting human depression level using multi-layer bi-lstm with spatial and dynamic information of video frames

Info

Publication number: KR102503885B1
Application number: KR1020200106833A
Authority: KR
Inventors: 이영구; 무하마드 아제르 우딘
Original assignee: 경희대학교 산학협력단
Priority date: 2019-11-28
Filing date: 2020-08-25
Publication date: 2023-02-27
Also published as: KR20210066697A

Abstract

An apparatus for predicting a human depression level by analyzing subtle facial expressions according to an embodiment of the present invention includes: a data storage unit that stores video data; a spatial information generator that generates spatial information from the video data; a VLDN feature map generator that extracts three consecutive frames from the video data, generates a volume local directional number (VLDN) feature map for analyzing facial dynamics based on the consecutive frames, and inputs the feature map to a deep convolutional neural network (CNN) model to generate dynamic information about facial movements; an information processing unit that generates output values from the spatial information and the dynamic information through a temporal median pooling (TMP) method; and a prediction unit that predicts the human depression level from the output values using a recurrent neural network.

Description

APPARATUS AND METHOD FOR PREDICTING HUMAN DEPRESSION LEVEL USING MULTI-LAYER BI-LSTM WITH SPATIAL AND DYNAMIC INFORMATION OF VIDEO FRAMES

The present invention relates to an apparatus and method for predicting a human depression level by analyzing video data. More specifically, it relates to an apparatus and method for predicting a human depression level using a multi-layer Bi-LSTM that takes into account both the spatial information and the dynamic information of the video data.

Psychiatric analysis of human mental health has recently been increasing across society. The most widely known mental health disorder is Major Depressive Disorder (MDD), commonly called depression. Depression is a mental illness that negatively affects a patient's life as a whole, including family life, work, eating habits, and sleeping habits, and it also has adverse effects on society. If the onset of depression can be detected early, this helps address the problem of depression at both the personal and the social level.

Conventionally, tests for depression have relied on evaluation by a psychiatric expert, who determines the presence and level of depression through face-to-face consultation with the patient. However, expert assessment of depression is labor-intensive and depends heavily on the expert's subjective perception.

Therefore, if the presence and level of a patient's depression can be checked through a simpler method, many problems caused by depression can be identified and resolved at an early stage. Accordingly, various studies have recently been conducted on predicting the level of depression by analyzing video recordings of human facial expressions.

Korean Patent Laid-Open Publication No. 10-2020-0061016 (title of invention: Method for measuring and diagnosing depression index using facial skin images; publication date: June 2, 2020)

The present invention is intended to solve the problems described above, and an object of the present invention is to provide an apparatus capable of predicting whether a patient has depression, and at what level, based on video data that includes the patient's face.

An object of the present invention is to provide an apparatus and method for predicting a human depression level through deep learning analysis by extracting spatial features of video data and temporal features from video frames.

An object of the present invention is to provide a depression level prediction apparatus that, through the TMP method, uses the median values of the spatial information and the dynamic information as inputs to the deep learning analysis, thereby addressing the noise problem that depends on the length of the input sequence.

An object of the present invention is to provide a depression level prediction apparatus that offers a more accurate algorithmic model for depression level prediction by using a Bi-LSTM model composed of two layers.

According to an embodiment of the present invention, spatio-temporal features in video frames that include facial expressions are analyzed with a deep learning technique, and the human depression level can be predicted from that analysis.

According to an embodiment of the present invention, the deep learning technique uses a multi-layer LSTM to predict the human depression level effectively in terms of performance.

According to an embodiment of the present invention, predicting the level of depression from facial expressions makes it possible to provide appropriate drug treatment and psychological treatment.

According to an embodiment of the present invention, the technique can also be applied to computer-based human emotion recognition and, in the future, to a wide range of fields related to human psychology.

FIG. 1 is a block diagram illustrating an apparatus for predicting a human depression level by analyzing subtle facial expressions according to an embodiment of the present invention.
FIG. 2 is a detailed diagram illustrating an apparatus for predicting a human depression level by analyzing subtle facial expressions according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the detailed configuration of the VLDN feature map generator.
FIG. 4 is an example of a VLDN feature map according to an embodiment of the present invention.
FIG. 5 is an example of the CNN model used to produce the dynamic information generated by the VLDN feature map generator according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a method of predicting a human depression level by analyzing subtle facial expressions according to an embodiment of the present invention.
FIG. 7 is a diagram showing the results of experiments on the AVEC2013 and AVEC2014 data sets.
FIG. 8 is a diagram showing the results of an analysis on the AVEC2013 and AVEC2014 data sets that compares, when generating spatial information, spatial feature extraction from the entire face image with spatial feature extraction from pieces obtained by dividing the image into an arbitrary number of parts.
FIG. 9 is a diagram showing the results of an experiment on the AVEC2013 and AVEC2014 data sets that compares a model analyzing only dynamic information with other models.
FIG. 10 is a diagram showing experimental results for the TMP method on the AVEC2013 and AVEC2014 data sets.

The above objects, features, and advantages are described in detail below with reference to the accompanying drawings, so that a person of ordinary skill in the art to which the present invention pertains can readily practice its technical idea. In describing the present invention, a detailed description of related known technology is omitted where it could unnecessarily obscure the subject matter of the invention.

In the drawings, the same reference numerals denote the same or similar elements, and all combinations described in the specification and claims may be combined in any manner. Unless otherwise specified, a reference to the singular may include one or more, and a reference to a singular expression may also include the plural expression.

The terms used herein are only for the purpose of describing specific exemplary embodiments and are not intended to be limiting. Singular expressions as used herein may also be intended to include the plural unless the context clearly indicates otherwise. The term "and/or" includes any one of, and all combinations of, the associated listed items. The terms "comprises," "comprising," "including," "having," and the like are inclusive: they specify the presence of the stated features, integers, steps, operations, elements, and/or components, and do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The steps, processes, and operations of the methods described herein should not be construed as necessarily requiring performance in the particular order discussed or illustrated, unless an order of performance is specifically identified. It should also be understood that additional or alternative steps may be used.

In addition, each component may be implemented as its own hardware processor, the components may be integrated and implemented as a single hardware processor, or the components may be combined with one another and implemented as a plurality of hardware processors.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram illustrating an apparatus for predicting a human depression level by analyzing subtle facial expressions according to an embodiment of the present invention.

The depression level prediction apparatus 100 includes a data storage unit 110 that stores video data including facial expressions, a spatial information generator 120 that generates spatial information by extracting spatial features from the video data, a VLDN feature map generator 130 that generates a VLDN feature map corresponding to dynamic information from the video data, an information processing unit 150 that processes the spatial information and the dynamic information through the TMP method, and a level prediction unit 160 that predicts the depression level using a multi-layer Bi-LSTM, which is one type of recurrent neural network. The VLDN feature map generator 130 may further include an edge response calculator (see reference numeral 135 in FIG. 3), a direction number checker (see reference numeral 136 in FIG. 3), and a VLDN generator (see reference numeral 137 in FIG. 3). The apparatus is described in detail with reference to FIG. 2.

FIG. 2 is a detailed diagram illustrating an apparatus for predicting a human depression level by analyzing subtle facial expressions according to an embodiment of the present invention.

In the depression level prediction apparatus 100, the data storage unit 110 may store face video data 111. The face video data 111 stored in the data storage unit 110 is data that includes a person's facial expressions. For example, the face video data 111 may be a stored video recording of the face of a patient whose depression level needs to be assessed, or it may be video that such a patient recorded personally. The video data 111 stored in the data storage unit 110 is not to be construed restrictively: any video data 111 that includes a human face may be included regardless of format and size.

The spatial information generator 120 may generate spatial information by extracting spatial features from a face image 121 of the video data stored in the data storage unit 110. The face image 121 may be a sample RGB frame of the face extracted from the face video data 111 stored in the data storage unit 110. To extract spatial features from the face image 121 and generate spatial information effectively, the spatial information generator 120 may use the Inception-ResNet-v2 convolutional network model. The Inception-ResNet-v2 model can learn general image features such as texture, color, and edge information, and it may be a model pre-trained on the ImageNet data set.

The spatial information in the present invention may be obtained by extracting spatial features from the face image 121 as a whole. According to another embodiment of the present invention, the spatial information generator 120 may divide the face image 121 into an arbitrary number of pieces and extract spatial features of the face image 121 from the divided pieces. For example, the face image 121 may be divided into four pieces 122, and spatial features may be extracted from each piece. The Inception-ResNet-v2 convolutional network model may also be used for the spatial features of each piece. The spatial information generator 120 may extract and aggregate spatial features for the face images of all of the video data. The features extracted from all of the image data may be aggregated, and the spatial information may be generated using a convolutional neural network (CNN) model with a Euclidean loss function.
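The four-piece division described above can be sketched as follows. This is a minimal illustration only: the text does not say how the four pieces are laid out, so the 2x2 grid, the function name `split_into_quadrants`, and the NumPy array representation of the frame are all assumptions.

```python
import numpy as np

def split_into_quadrants(image):
    """Split an H x W (x C) face image into four pieces on a 2x2 grid.

    The 2x2 layout is an assumption; the source only states that the
    image may be divided into four pieces.
    """
    height, width = image.shape[:2]
    mid_h, mid_w = height // 2, width // 2
    return [
        image[:mid_h, :mid_w],   # top-left
        image[:mid_h, mid_w:],   # top-right
        image[mid_h:, :mid_w],   # bottom-left
        image[mid_h:, mid_w:],   # bottom-right
    ]

# Hypothetical 4 x 6 RGB frame standing in for a face image.
frame = np.arange(4 * 6 * 3).reshape(4, 6, 3)
pieces = split_into_quadrants(frame)
print([p.shape for p in pieces])
```

Each piece, like the whole image, would then be passed to the spatial feature extractor.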

The information processing unit 150 may use, as input data, the spatial information derived from the spatial features of the entire face image 121 and from the spatial features of the divided pieces 122. The VLDN feature map generator 130, which generates the dynamic information, is described below.

The VLDN feature map generator 130 may extract three consecutive frames 131 (previous, current, and next) from the face video data 111 stored in the data storage unit 110. For these consecutive frames 131, the VLDN feature map generator 130 may generate a volume local directional number (VLDN) feature map for analyzing facial dynamics. The generated VLDN feature map may be produced as a VLDN grayscale image and used as the input to a CNN with a channel size of 1. The output of the CNN model may serve as the dynamic information of the video data. VLDN feature map generation is described in detail below.
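As a small illustration of the frame extraction step, the sketch below yields (previous, current, next) triplets from a frame sequence. The name `frame_triplets` is hypothetical, and the use of overlapping windows with a stride of one frame is an assumption; the text only says that three consecutive frames are extracted.

```python
def frame_triplets(frames):
    """Yield (previous, current, next) for every interior frame.

    Overlapping windows with stride 1 are an assumption, not a detail
    stated in the source.
    """
    for t in range(1, len(frames) - 1):
        yield frames[t - 1], frames[t], frames[t + 1]

# Frames are represented by labels here; in practice they would be images.
for triplet in frame_triplets(["f0", "f1", "f2", "f3"]):
    print(triplet)
```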

FIG. 3 is a diagram illustrating the detailed configuration of the VLDN feature map generator.

The VLDN feature map generator 130 may include an edge response calculator 135, a direction number checker 136, and a VLDN generator 137. The edge response calculator 135 may generate a VLDN feature map, where VLDN is an extension of the local directional number (LDN).

More specifically, intermediate values may be generated from the pixel values of the previous, current, and next frames 131 through processing by the edge response calculator 135. The edge response calculator 135 may compute, based on Kirsch masks, the edge responses of the pixels adjacent to a center pixel. The edge response calculator 135 may compute the edge responses for the three consecutive frames 131 through Equation (1). Here, PR, CR, and PO correspond to the pixel values of the previous, current, and next frames 131, respectively.
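To make the Kirsch-mask step concrete, the sketch below computes the eight directional edge responses for a single 3x3 neighborhood. It is illustrative only: Equation (1) is not reproduced in this text, so the way the three frames PR, CR, and PO are combined is not shown, and the direction numbering (index 0 for the mask whose top row is 5, 5, 5, then clockwise) is an assumption. The function names are hypothetical.

```python
import numpy as np

# Border cells of a 3x3 window, clockwise from the top-left corner.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def kirsch_masks():
    """Build the eight Kirsch compass masks by rotating the border values."""
    base = [5, 5, 5, -3, -3, -3, -3, -3]  # border of the mask with 5s on the top row
    masks = []
    for shift in range(8):
        mask = np.zeros((3, 3), dtype=int)  # the center weight stays 0
        for k, (row, col) in enumerate(RING):
            mask[row, col] = base[(k - shift) % 8]
        masks.append(mask)
    return masks

def edge_responses(patch):
    """Return the eight Kirsch responses for one 3x3 pixel neighborhood."""
    patch = np.asarray(patch)
    return np.array([int((mask * patch).sum()) for mask in kirsch_masks()])

# A neighborhood with a bright top row responds most strongly to mask 0.
patch = [[10, 10, 10],
         [0, 0, 0],
         [0, 0, 0]]
print(edge_responses(patch))
```

Under these assumptions, the same function could be applied to the corresponding neighborhood in each of the three frames.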

The direction number checker 136 may identify the direction numbers of the top positive and top negative responses. Equation (2) gives the expression the direction number checker 136 uses to identify these direction numbers.

For example, in FIG. 3 the top positive value across the three consecutive frames 131 is 620, and the direction number of 620 is 6. The top negative value is -740, and the direction number of -740 is 1. The VLDN generator 137 may generate the VLDN value for the three consecutive frames 131 using Equation (3).

Here, MP_{x,y} is the direction number of the top positive response, and MN_{x,y} is the direction number of the top negative response. For example, according to an embodiment of the present invention, MP_{x,y} may be 6 and MN_{x,y} may be 1. The VLDN_{x,y} value for the three consecutive frames 131 is then 49, which is 110001 in binary.
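The worked example can be checked numerically. Equation (3) is not reproduced in this text, so the formula used below, code = 8 * MP + MN, is inferred from the example itself (6 and 1 giving 49, or 110 followed by 001 in binary) and from the usual LDN coding; treat it as an assumption. The function name and all response values other than 620 and -740 are made up for illustration.

```python
import numpy as np

def vldn_code(responses):
    """Encode the top positive and top negative directions as one number.

    Assumes 8 * MP + MN, which is inferred from the worked example rather
    than taken from Equation (3).
    """
    responses = np.asarray(responses)
    mp = int(np.argmax(responses))  # direction of the top positive response
    mn = int(np.argmin(responses))  # direction of the top negative response
    return 8 * mp + mn

# Hypothetical responses: 620 at direction 6 and -740 at direction 1.
responses = [120, -740, 35, -210, 300, -60, 620, 15]
code = vldn_code(responses)
print(code, format(code, "06b"))  # 49 110001
```

The three high bits hold the top positive direction and the three low bits hold the top negative direction, which is why the result fits in six bits.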

The VLDN generator 137 may generate VLDN values for every set of three consecutive frames 131 in the video data stored in the data storage unit 110. A VLDN feature map may then be generated from the resulting VLDN values. An example of a generated VLDN feature map is shown below.

FIG. 4 is an example of a VLDN feature map according to an embodiment of the present invention.

The VLDN values computed by the edge response calculator 135, the direction number checker 136, and the VLDN generator 137 may be used to generate the VLDN feature map as an image. FIG. 4 shows part of a VLDN feature map generated for face video data stored in the data storage unit 110. A CNN model is used to generate effective dynamic information from such a VLDN feature map. The CNN model is described in detail with reference to FIG. 5.

FIG. 5 is an example of the CNN model used to produce the dynamic information generated by the VLDN feature map generator 130 according to an embodiment of the present invention.

The VLDN feature map generator 130 may generate dynamic information about facial movements by feeding the generated VLDN feature map into a CNN model. The CNN model may be trained in advance so that it can model facial movements. The CNN model may use 5x5 filters instead of 3x3 filters in the first and second convolutional layers. The CNN model may use one convolution instead of three convolutions in the first and second layers. The CNN model may consist of 10 convolutional layers, 5 max-pooling layers, and 3 fully connected layers. The CNN model may use a Euclidean loss function instead of a softmax loss function. With the VLDN feature map as its input, the CNN model can generate dynamic information about facial movements. Returning to FIG. 2, the information processing unit 150 and the level prediction unit 160 are described in detail below.

The information processing unit 150 may receive the spatial information 151 and the dynamic information 152 and process them into input values for the level prediction unit 160. The level prediction unit 160 may predict the depression level using a Bi-LSTM model.

A recurrent neural network (RNN) is a model that can effectively learn sequential information by mapping inputs to outputs through hidden states. However, RNN-based approaches suffer from exploding and vanishing gradients when the input is a long sequence.

Long short-term memory (LSTM), a type of recurrent neural network, addresses the vanishing gradient problem through input, forget, and output gates and can learn long sequences. According to an embodiment of the present invention, the level prediction unit 160 may learn the spatial information and the dynamic information using a multilayer Bi-LSTM model 161 rather than a single LSTM layer. As a result, the second Bi-LSTM layer learns both from the first Bi-LSTM layer and from the previous state of its own layer. The multilayer Bi-LSTM model 161 can learn long temporal features when frame-to-frame features are provided as input, but using frame-level features directly as input to the LSTM model can leave it vulnerable to noise.

Therefore, before the data is input to the multilayer Bi-LSTM model 161 of the level prediction unit 160, the information processing unit 150 can address this vulnerability to noise through the temporal median pooling (TMP) method. The TMP method divides the spatial information 151 and the dynamic information 152 in time into groups of an arbitrary size. For example, the spatial information 151 and the dynamic information 152 may be grouped five items at a time, with each group of five treated as one unit. The group size is not to be construed restrictively and should be understood as any size that a person skilled in the art could choose.
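The grouping step, together with the per-group median that gives TMP its name, can be sketched as follows. This is a minimal illustration under stated assumptions: features are stored as a NumPy array with one row per frame, the function name `temporal_median_pooling` is hypothetical, and trailing frames that do not fill a complete group are dropped, which is a choice the text does not specify.

```python
import numpy as np

def temporal_median_pooling(features, group_size=5):
    """Pool a (T, D) feature sequence into (T // group_size, D) medians.

    Dropping an incomplete final group is an assumption of this sketch.
    """
    features = np.asarray(features, dtype=float)
    n_groups = features.shape[0] // group_size
    if n_groups == 0:
        raise ValueError("sequence is shorter than one group")
    trimmed = features[: n_groups * group_size]
    grouped = trimmed.reshape(n_groups, group_size, -1)
    return np.median(grouped, axis=1)  # element-wise median within each group

# Ten frames of 2-D features; one corrupted frame barely moves the median.
features = np.arange(20, dtype=float).reshape(10, 2)
features[2] = [1000.0, -1000.0]  # simulated noisy frame
print(temporal_median_pooling(features, group_size=5))
```

Because the median ignores extreme values within a group, a single noisy frame has little effect on the pooled feature, which is the motivation given for applying TMP before the Bi-LSTM.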

After the division, each group is treated as one unit and its median is returned. Through the TMP method, the information processing unit 150 may aggregate the median values 153 of the spatial information 151 and the median values 154 of the dynamic information 152 and use them as the input to the multilayer Bi-LSTM model of the level prediction unit 160. The multilayer Bi-LSTM used by the level prediction unit 160 is one type of recurrent neural network; other types include the LSTM and the RNN. A basic description of the LSTM follows.

Here, σ denotes the logistic function and tanh denotes the hyperbolic tangent function.

i_t, f_t, and O_t denote the input, forget, and output gates. W_i, W_f, W_O and b_i, b_f, b_O are the weight matrices and bias terms of the input, forget, and output gates, respectively. X_t is the input at time t, e_t denotes the cell input state, and C_t denotes the cell output state. Finally, the hidden layer state is denoted by h_t.
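The gate equations that these symbols belong to do not appear in this text. For orientation, a standard LSTM formulation written with the same symbols is given below. The candidate-state parameters W_e and b_e, the concatenation [h_{t-1}, X_t], and the element-wise product ⊙ are assumptions of this sketch and are not taken from the source; only the final line matches a relation the description states explicitly for the forward and backward hidden states.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i\,[h_{t-1}, X_t] + b_i\right) \\
f_t &= \sigma\left(W_f\,[h_{t-1}, X_t] + b_f\right) \\
O_t &= \sigma\left(W_O\,[h_{t-1}, X_t] + b_O\right) \\
e_t &= \tanh\left(W_e\,[h_{t-1}, X_t] + b_e\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot e_t \\
h_t &= O_t \odot \tanh(C_t)
\end{aligned}
```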

Like the LSTM model, the multilayer Bi-LSTM model 161 consists of input, forget, and output gates and a memory unit. In the multilayer Bi-LSTM model 161, sequential data is processed in both the forward and backward directions using two separate hidden layers connected to the same output layer. The hidden states of the forward and backward layers are given by h_f = O_f tanh(C_f) and h_b = O_b tanh(C_b). The final hidden state is denoted by H = (h_f, h_b). The multilayer Bi-LSTM model 161 is built by stacking two Bi-LSTM cells to learn the spatial information and the dynamic information. The multilayer Bi-LSTM model 161 learns from frame-to-frame features supplied as input, but feeding frame-level features directly into the LSTM model leaves it susceptible to noise. Accordingly, in the present invention the depression level is predicted through the multilayer Bi-LSTM model 161 in a way that resolves this noise-interference problem.
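The forward and backward processing and the stacking of two Bi-LSTM layers can be illustrated with a small NumPy sketch. It is not the trained model of the invention: the weights are random, the layer sizes are arbitrary, the gate equations follow the standard LSTM form (with a candidate state whose parameters are named W_e and b_e here as an assumption), and all function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(input_size, hidden_size, rng):
    """Random weights for one direction; every gate acts on [h_prev, x]."""
    shape = (hidden_size, hidden_size + input_size)
    params = {}
    for gate in ("i", "f", "o", "e"):
        params["W_" + gate] = rng.normal(scale=0.1, size=shape)
        params["b_" + gate] = np.zeros(hidden_size)
    return params

def lstm_pass(xs, p):
    """Run one LSTM direction over a (T, D) sequence; return (T, hidden)."""
    hidden_size = p["b_i"].shape[0]
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    states = []
    for x in xs:
        z = np.concatenate([h, x])
        i = sigmoid(p["W_i"] @ z + p["b_i"])   # input gate
        f = sigmoid(p["W_f"] @ z + p["b_f"])   # forget gate
        o = sigmoid(p["W_o"] @ z + p["b_o"])   # output gate
        e = np.tanh(p["W_e"] @ z + p["b_e"])   # cell input state
        c = f * c + i * e                      # cell state update
        h = o * np.tanh(c)                     # hidden state
        states.append(h)
    return np.stack(states)

def bilstm_layer(xs, fwd, bwd):
    """Concatenate forward and backward hidden states: H = (h_f, h_b)."""
    h_f = lstm_pass(xs, fwd)
    h_b = lstm_pass(xs[::-1], bwd)[::-1]  # re-align backward states in time
    return np.concatenate([h_f, h_b], axis=1)

rng = np.random.default_rng(0)
xs = rng.normal(size=(7, 4))  # 7 time steps of 4-D pooled features
layer1 = (init_params(4, 3, rng), init_params(4, 3, rng))
layer2 = (init_params(6, 2, rng), init_params(6, 2, rng))
out1 = bilstm_layer(xs, *layer1)    # first Bi-LSTM layer
out2 = bilstm_layer(out1, *layer2)  # second layer stacked on the first
print(out1.shape, out2.shape)  # (7, 6) (7, 4)
```

The second layer takes the concatenated states of the first as its input sequence, which is what stacking two Bi-LSTM cells amounts to in this sketch.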

In the multilayer Bi-LSTM model 161, the two layers may be an output layer and a regression layer. The layers may have different numbers of hidden units; for example, there may be 512 or 256 hidden units. The median values 153 of the spatial information 151 and the median values 154 of the dynamic information 152, produced by the TMP method in the information processing unit 150, may be used as the input to the multilayer Bi-LSTM model 161. The average of the results that the multilayer Bi-LSTM model 161 produces for the spatial information 151 and for the dynamic information 152 may be taken as the final predicted depression level. A method of predicting a human depression level through deep learning analysis of subtle facial expressions according to an embodiment of the present invention is described below with reference to FIG. 6.

FIG. 6 is a flowchart illustrating a method of predicting a human depression level by analyzing fine facial expressions according to an embodiment of the present invention.

In describing the depression level prediction method, details that overlap with the depression level prediction apparatus 100 described above may be omitted. The apparatus 100 may be implemented as a server, and is therefore referred to as the server below. The server may store face images and video data. The server may generate spatial information from a face image (S101). The server may extract three consecutive frames 131 from the face video data and, based on the consecutive frames 131, generate a volume local directional number (VLDN) feature map for analyzing facial dynamics (S103).

The server may input the VLDN feature map to a deep convolutional neural network (CNN) model to generate dynamic information about facial movement (S105). The server may generate output values from the spatial information 151 and the dynamic information 152 through the temporal median pooling (TMP) method (S107). The server may predict the human depression level from the output values based on a recurrent neural network, and may use the multilayer Bi-LSTM model 161 as the recurrent neural network (S109).

The VLDN feature map may be generated through the steps of calculating edge responses, identifying direction numbers, and generating VLDN values. The details overlap with those described for the depression level prediction apparatus 100 and are therefore omitted.
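The three steps (edge responses, direction numbers, VLDN values) can be sketched as follows. This is only an illustration under stated assumptions. It uses the standard eight Kirsch masks and the usual local directional number coding, 8 × (index of the highest positive response) + (index of the most negative response). Two things are assumptions of this sketch and not taken from the specification: the way the three frames are combined (here the responses are simply summed across frames) and the numbering of the directions.

```python
import numpy as np

# Border positions of a 3x3 mask, clockwise from the top-left corner.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def kirsch_masks():
    """Build the eight Kirsch masks by rotating the border values."""
    base = [5, 5, 5, -3, -3, -3, -3, -3]
    masks = []
    for k in range(8):
        m = np.zeros((3, 3))
        for idx, (r, c) in enumerate(RING):
            m[r, c] = base[(idx - k) % 8]
        masks.append(m)
    return masks

def edge_responses(frame, masks):
    """Step 1: response of every pixel neighbourhood to each mask, shape (8, H, W)."""
    H, W = frame.shape
    padded = np.pad(frame.astype(float), 1, mode="edge")
    out = np.zeros((len(masks), H, W))
    for k, m in enumerate(masks):
        for dy in range(3):
            for dx in range(3):
                out[k] += m[dy, dx] * padded[dy:dy + H, dx:dx + W]
    return out

def vldn_map(frames):
    """Steps 2 and 3: direction numbers and a 6-bit code per pixel.

    Summing responses over the frames is an assumption of this sketch.
    """
    masks = kirsch_masks()
    total = sum(edge_responses(f, masks) for f in frames)
    pos = np.argmax(total, axis=0)   # highest positive direction number
    neg = np.argmin(total, axis=0)   # highest negative direction number
    return 8 * pos + neg

# Three identical 6x6 frames with a vertical step edge between columns 2 and 3.
frame = np.zeros((6, 6))
frame[:, 3:] = 10.0
code_map = vldn_map([frame, frame, frame])
flat_map = vldn_map([np.full((6, 6), 7.0)] * 3)
```

Each Kirsch mask sums to zero, so a flat region produces zero responses in every direction and receives code 0 in this sketch.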

Hereinafter, experiments comparing the performance of the depression level prediction apparatus 100 according to an embodiment of the present invention with that of existing methods, and their results, are described with reference to FIGS. 7 to 10.

FIG. 7 shows the results of experiments on the AVEC2013 and AVEC2014 datasets.

Before the experimental results on the performance of the depression level prediction apparatus 100 are presented, the experimental design and the data used are described. The performance experiments use two datasets, the Audio/Visual Emotion Challenge (AVEC) 2013 and 2014 depression sub-challenge datasets, and compare the method of the proposed apparatus with existing methods.

The AVEC2013 dataset consists of 150 videos of 82 people and is divided into three subsets: training, development, and test. Each subset contains 50 videos. 100 videos were used to train the model, and the remaining 50 videos may be used to evaluate the performance of the depression level prediction apparatus 100 of the present invention. Each video is 25 minutes long on average, and in each video a participant responds to specific questions while being recorded by a microphone and a webcam. Each video is also labeled with a depression level, defined through the BDI-II questionnaire.

The AVEC2014 depression dataset consists of 150 videos, divided into training, development, and test subsets of 50 videos each. As with the previous dataset, 100 videos were used for training and 50 videos were used to evaluate the performance of the proposed method. The videos were recorded with a webcam and a microphone, and each video is about 2 minutes long on average.

For better video analysis, a subsampling strategy was used to reduce the sampling rate from 30 frames per second to 6 frames per second, and each video clip was split into subsequences 40 frames long, with an overlap of 5 frames between consecutive subsequences. To obtain facial information, face detection and facial landmark localization are applied to each frame, and the face region is extracted from each frame and resized to 299 × 299.
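The subsampling and windowing described above can be sketched on frame indices alone. This is a minimal illustration, assuming that subsampling keeps every fifth frame (30 / 6 = 5) and that trailing frames which do not fill a complete 40-frame subsequence are dropped. Neither detail is stated in the specification, and the function names are hypothetical.

```python
def subsample_indices(num_frames, src_fps=30, dst_fps=6):
    """Keep every (src_fps // dst_fps)-th frame index."""
    step = src_fps // dst_fps
    return list(range(0, num_frames, step))

def split_windows(frames, length=40, overlap=5):
    """Split a frame list into fixed-length subsequences sharing `overlap` frames."""
    stride = length - overlap
    return [frames[s:s + length]
            for s in range(0, len(frames) - length + 1, stride)]

# A 2-minute clip at 30 fps has 3600 frames.
kept = subsample_indices(3600)
windows = split_windows(kept)
```

With these settings a 2-minute clip yields 720 subsampled frames, and each subsequence starts 35 frames after the previous one.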

In the present invention, the Inception-ResNet-v2 network is trained in MATLAB R2018b. The parameters of Inception-ResNet-v2 are initialized from a model pre-trained on ImageNet. The spatial information model uses the stochastic gradient descent (SGD) algorithm with a batch size of 32, a momentum of 0.9, and a weight decay of 0.0002, with the learning rate fixed at 0.001. In contrast, the CNN model proposed for dynamic information is trained from scratch, using SGD with a batch size of 32 and a learning rate of 0.001. Both models use a Euclidean loss function instead of a softmax loss function. The multilayer Bi-LSTM model 161 is trained with the Adam optimizer and 512 and 256 hidden units to learn spatiotemporal information, with a batch size of 10 and a learning rate of 0.001.

The depression level is measured by averaging the values predicted from the spatial information 151 and the dynamic information 152 for each video clip, and overall performance is evaluated using the mean absolute error (MAE) and the root mean square error (RMSE). Smaller MAE and RMSE values indicate more accurate depression level prediction.

MAE and RMSE are defined as follows:

$$\mathrm{MAE} = \frac{1}{N} \sum_{j=1}^{N} \left| \hat{x}_j - x_j \right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{j=1}^{N} \left( \hat{x}_j - x_j \right)^2}$$

where N is the total number of samples, \hat{x}_j is the predicted value, and x_j is the actual measured value of the j-th sample.
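As a small worked example of the evaluation, the following sketch averages the predictions of the spatial and dynamic streams and then computes MAE and RMSE against reference scores. The numbers are made up purely for illustration and are not experimental data. The function names are hypothetical.

```python
import numpy as np

def fuse(spatial_pred, dynamic_pred):
    """Average the per-clip predictions of the two streams."""
    return (np.asarray(spatial_pred, dtype=float)
            + np.asarray(dynamic_pred, dtype=float)) / 2.0

def mae(pred, truth):
    """Mean absolute error."""
    return float(np.mean(np.abs(pred - truth)))

def rmse(pred, truth):
    """Root mean square error."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

# Illustrative values only.
spatial = [10.0, 20.0, 30.0]
dynamic = [14.0, 18.0, 34.0]
truth = np.array([11.0, 22.0, 30.0])

fused = fuse(spatial, dynamic)   # [12.0, 19.0, 32.0]
mae_value = mae(fused, truth)    # errors are 1, -3, 2, so MAE = 2.0
rmse_value = rmse(fused, truth)  # sqrt((1 + 9 + 4) / 3)
```

RMSE penalizes the single larger error more heavily than MAE, which is why it comes out above 2.0 here.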

In the experiments, the performance of spatial information analysis and of dynamic information analysis is measured using, respectively, the Inception-ResNet-v2 model with the multilayer Bi-LSTM and the VLDN-CNN with the multilayer Bi-LSTM. MAE and RMSE are also estimated for the fusion of the spatial and temporal networks, which is performed by averaging the outputs of the two analyses. The experimental results are described below.

It can be seen that the depression level prediction apparatus 100 of the present invention, which analyzes both the spatial information 151 and the dynamic information 152, outperforms existing approaches with an MAE of 7.04 and an RMSE of 8.93. However, on the AVEC2013 dataset, Zhou et al. [31] achieve better performance with an MAE of 6.20 and an RMSE of 8.28, and on the AVEC2014 dataset they likewise achieve better performance with an MAE of 6.21 and an RMSE of 8.39. Nevertheless, the apparatus performs better than the deep-learning-based approaches of Zhu et al. [6] and Jazaery et al. [7]. Zhu et al. [6] use a deep learning model applied to both static frames and optical flow images, and Jazaery et al. [7] use an RNN-based model, but the motion-capture part is relatively noisy.

FIG. 8 shows, for the AVEC2013 and AVEC2014 datasets, the results of analyses in which spatial information is generated by extracting spatial features from the entire face image and from patches obtained by dividing the image into an arbitrary number of pieces.

When spatial information is generated, extracting spatial features from patches obtained by dividing the image into an arbitrary number of pieces gives better results than extracting spatial features from the entire face image alone. When spatial features are extracted from the patches and the resulting spatial information is analyzed, the MAE is 7.22 and the RMSE is 9.02 on the AVEC2013 dataset, and the MAE is 6.96 and the RMSE is 8.91 on the AVEC2014 dataset.

FIG. 9 shows, for the AVEC2013 and AVEC2014 datasets, the results of an experiment comparing a model that analyzes only dynamic information with other models.

The model of the present invention that analyzes only the dynamic information 152 using VLDN can be compared with a temporal model based on MHH [11] and a temporal model based on optical flow images [6]. On both the AVEC2013 and AVEC2014 datasets, the model that analyzes only the dynamic information 152 using VLDN performs better than the other models.

FIG. 10 shows experimental results for the TMP method on the AVEC2013 and AVEC2014 datasets.

The experimental results of FIG. 10 show how the performance of the depression level prediction apparatus varies with the number of temporal segments in the TMP method, in which the information processing unit 150 divides the spatial information 151 and the dynamic information 152 into an arbitrary number of temporal segments and takes the median of each. The figure also shows experimental results comparing the TMP method with temporal max pooling and other pooling methods.
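A minimal sketch of temporal median pooling follows, assuming that the frame-level features of one subsequence form a (frames × feature dimension) array, that segments are contiguous and of near-equal length, and that the median is taken element-wise over the frames in each segment. These details and the function name are assumptions for illustration and are not taken from the specification.

```python
import numpy as np

def temporal_median_pooling(features, num_segments=5):
    """Split features of shape (T, D) into temporal segments and return
    the element-wise median of each segment, shape (num_segments, D)."""
    features = np.asarray(features, dtype=float)
    if features.shape[0] < num_segments:
        raise ValueError("need at least one frame per segment")
    segments = np.array_split(features, num_segments, axis=0)
    return np.stack([np.median(seg, axis=0) for seg in segments])

# 40 frames of 3-dimensional features with easily checked values.
features = np.arange(40 * 3, dtype=float).reshape(40, 3)
pooled = temporal_median_pooling(features, num_segments=5)
```

Unlike max pooling, the median of a segment is not shifted by a single outlying frame, which is one plausible reason to prefer it for noisy frame-level features.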

On both the AVEC2013 and AVEC2014 datasets, the TMP method yields smaller MAE and RMSE values than the other pooling methods. Furthermore, the TMP method performs best when the number of temporal segments is 5. With 5 temporal segments, the depression level prediction apparatus 100 of the present invention measured an MAE of 7.04 and an RMSE of 9.08 on the AVEC2013 dataset, and an MAE of 6.86 and an RMSE of 8.78 on the AVEC2014 dataset.

The embodiments of the present invention disclosed in this specification and the drawings are merely specific examples presented to explain the technical content of the present invention and to aid understanding of it, and are not intended to limit the scope of the present invention. It will be apparent to those of ordinary skill in the art to which the present invention pertains that other modifications based on the technical idea of the present invention can be implemented in addition to the embodiments disclosed herein.

Claims

1. An apparatus for predicting a human depression level by analyzing fine facial expressions, the apparatus comprising:
a data storage unit that stores video data;
a spatial information generation unit that generates spatial information from the video data;
a VLDN feature map generation unit that extracts three consecutive frames from the video data, generates a volume local directional number (VLDN) feature map for analyzing facial dynamics based on the consecutive frames, and inputs the VLDN feature map to a deep convolutional neural network (CNN) model to generate dynamic information about facial movement;
an information processing unit that generates output values from the spatial information and the dynamic information through a temporal median pooling (TMP) method; and
a level prediction unit that predicts the human depression level from the output values based on a recurrent neural network.

2. The apparatus of claim 1, wherein the VLDN feature map generation unit comprises:
an edge response calculation unit that calculates edge responses of pixels adjacent to a center pixel based on Kirsch masks for the three consecutive frames;
a direction number identification unit that identifies a highest positive direction number and a highest negative direction number among the adjacent pixels; and
a VLDN generation unit that generates a VLDN value using the highest positive direction number and the highest negative direction number, and generates the VLDN feature map by generating all VLDN values for the three sequentially consecutive video frames.

3. The apparatus of claim 1, wherein the spatial information generation unit arbitrarily divides an image from the video data into four regions, extracts spatial features from the entire image, and generates the spatial information by extracting spatial features from each of the four regions.

4. The apparatus of claim 1, wherein, in the level prediction unit, the recurrent neural network is a Bi-LSTM composed of two layers.

5. The apparatus of claim 1, wherein the information processing unit divides the spatial information and the dynamic information into an arbitrary number of pieces, and returns a median value of each piece as an output value.

6. A method of predicting a human depression level by analyzing fine facial expressions, the method comprising:
storing, by a data storage unit, video data;
generating, by a spatial information generation unit, spatial information from a face image of the video data;
extracting, by a VLDN feature map generation unit, three consecutive frames from the video data and generating a volume local directional number (VLDN) feature map for analyzing facial dynamics based on the consecutive frames;
generating, by the VLDN feature map generation unit, dynamic information about facial movement by inputting the VLDN feature map to a deep convolutional neural network (CNN) model;
generating, by an information processing unit, output values from the spatial information and the dynamic information through a temporal median pooling (TMP) method; and
predicting, by a level prediction unit, the human depression level from the output values based on a recurrent neural network.

7. The method of claim 6, wherein the generating of the VLDN feature map further comprises:
calculating edge responses of pixels adjacent to a center pixel based on Kirsch masks for the three consecutive frames;
identifying a highest positive direction number and a highest negative direction number among the adjacent pixels; and
generating a VLDN value using the highest positive direction number and the highest negative direction number, and generating the VLDN feature map by generating all VLDN values for the three sequentially consecutive video frames.

8. The method of claim 6, wherein the generating of the spatial information comprises:
arbitrarily dividing an image from the video data into four regions;
extracting spatial features from the entire image; and
generating the spatial information by extracting spatial features from each of the four regions.

9. The method of claim 6, wherein the recurrent neural network is a Bi-LSTM composed of two layers.

10. The method of claim 6, wherein the generating of the output values comprises dividing the spatial information and the dynamic information into an arbitrary number of pieces, and returning a median value of each piece as an output value.