KR20220025332A - Apparatus for performing conversation with user based on emotion analysis and method therefor

Info

Publication number: KR20220025332A
Application number: KR1020200105880A
Authority: KR
Inventors: 신사임; 장진예; 정민영; 정혜동; 김산
Original assignee: 한국전자기술연구원
Priority date: 2020-08-24
Filing date: 2020-08-24
Publication date: 2022-03-03

Abstract

A device for monitoring the change of emotion by the analysis of mobility and voice according to the present invention comprises: a communication part for communication; a camera part for consistently photographing a predetermined space of a user to generate an observation video; an audio part for consistently receiving voice in the space to generate observation voice; a mobility analysis part for generating a movement pattern of the user based on the observation video and analyzing the movement pattern; an enunciation analysis part for generating an enunciation pattern of the user based on the observation voice and analyzing the generated enunciation pattern; an emotion analysis part for analyzing the emotion state of the user using a neural network having learned the emotion state of the user corresponding to the movement pattern and the enunciation pattern based on the movement pattern and the enunciation pattern; and a report part for transmitting the analyzed emotion state to a guardian device and a remote medical treatment device through the communication part.

Description

Apparatus for performing conversation with user based on emotion analysis and method therefor

The present invention relates to technology for conversing with a user and, more particularly, to an apparatus for conversing with a user based on emotion analysis and a method therefor.

Recently, as the population ages and nuclear families become the norm, interest has been growing in healthcare services for elderly people who live alone without a caregiver to help them and for mentally ill patients with severe mood swings.

Korean Patent Laid-Open Publication No. 2020-0058612, published May 28, 2020 (Title: Artificial Intelligence Speaker and Conversation Method Using the Same)

An object of the present invention is to provide an apparatus, and a method therefor, that collects the user's psychological state through long-term interaction with the user and, on that basis, converses with the user based on emotion analysis so as to hold an empathic conversation that can secure the user's emotional stability.

To achieve the above object, an apparatus for conversing with a user based on emotion analysis according to a preferred embodiment of the present invention includes: a camera unit for capturing an image; an audio unit including a microphone for receiving voice and a speaker for outputting voice; a state analysis unit that derives the user's state by analyzing the image input through the camera unit and the voice input through the microphone; a conversation forming unit that extracts the user's utterance from the user's voice and generates a response to the extracted utterance in consideration of the derived user's state; and a report unit that controls the audio unit to output the generated response as voice.

The conversation forming unit includes an utterance recognition model that extracts an utterance from the user's voice and an empathic response generation model that generates, in response to the extracted utterance, a response that empathizes with the user's state. That is, the conversation forming unit extracts the utterance from the user's voice through the utterance recognition model and generates the empathic response to the extracted utterance through the empathic response generation model.

The conversation forming unit may prepare training data that includes an input, consisting of a training utterance and a training state, and a label, which is a model response that answers the training utterance while empathizing with the training state. The conversation forming unit inputs the training utterance and the training state into the empathic response generation model and, when the model computes a response by performing operations in which not-yet-trained weights are applied to the training utterance and the training state, updates the weights of the empathic response generation model so that the difference between the computed response and the model response set as the label reaches a targeted minimum.
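The weight update described above can be pictured with a deliberately small sketch. It is not the disclosed model: it assumes the training utterance and training state have already been encoded as a numeric feature list and the model response as a single number, and it uses a linear model with squared error and plain gradient descent as stand-ins. The names `predict`, `train`, and `mean_squared_error` are hypothetical.

```python
def predict(weights, features):
    """Weighted sum of the input features (stand-in for the model's output)."""
    return sum(w * x for w, x in zip(weights, features))


def train(samples, learning_rate=0.1, epochs=200):
    """Fit weights by gradient descent on squared error.

    `samples` is a list of (features, label) pairs. The features stand in
    for an encoded (training utterance, training state) input and the
    label for an encoded model response; both encodings are assumptions.
    """
    weights = [0.0] * len(samples[0][0])  # untrained starting weights
    for _ in range(epochs):
        for features, label in samples:
            error = predict(weights, features) - label
            # Gradient of 0.5 * error**2 with respect to each weight.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights


def mean_squared_error(weights, samples):
    """Average squared difference between outputs and labels."""
    return sum((predict(weights, f) - y) ** 2 for f, y in samples) / len(samples)


# Toy data that a linear model can fit exactly.
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
before = mean_squared_error([0.0, 0.0], samples)
trained = train(samples)
after = mean_squared_error(trained, samples)
```

Here `after` is far smaller than `before`, which is the sense in which the weights are updated until the difference from the label reaches a targeted minimum.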

The state analysis unit includes an image recognition model that analyzes the user's state from the image, a voice recognition model that analyzes the user's state from the voice, and a state analysis model that derives the user's state by combining the analysis result of the image recognition model with the analysis result of the voice recognition model. That is, the state analysis unit analyzes the user's state from the image through the image recognition model, analyzes the user's state from the voice through the voice recognition model, and then derives the user's state through the state analysis model by combining the two analysis results.

To achieve the above object, a method for conversing with a user based on emotion analysis according to a preferred embodiment of the present invention includes: a step in which a state analysis unit receives an image and voice of the user and derives the user's state by analyzing the received image and voice; a step in which a conversation forming unit extracts the user's utterance from the voice and generates a response to the extracted utterance in consideration of the derived user's state; and a step in which an audio unit outputs the generated response as voice.
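The three method steps can be sketched as a small pipeline. This is only an illustration of the order of operations: the class `ConversationPipeline`, its field names, and the stub functions passed to it are all hypothetical, and each callable stands in for a trained model or a hardware interface that the description does not specify at this level.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ConversationPipeline:
    """Wires the stages together; every callable is a placeholder."""
    analyze_state: Callable[[bytes, bytes], Sequence[float]]  # image, voice -> state
    recognize_utterance: Callable[[bytes], str]               # voice -> text
    generate_response: Callable[[str, Sequence[float]], str]  # text, state -> reply
    speak: Callable[[str], None]                              # reply -> audio output

    def step(self, image: bytes, voice: bytes) -> str:
        state = self.analyze_state(image, voice)           # derive the user's state
        utterance = self.recognize_utterance(voice)        # extract the utterance
        response = self.generate_response(utterance, state)
        self.speak(response)                               # output as voice
        return response


# Usage with trivial stubs in place of the real models.
spoken = []
pipeline = ConversationPipeline(
    analyze_state=lambda image, voice: [0.1, 0.9],
    recognize_utterance=lambda voice: voice.decode("utf-8"),
    generate_response=lambda text, state: f"heard: {text} (state={max(state)})",
    speak=spoken.append,
)
reply = pipeline.step(b"", b"hello")
```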

The step of generating the response includes a step in which the conversation forming unit extracts the user's utterance from the voice through the utterance recognition model, and a step in which the conversation forming unit generates, through the empathic response generation model, a response to the extracted utterance that empathizes with the user's state.

The method further includes, before the step of deriving the user's state: a step in which the conversation forming unit prepares training data that includes an input, consisting of a training utterance and a training state, and a label, which is a model response that answers the training utterance while empathizing with the training state; a step in which the conversation forming unit inputs the training utterance and the training state into the empathic response generation model; a step in which the empathic response generation model computes a response by performing operations in which not-yet-trained weights are applied to the training utterance and the training state; and a step in which the conversation forming unit updates the weights of the empathic response generation model so that the difference between the computed response and the model response set as the label reaches a targeted minimum.

The step of deriving the user's state includes: a step in which the image recognition model of the state analysis unit analyzes the user's state from the image and computes an image recognition state vector representing the user's state; a step in which the voice recognition model of the state analysis unit analyzes the user's state from the voice and computes a voice recognition state vector representing the user's state; and a step in which the state analysis model of the state analysis unit computes, from the image recognition state vector and the voice recognition state vector, a state vector representing the user's state.

According to the present invention, even without direct observation by a caregiver or other person in charge, the user's state can be automatically recognized and analyzed over a long period by using artificial intelligence conversation technology to hold continuous conversations that the user does not find off-putting. Moreover, the user can, without reluctance, hold conversations over a long period that empathize with the user as the situation requires.

FIG. 1 is a diagram illustrating the configuration of a system for conversing with a user based on emotion analysis according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the configuration of an apparatus for conversing with a user based on emotion analysis according to an embodiment of the present invention.
FIG. 3 is a block diagram illustrating the detailed configuration of an apparatus for conversing with a user based on emotion analysis according to an embodiment of the present invention.
FIG. 4 is a block diagram illustrating the state analysis unit of an apparatus for conversing with a user based on emotion analysis according to an embodiment of the present invention.
FIG. 5 is a block diagram illustrating the conversation forming unit of an apparatus for conversing with a user based on emotion analysis according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a method for conversing with a user based on emotion analysis according to an embodiment of the present invention.

Prior to the detailed description of the present invention, the terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. They should be interpreted with meanings and concepts consistent with the technical idea of the present invention, based on the principle that an inventor may appropriately define the concepts of terms in order to describe his or her own invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent all of its technical idea, so it should be understood that various equivalents and modifications that could replace them may exist at the time of filing this application.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. Note that the same components are denoted by the same reference numerals wherever possible in the accompanying drawings. Detailed descriptions of well-known functions and configurations that could obscure the gist of the present invention are omitted. For the same reason, some components in the accompanying drawings are exaggerated, omitted, or shown schematically, and the size of each component does not fully reflect its actual size.

First, a system for conversing with a user based on emotion analysis according to an embodiment of the present invention will be described. FIG. 1 is a diagram illustrating the configuration of a system for conversing with a user based on emotion analysis according to an embodiment of the present invention.

Referring to FIG. 1, a system for conversing with a user based on emotion analysis according to an embodiment of the present invention (hereinafter, the conversation system) includes a conversation device 10, a remote medical treatment device 20, and a guardian device 30.

The conversation device 10 converses with the user by generating responses to the user's utterances. In particular, the conversation device 10 generates a response that corresponds to the user's utterance and is also suited to the user's state. To estimate the user's state, a camera and a microphone, which are non-contact sensors, are installed in a designated space R in which observation of the user is permitted; through them the conversation device 10 collects images of the user's behavior and voice containing the user's utterances, and derives the user's state from the collected images and voice. Here, the user's state is expressed using at least one of emotion, fatigue, and situation. Emotion represents the user's state of mind or mood and may be expressed as a vector, called an emotion vector, that gives a numerical degree for each of a set of designated items such as surprise, joy, happiness, solitude, sadness, and anger. For example, an emotion vector may be written as (surprise, joy, happiness, solitude, sadness, anger) = [0.0014 0.0027 0.0102 0.7165 0.1222 0.0111]. Fatigue indicates how tired the user is. Fatigue may be expressed as a one-hot encoded vector, called a fatigue vector, indicating one of very tired, tired, somewhat tired, moderate, somewhat energetic, energetic, and very energetic. For example, a fatigue vector indicating "somewhat tired" may be written as (very tired, tired, somewhat tired, moderate, somewhat energetic, energetic, very energetic) = [0 0 1 0 0 0 0]. Situation indicates how busy the user is. Situation may be expressed as a one-hot encoded vector, called a situation vector, indicating one of very busy, busy, somewhat busy, normal, somewhat free, free, and very free. For example, a situation vector indicating "very free" may be written as (very busy, busy, somewhat busy, normal, somewhat free, free, very free) = [0 0 0 0 0 0 1].
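The three vector representations can be reproduced with a few lines of Python. This is an illustrative sketch only: the helper names `one_hot` and `dominant` are hypothetical, and the English level names are renderings chosen here for the seven fatigue levels and seven busyness levels listed in the description.

```python
# Component order follows the order given in the description.
EMOTIONS = ("surprise", "joy", "happiness", "solitude", "sadness", "anger")
FATIGUE_LEVELS = ("very tired", "tired", "somewhat tired", "moderate",
                  "somewhat energetic", "energetic", "very energetic")
SITUATION_LEVELS = ("very busy", "busy", "somewhat busy", "normal",
                    "somewhat free", "free", "very free")


def one_hot(levels, level):
    """Return a one-hot list with a 1 at the position of `level`."""
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    return [1 if name == level else 0 for name in levels]


def dominant(labels, vector):
    """Return the label whose component is largest (hypothetical helper)."""
    if len(labels) != len(vector):
        raise ValueError("labels and vector must have the same length")
    return max(zip(labels, vector), key=lambda pair: pair[1])[0]


# The emotion vector from the description: a degree for each item.
emotion_vector = [0.0014, 0.0027, 0.0102, 0.7165, 0.1222, 0.0111]
# The fatigue and situation examples from the description.
fatigue_vector = one_hot(FATIGUE_LEVELS, "somewhat tired")
situation_vector = one_hot(SITUATION_LEVELS, "very free")
strongest_emotion = dominant(EMOTIONS, emotion_vector)
```

With the example values, the largest emotion component is the fourth one, so `strongest_emotion` is the fourth label in `EMOTIONS`.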

The conversation device 10 generates responses through a plurality of artificial neural networks (ANNs) that produce a response suited to the user's state. The plurality of artificial neural networks are trained by deep learning to generate, for a given user utterance, a response suited to the user's state.

In addition, the conversation device 10 may transmit the user's state and the conversation with the user to the remote medical treatment device 20 and the guardian device 30. The remote medical treatment device 20 is for remote medical care and is used by the user's attending physician, who can continuously monitor the user's state and conversations through it. The guardian device 30 is used by the user's guardian, who can likewise continuously monitor the user's state and conversations through it.

Next, an apparatus for conversing with a user based on emotion analysis according to an embodiment of the present invention will be described. FIG. 2 is a diagram illustrating the configuration of an apparatus for conversing with a user based on emotion analysis according to an embodiment of the present invention. Referring to FIG. 2, the conversation device 10 includes a communication unit 11, a camera unit 12, an audio unit 13, an input unit 14, a display unit 15, a storage unit 16, and a control unit 17.

The communication unit 11 is for communicating with the remote medical treatment device 20 and the guardian device 30. The communication unit 11 may include a radio frequency (RF) transmitter (Tx) that up-converts and amplifies the frequency of a signal to be transmitted, and an RF receiver (Rx) that low-noise amplifies a received signal and down-converts its frequency. The communication unit 11 may also include a modem that modulates signals to be transmitted and demodulates received signals.

The camera unit 12 is for capturing images and includes an image sensor. The image sensor receives light reflected from a subject and converts it into an electrical signal, and may be implemented based on a CCD (Charge-Coupled Device), a CMOS (Complementary Metal-Oxide Semiconductor) sensor, or the like. The camera unit 12 may further include an analog-to-digital converter, which converts the electrical signal output from the image sensor into a digital sequence and outputs it to the control unit 17. In particular, the camera unit 12 includes a 3D sensor. The 3D sensor acquires, in a non-contact manner, three-dimensional coordinates for each pixel of an image. When the camera unit 12 photographs an object, the 3D sensor detects the three-dimensional coordinates of each pixel of the captured image of the object and passes the detected coordinates to the control unit 17. Various types of sensors using laser, infrared, visible light, and the like may be used as the 3D sensor. Examples include a laser 3D scanner using any one of TOF (Time of Flight), phase shift, and online waveform analysis; a laser 3D scanner using optical triangulation; an optical 3D scanner using white light or modulated light; a handheld real-time photo-type or optical-type 3D scanner; an optical or laser whole-body scanner using pattern projection or line scanning; a photographic scanner using photogrammetry; and a real-time scanner using Kinect Fusion. In particular, the camera unit 12 may continuously photograph the user's predetermined space to generate an observation image. The observation image is provided to the control unit 17.

The audio unit 13 includes a speaker (SPK) for outputting voice signals and a microphone (MIKE) for receiving voice signals. Under the control of the control unit 17, the audio unit 13 may output a voice signal through the speaker (SPK) or pass a voice signal input through the microphone (MIKE) to the control unit 17. The audio unit 13 continuously receives voice signals in the user's predetermined space through the microphone (MIKE) and provides the received voice to the control unit 17. When the audio unit 13 receives a response to the user's utterance from the control unit 17, it may output the received response as a voice signal through the speaker (SPK).

The input unit 14 receives the user's key operations for controlling the conversation device 10, generates an input signal, and passes it to the control unit 17. The input unit 14 may include various keys for controlling the conversation device 10. When the display unit 15 is a touch screen, the functions of these keys may be performed on the display unit 15, and when all functions can be performed with the touch screen alone, the input unit 14 may be omitted.

The display unit 15 visually provides the user with the menus of the conversation device 10, input data, function setting information, and various other information. The display unit 15 outputs screens of the conversation device 10 such as a boot screen, a standby screen, and a menu screen. In particular, the display unit 15 outputs the meter reading image according to the embodiment of the present invention to the screen. The display unit 15 may be formed of a liquid crystal display (LCD), organic light emitting diodes (OLED), active matrix organic light emitting diodes (AMOLED), or the like. The display unit 15 may also be implemented as a touch screen, in which case it includes a touch sensor that detects the user's touch input. The touch sensor may be a touch-sensing sensor of a capacitive overlay, pressure, resistive overlay, or infrared beam type, or may be a pressure sensor. Besides these, any kind of sensor device capable of detecting contact or pressure from an object may be used as the touch sensor of the present invention. The touch sensor detects the user's touch input, generates a detection signal, and transmits it to the control unit 17. In particular, when the display unit 15 is a touch screen, some or all of the functions of the input unit 14 may be performed through the display unit 15.

The storage unit 16 stores the programs and data necessary for the operation of the conversation device 10. In particular, the storage unit 16 is the area in which user data generated through use of the conversation device 10 is stored, for example movement patterns, utterance patterns, and emotion analysis results. The various data stored in the storage unit 16 may be deleted, modified, or added to according to the user's operation.

The control unit 17 may control the overall operation of the conversation device 10 and the signal flow between its internal blocks, and may perform a data processing function. The control unit 17 also basically serves to control the various functions of the conversation device 10. Examples of the control unit 17 include a CPU (Central Processing Unit), a BP (Baseband Processor), an AP (Application Processor), a GPU (Graphics Processing Unit), and a DSP (Digital Signal Processor).

The configuration and operation of the control unit 17 will now be described in more detail. FIG. 3 is a block diagram illustrating the detailed configuration of an apparatus for conversing with a user based on emotion analysis according to an embodiment of the present invention. FIG. 4 is a block diagram illustrating the state analysis unit of the apparatus. FIG. 5 is a block diagram illustrating the conversation forming unit of the apparatus.

First, referring to FIG. 3, the control unit 17 includes a state analysis unit 100, a conversation forming unit 200, and a report unit 300.

The state analysis unit 100 derives the user's state by analyzing the image input through the camera unit 12 and the voice input through the microphone (MIKE) of the audio unit 13.

Referring to FIG. 4, to derive the user's state, the state analysis unit 100 may use an image recognition model 110, a voice recognition model 120, and a state analysis model 130.

The image recognition model 110 is an artificial neural network model trained by deep learning to analyze the image input through the camera unit 12 and output an image recognition state vector, which is a vector representing the user's state. The image recognition model 110 computes the image recognition state vector by performing, on the image input through the camera unit 12, operations to which the weights set by training are applied.

The voice recognition model 120 is an artificial neural network model trained by deep learning to analyze the voice input through the microphone (MIKE) of the audio unit 13 and output a voice recognition state vector, which is a vector representing the user's state. The voice recognition model 120 computes the voice recognition state vector by performing, on the voice input through the microphone (MIKE) of the audio unit 13, operations to which the weights set by training are applied.

The state analysis model 130 is an artificial neural network model trained to combine the analysis result of the image recognition model 110 and the analysis result of the voice recognition model 120 and output a state vector, which is a vector representing the user's state. The state analysis model 130 receives the image recognition state vector calculated by the image recognition model 110 and the voice recognition state vector calculated by the voice recognition model 120, and calculates the state vector by performing operations on these two vectors in which weights set through training are applied.

As described above, each of the image recognition model 110, the voice recognition model 120, and the state analysis model 130 is an artificial neural network (ANN). Accordingly, each of these models has a plurality of layers, and each layer has one or more modules that perform operations, for example, nodes. Each operation in a layer receives as input the results of one or more operations of the previous layer with weights applied, performs an operation on that input, and outputs the result. The operation is defined by an activation function, examples of which include sigmoid and softmax.
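The layered computation described above can be illustrated with a minimal sketch. This is not the disclosed implementation: the two-layer shape, the function names (`dense`, `forward`), and the toy weights are assumptions chosen only to show weighted inputs passing through sigmoid and softmax activations.

```python
import math


def sigmoid(x):
    """Logistic activation for a single node."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Normalize a list of scores into probabilities that sum to 1."""
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Weighted sum per node: weights[j][i] connects input i to node j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Hidden layer with sigmoid nodes, followed by a softmax output layer."""
    hidden = [sigmoid(z) for z in dense(inputs, hidden_w, hidden_b)]
    return softmax(dense(hidden, out_w, out_b))


# Hypothetical toy weights: 2 inputs -> 3 hidden nodes -> 2 outputs.
probs = forward(
    [0.5, -1.0],
    hidden_w=[[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6]],
    hidden_b=[0.0, 0.0, 0.0],
    out_w=[[0.2, -0.1, 0.3], [0.0, 0.4, -0.2]],
    out_b=[0.0, 0.0],
)
print(probs)
```

Each node applies its weights to the outputs of the previous layer before the activation function, which mirrors the description of one layer feeding the next.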

The image recognition state vector, the voice recognition state vector, and the state vector described above each take the form of a concatenation of an emotion vector, a fatigue vector, and a situation vector. For example, each of these vectors may be expressed as (surprise, joy, happiness, loneliness, sadness, anger), (very tired, tired, somewhat tired, medium, somewhat energetic, energetic, very energetic), (very busy, busy, somewhat busy, normal, somewhat free, free, very free) = [0.0014 0.0027 0.0102 0.7165 0.1222 0.0111 0 0 1 0 0 0 0 0 0 0 0 0 0 1].
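As a small illustration of this layout, the sketch below splits the 20-element example vector into its three parts and picks the highest-valued category in each. Reading each part by its largest entry is an assumption of this sketch, as are the function names and the English category labels; the text only specifies the concatenated form.

```python
EMOTIONS = ["surprise", "joy", "happiness", "loneliness", "sadness", "anger"]
FATIGUE = ["very tired", "tired", "somewhat tired", "medium",
           "somewhat energetic", "energetic", "very energetic"]
SITUATION = ["very busy", "busy", "somewhat busy", "normal",
             "somewhat free", "free", "very free"]


def split_state_vector(vec):
    """Split a concatenated state vector into (emotion, fatigue, situation)."""
    expected = len(EMOTIONS) + len(FATIGUE) + len(SITUATION)
    if len(vec) != expected:
        raise ValueError(f"expected {expected} values, got {len(vec)}")
    e = len(EMOTIONS)
    f = e + len(FATIGUE)
    return vec[:e], vec[e:f], vec[f:]


def argmax_label(values, labels):
    """Return the label whose value is largest (an assumed decoding rule)."""
    return labels[max(range(len(values)), key=lambda i: values[i])]


def decode_state(vec):
    emotion, fatigue, situation = split_state_vector(vec)
    return {
        "emotion": argmax_label(emotion, EMOTIONS),
        "fatigue": argmax_label(fatigue, FATIGUE),
        "situation": argmax_label(situation, SITUATION),
    }


# The example vector given in the description.
example = [0.0014, 0.0027, 0.0102, 0.7165, 0.1222, 0.0111,
           0, 0, 1, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 1]
decoded = decode_state(example)
print(decoded)
```

Under this reading, the example vector corresponds to loneliness, somewhat tired, and very free.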

The state analysis unit 100 may train the image recognition model 110, the voice recognition model 120, and the state analysis model 130 to calculate a state vector representing the user's state from input voice and images. The state analysis unit 100 first trains the image recognition model 110 and the voice recognition model 120 individually and, once each has finished training, trains the image recognition model 110, the voice recognition model 120, and the state analysis model 130 together. This procedure is described below.

The state analysis unit 100 prepares training data for training the image recognition model 110. The training data includes an input, which is an image in which the user's state is known, and a label corresponding to that image. Here, the label may be a state vector corresponding to the user's state. The state analysis unit 100 inputs the image in which the user's state is known to the image recognition model 110. When the image recognition model 110 calculates an image recognition state vector representing the user's state by performing operations in which weights that have not yet been trained are applied, the state analysis unit 100 updates the weights of the image recognition model 110 so that the difference between the state vector set as the label and the calculated image recognition state vector reaches a target minimum value. By repeating this weight update over a plurality of training data, the state analysis unit 100 can train the image recognition model 110 to calculate an image recognition state vector representing the user's state.

The state analysis unit 100 prepares training data for training the voice recognition model 120. The training data includes an input, which is a voice in which the user's state is known, and a label corresponding to that voice. Here, the label may be a state vector corresponding to the user's state. The state analysis unit 100 inputs the voice in which the user's state is known to the voice recognition model 120. When the voice recognition model 120 calculates a voice recognition state vector representing the user's state by performing operations in which weights that have not yet been trained are applied, the state analysis unit 100 updates the weights of the voice recognition model 120 so that the difference between the state vector set as the label and the calculated voice recognition state vector reaches a target minimum value. By repeating this weight update over a plurality of training data, the state analysis unit 100 can train the voice recognition model 120 to calculate a voice recognition state vector representing the user's state.
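The individual training procedure described above (compute an output with the current weights, compare it with the label, update the weights, repeat) can be sketched as follows. The text does not specify a loss function or an optimizer, so the cross-entropy loss, the single softmax layer, the plain gradient-descent update, and the toy feature data are all assumptions of this sketch.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(weights, features):
    """Single softmax layer standing in for a recognition model."""
    return softmax([sum(w * x for w, x in zip(row, features)) for row in weights])


def cross_entropy(pred, label):
    """Assumed measure of the difference between output and label."""
    return -sum(y * math.log(p + 1e-12) for p, y in zip(pred, label))


def mean_loss(weights, data):
    return sum(cross_entropy(predict(weights, x), y) for x, y in data) / len(data)


def train(weights, data, lr=0.5, target_loss=0.05, max_epochs=500):
    """Repeat weight updates until the loss reaches the target or epochs run out."""
    for _ in range(max_epochs):
        if mean_loss(weights, data) <= target_loss:
            break
        for x, y in data:
            p = predict(weights, x)
            for j, row in enumerate(weights):
                err = p[j] - y[j]  # gradient of cross-entropy w.r.t. logit j
                for i in range(len(row)):
                    row[i] -= lr * err * x[i]
    return weights


# Hypothetical toy data: (feature vector, one-hot state label).
data = [
    ([1.0, 0.0, 1.0], [1, 0]),
    ([0.0, 1.0, 1.0], [0, 1]),
    ([0.9, 0.1, 1.0], [1, 0]),
    ([0.2, 0.8, 1.0], [0, 1]),
]
random.seed(0)
weights = [[random.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(2)]
loss_before = mean_loss(weights, data)
train(weights, data)
loss_after = mean_loss(weights, data)
print(loss_before, loss_after)
```

The `target_loss` parameter plays the role of the target minimum value for the difference between the label and the calculated vector.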

As described above, once the image recognition model 110 and the voice recognition model 120 have each finished training, the state analysis unit 100 trains the image recognition model 110, the voice recognition model 120, and the state analysis model 130 together, as follows. The state analysis unit 100 prepares a plurality of training data, each consisting of an input that includes an image and a voice in which the user's state is known and, as the label for that input, a state vector corresponding to the user's state. The state analysis unit 100 then inputs the image and the voice to the image recognition model 110 and the voice recognition model 120, respectively. The image recognition model 110 calculates an image recognition state vector by performing operations on the input image according to its trained weights, and the voice recognition model 120 calculates a voice recognition state vector by performing operations on the input voice according to its trained weights.

Then, when the state analysis model 130 calculates a state vector representing the user's state by performing operations on the image recognition state vector and the voice recognition state vector in which weights that have not yet been trained are applied, the state analysis unit 100 updates the weights of the state analysis model 130 so that the difference between the state vector set as the label and the calculated state vector reaches a target minimum value. By repeating this weight update over a plurality of training data, the state analysis unit 100 can train the state analysis model 130 to calculate a state vector representing the user's state.
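The second training stage can be sketched in the same spirit: the outputs of the two already-trained recognition models are taken as given, and only the weights of the combining model are updated. In this sketch the concatenation of the two vectors, the single softmax layer, the cross-entropy gradient, and the names `fuse` and `update_fusion` are assumptions; the two short input vectors are hypothetical stand-ins for the image and voice recognition state vectors.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def fuse(image_vec, voice_vec, weights):
    """Combine the two modality vectors into one state estimate."""
    features = image_vec + voice_vec  # list concatenation
    return softmax([sum(w * x for w, x in zip(row, features)) for row in weights])


def update_fusion(weights, image_vec, voice_vec, label, lr=0.5):
    """One weight update of the combining model; the inputs are left untouched."""
    features = image_vec + voice_vec
    pred = fuse(image_vec, voice_vec, weights)
    for j, row in enumerate(weights):
        err = pred[j] - label[j]
        for i in range(len(row)):
            row[i] -= lr * err * features[i]
    return pred


# Hypothetical outputs of the two pre-trained recognition models.
image_vec = [0.7, 0.3]
voice_vec = [0.4, 0.6]
label = [1, 0]

weights = [[0.0] * 4 for _ in range(2)]  # untrained combining weights
before = fuse(image_vec, voice_vec, weights)
for _ in range(50):
    update_fusion(weights, image_vec, voice_vec, label)
after = fuse(image_vec, voice_vec, weights)
print(before, after)
```

Keeping the recognition models fixed during this stage matches the description that their trained weights are used to produce the inputs while the combining model's weights are updated.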

Referring again to FIG. 3, the conversation forming unit 200 extracts the user's utterance from the user's voice and generates a response corresponding to the extracted utterance in consideration of the user's state derived by the state analysis unit 100. Referring to FIG. 5, the conversation forming unit 200 may use a speech recognition model 210 and an empathy response generation model 220 to generate the response. Both the speech recognition model 210 and the empathy response generation model 220 are artificial neural networks (ANNs). Accordingly, each has a plurality of layers, and each layer has one or more modules that perform operations, for example, nodes. Each operation in a layer receives as input the results of one or more operations of the previous layer with weights applied, performs an operation on that input, and outputs the result. The operation is defined by an activation function, examples of which include sigmoid and softmax.

The speech recognition model 210 is an artificial neural network model trained by deep learning to analyze the voice input through the microphone (MIKE) of the audio unit 13 and output the user's utterance. Using an acoustic model and a language model, the speech recognition model 210 derives the user's utterance by performing operations on the user's voice in which weights set through training are applied.

The empathy response generation model 220 is an artificial neural network model trained by deep learning to output a response corresponding to the extracted utterance in consideration of the user's state. The empathy response generation model 220 performs operations, in which weights set through training are applied, on the user's state derived by the state analysis unit 100 and the user's utterance produced by the speech recognition model 210, and thereby generates a response to the user's utterance that empathizes with the user's state.

The conversation forming unit 200 prepares a plurality of training data for training the empathy response generation model 220. The training data includes an input, which consists of a training utterance and a training state, and a label, which is a model response that answers the training utterance while empathizing with the training state. The conversation forming unit 200 inputs the training utterance and the training state to the empathy response generation model 220. When the empathy response generation model 220 calculates a response by performing operations on the training utterance and the training state in which weights that have not yet been trained are applied, the conversation forming unit 200 updates the weights of the empathy response generation model 220 so that the difference between the calculated response and the model response set as the label reaches a target minimum value. By repeating this weight update over a plurality of training data, the conversation forming unit 200 can train the empathy response generation model 220 to calculate a response to the user's utterance that empathizes with the user's state.

The conversation forming unit 200 outputs the response generated by the empathy response generation model 220 as a voice signal through the speaker (SPK) of the audio unit 13. Because the response generated by the empathy response generation model 220 is a vector value in the form of a digital signal, the conversation forming unit 200 converts it into a voice signal and outputs it through the speaker (SPK) of the audio unit 13 so that the user can perceive it.

Referring again to FIG. 3, the report unit 300 receives the user's state from the state analysis unit 100 and receives the user's utterance and the response from the conversation forming unit 200, and may then continuously store, in the storage unit 16, the user's state together with the corresponding conversation, which includes the user's utterance and the generated response. In particular, the report unit 300 transmits the user's state and the corresponding conversation to the remote medical treatment device 20 and the guardian device 30 through the communication unit 11.
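One way to picture the storing and reporting behavior is the sketch below. The record layout, the JSON encoding, the class and destination names, and the `send` callback are all hypothetical; the text only states that the state, utterance, and response are stored and transmitted to the two devices.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ConversationRecord:
    """One stored entry: the user's state plus the exchange it accompanied."""
    state: list
    utterance: str
    response: str


class ReportUnit:
    def __init__(self, send):
        self.storage = []   # stands in for the storage unit
        self.send = send    # stands in for the communication unit

    def report(self, state, utterance, response):
        record = ConversationRecord(state, utterance, response)
        self.storage.append(record)
        payload = json.dumps(asdict(record), ensure_ascii=False)
        # Hypothetical destination names for the two receiving devices.
        self.send("remote_medical_treatment_device", payload)
        self.send("guardian_device", payload)
        return payload


sent = []
unit = ReportUnit(send=lambda dest, payload: sent.append((dest, payload)))
unit.report([0.1, 0.9], "I'm worried about work.", "That sounds stressful.")
print(sent)
```

Passing the transport in as a callback keeps the sketch independent of any particular network protocol, which the text does not specify.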

Next, a method for performing a conversation with a user based on emotion analysis according to an embodiment of the present invention will be described. FIG. 6 is a flowchart illustrating this method. In FIG. 6, it is assumed that the image recognition model 110, the voice recognition model 120, the state analysis model 130, the speech recognition model 210, and the empathy response generation model 220 have all completed training (deep learning).

Referring to FIG. 6, in step S110 the camera unit 12 photographs the user in the user's predetermined space (R) to generate an image of the user and inputs the generated image to the control unit 17, and the audio unit 13 continuously receives the user's voice within the predetermined space (R) through the microphone (MIKE) and inputs the received voice to the control unit 17.

In step S120, the state analysis unit 100 of the control unit 17 derives the user's state by analyzing the input image and voice of the user. Step S120 is described in more detail below.

The state analysis unit 100 analyzes the user's state from the input image of the user through the image recognition model 110 and calculates an image recognition state vector representing the user's state. As described above, the image recognition model 110 is a trained artificial neural network (ANN), and it calculates the image recognition state vector by performing operations on the user's image in which weights set through training are applied.

In addition, the voice recognition model 120 of the state analysis unit 100 analyzes the user's state from the voice and calculates a voice recognition state vector representing the user's state. As described above, the voice recognition model 120 is a trained artificial neural network, and it calculates the voice recognition state vector by performing operations on the user's voice in which weights set through training are applied.

After the image recognition state vector and the voice recognition state vector have been calculated in this way, the state analysis model 130 of the state analysis unit 100 calculates a state vector representing the user's state from the two vectors. As described above, the state analysis model 130 is a trained artificial neural network, and it calculates the state vector by performing operations, in which weights set through training are applied, on the image recognition state vector calculated by the image recognition model 110 and the voice recognition state vector calculated by the voice recognition model 120.

Next, in step S130, the conversation forming unit 200 extracts the user's utterance from the input voice of the user through the speech recognition model 210. The speech recognition model 210 is an artificial neural network model trained by deep learning to analyze the user's voice and output the user's utterance. Using an acoustic model and a language model, the speech recognition model 210 can extract the user's utterance by performing operations on the user's voice in which weights set through training are applied.

In step S140, the conversation forming unit 200 generates, through the empathy response generation model 220, a response to the user's utterance extracted by the speech recognition model 210 that takes into account the user's state derived by the state analysis unit 100. For example, if the user's utterance is “I'm worried about XXXX that happened at work” and the user's state is a depressed emotion, the empathy response generation model 220 may empathize with the user's depressed emotion and generate a response such as “So you're worried about ~. In a case like this, how about trying OOOO?”.

Once the response has been generated, in step S150 the audio unit 13 outputs the response generated by the conversation forming unit 200 as voice through the speaker (SPK). Because the response generated by the empathy response generation model 220 of the conversation forming unit 200 is a vector value in the form of a digital signal, it is converted into a voice signal and output through the speaker (SPK) of the audio unit 13 so that the user can perceive it.

Next, in step S160, the report unit 300 receives the user's state from the state analysis unit 100, receives the user's utterance and the response from the conversation forming unit 200, and transmits the user's state and the corresponding conversation, which includes the user's utterance and the generated response, to the remote medical treatment device 20 and the guardian device 30 through the communication unit 11. The report unit 300 may also continuously store the user's state and the corresponding conversation in the storage unit 16.
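The sequence of steps S110 to S160 can be summarized as a single conversational turn. The sketch below is only an outline under stated assumptions: each stage is passed in as a callable, the function name `run_turn` is hypothetical, and the stubs used to exercise it return placeholder values rather than real model outputs.

```python
def run_turn(capture, analyze_state, recognize_speech,
             generate_response, speak, report):
    """Run one turn of the conversation flow, in the order of FIG. 6."""
    image, voice = capture()                         # S110: camera and microphone input
    state = analyze_state(image, voice)              # S120: derive the user's state
    utterance = recognize_speech(voice)              # S130: extract the utterance
    response = generate_response(utterance, state)   # S140: state-aware response
    speak(response)                                  # S150: output as voice
    report(state, utterance, response)               # S160: store and transmit
    return state, utterance, response


# Stub stages that only record the order in which they are called.
log = []


def fake_capture():
    log.append("S110")
    return "image", "voice"


def fake_analyze(image, voice):
    log.append("S120")
    return [0.1, 0.9]


def fake_recognize(voice):
    log.append("S130")
    return "hello"


def fake_generate(utterance, state):
    log.append("S140")
    return "response"


result = run_turn(
    fake_capture,
    fake_analyze,
    fake_recognize,
    fake_generate,
    speak=lambda response: log.append("S150"),
    report=lambda state, utterance, response: log.append("S160"),
)
print(log, result)
```

Because the response depends on both the utterance and the derived state, the state analysis step has to complete before response generation, which this ordering makes explicit.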

Meanwhile, the method according to the embodiment of the present invention described above may be implemented in the form of a program readable by various computer means and recorded on a computer-readable recording medium. Here, the recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software. For example, the recording medium includes magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. Such hardware devices may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

Families and guardians report difficulty providing round-the-clock care for people with mental illness who experience severe mood swings and for elderly people living alone. The present invention, however, can support continuous care through empathetic conversation with patients with mental illnesses such as panic disorder, depression, and schizophrenia, who require continuous observation and monitoring even in a home environment. Moreover, the present invention allows a guardian to be notified of a patient's emergency even from a remote location, and allows the patient's condition outside the hospital to be monitored 24 hours a day and conveyed to medical staff, so that the patient's condition can be assessed accurately.

The present invention has been described above using several preferred embodiments, but these embodiments are illustrative and not restrictive. Those of ordinary skill in the art to which the present invention pertains will understand that various changes and modifications can be made under the doctrine of equivalents without departing from the spirit of the present invention and the scope of rights set forth in the appended claims.

10: conversation device 11: communication unit
12: camera unit 13: audio unit
14: input unit 15: display unit
16: storage unit 17: control unit
20: remote medical treatment device 30: guardian device
100: state analysis unit 110: image recognition model
120: voice recognition model 130: state analysis model
200: conversation forming unit 210: speech recognition model
220: empathy response generation model 300: report unit

Claims

1. A device for performing a conversation with a user based on emotion analysis, the device comprising:
a camera unit for capturing an image;
an audio unit including a microphone for receiving voice and a speaker for outputting voice;
a state analysis unit for deriving a user's state by analyzing the image input through the camera unit and the voice input through the microphone;
a conversation forming unit for extracting the user's utterance from the user's voice and generating a response corresponding to the extracted utterance in consideration of the derived user's state; and
a report unit for controlling the generated response to be output as voice through the audio unit.

2. The device of claim 1, wherein the conversation forming unit comprises:
a speech recognition model for extracting the utterance from the user's voice; and
an empathy response generation model for generating a response that empathizes with the user's state in response to the extracted utterance.

3. The device of claim 2, wherein the conversation forming unit:
prepares training data including an input that includes a training utterance and a training state, and a label that is a model response empathizing with the training state in response to the training utterance;
inputs the training utterance and the training state into the empathy response generation model; and
when the empathy response generation model calculates a response by performing operations in which weights that have not yet been trained are applied to the training utterance and the training state, updates the weights of the empathy response generation model so that the difference from the model response set as the label reaches a target minimum value.

4. The device of claim 1, wherein the state analysis unit comprises:
an image recognition model for analyzing the user's state from the image;
a voice recognition model for analyzing the user's state from the voice; and
a state analysis model for deriving the user's state by combining the analysis result of the image recognition model and the analysis result of the voice recognition model.

5. A method for performing a conversation with a user based on emotion analysis, the method comprising:
receiving, by a state analysis unit, an image and a voice of the user, and deriving a state of the user by analyzing the input image and voice;
extracting, by a conversation forming unit, the user's utterance from the voice and generating a response to the extracted utterance in consideration of the derived user's state; and
outputting, by an audio unit, the generated response as voice.

6. The method of claim 5, wherein the generating of the response comprises:
extracting, by the conversation forming unit, the user's utterance from the voice through a speech recognition model; and
generating, by the conversation forming unit, a response to the extracted utterance that empathizes with the user's state through an empathy response generation model.

7. The method of claim 5, further comprising, before the deriving of the user's state:
preparing, by the conversation forming unit, training data including an input that includes a training utterance and a training state, and a label that is a model response empathizing with the training state in response to the training utterance;
inputting, by the conversation forming unit, the training utterance and the training state into the empathy response generation model;
calculating, by the empathy response generation model, a response by performing operations in which weights that have not yet been trained are applied to the training utterance and the training state; and
updating, by the conversation forming unit, the weights of the empathy response generation model so that the difference from the model response set as the label reaches a target minimum value.

8. The method of claim 5, wherein the deriving of the user's state comprises:
calculating, by an image recognition model of the state analysis unit, an image recognition state vector representing the user's state by analyzing the user's state from the image;
calculating, by a voice recognition model of the state analysis unit, a voice recognition state vector representing the user's state by analyzing the user's state from the voice; and
calculating, by a state analysis model of the state analysis unit, a state vector representing the user's state from the image recognition state vector and the voice recognition state vector.