KR20200092166A

KR20200092166A - Server, method and computer program for recognizing emotion

Info

Publication number: KR20200092166A
Application number: KR1020190009370A
Authority: KR
Inventors: 류휘정; 임지희; 장두성
Original assignee: 주식회사 케이티
Priority date: 2019-01-24
Filing date: 2019-01-24
Publication date: 2020-08-03

Abstract

The present invention relates to a server, a method, and a computer program for recognizing the emotion of a user based on speech by the user while the user is performing an interaction through speech or a video. The server for recognizing the emotion of a user comprises: a receiving part for receiving input data including speech information and utterance moment information and user identification information from a user terminal; an attribute information extraction part for extracting user attribute information based on the received user identification information; a selection part for selecting at least one emotion recognition model among a plurality of pre-registered emotion recognition models based on the speech information, the utterance moment information, and the user attribute information; and an emotion recognition part that recognizes the emotion of the user by analyzing the speech information by the selected emotion recognition model.

Description

Server, method and computer program for recognizing emotion

The present invention relates to a server, a method, and a computer program for recognizing emotion.

Intelligent personal assistants, which collect and provide customized information to users based on an artificial intelligence engine and speech recognition and perform various functions in response to voice commands, are being offered on a variety of devices.

An intelligent personal assistant may not only carry out a user's voice commands but also recognize the user's voice, analyze the user's emotion, and recommend suitable content to the user based on the analyzed emotion.

Regarding technology for recognizing a user's emotion in an intelligent personal assistant, Korean Patent Registration No. 10-1398218, a prior art document, discloses an apparatus and method for emotional speech recognition.

Conventional emotion recognition techniques recognize a user's emotion from the user's voice data based on the loudness, pitch, and speaking rate of the voice. However, depending on personality, a user may express anger in a calm voice, so recognizing the user's emotion from voice data alone has the drawback of somewhat lower accuracy.

An object of the present invention is to provide a server, a method, and a program for recognizing a user's emotion based on speech uttered by the user while the user is engaged in an interaction such as a voice or video interaction.

Another object is to provide a server, a method, and a program for recognizing the user's emotion based on composite context information that includes, in addition to the speech uttered by the user, video of the user's face, conversation content, user information, and past service usage history.

A further object is to provide a server, a method, and a program for recognizing the user's emotion so that a counterpart interacting with the user can recognize that emotion and respond to it in advance.

However, the technical problems to be solved by the present embodiment are not limited to those described above, and other technical problems may exist.

As a means for achieving the above technical problems, an embodiment of the present invention may provide an emotion recognition server comprising: a receiving unit that receives, from a user terminal, input data including voice information and utterance moment information, together with identification information of a user; an attribute information extraction unit that extracts user attribute information based on the received identification information of the user; a selection unit that selects at least one emotion recognition model from among a plurality of pre-registered emotion recognition models based on the voice information, the utterance moment information, and the user attribute information; and an emotion recognition unit that recognizes the user's emotion by analyzing the voice information with the selected emotion recognition model.

Another embodiment of the present invention may provide an emotion recognition method comprising: receiving, from a user terminal, input data including voice information and utterance moment information, together with identification information of a user; extracting user attribute information based on the received identification information of the user; selecting at least one emotion recognition model from among a plurality of pre-registered emotion recognition models based on the voice information, the utterance moment information, and the user attribute information; and recognizing the user's emotion by analyzing the voice information with the selected emotion recognition model.

Yet another embodiment of the present invention may provide a computer program stored in a medium, the computer program including a sequence of instructions that, when executed by a computing device, cause the computing device to: receive, from a user terminal, input data including voice information and utterance moment information, together with identification information of a user; extract user attribute information based on the received identification information of the user; select at least one emotion recognition model from among a plurality of pre-registered emotion recognition models based on the voice information, the utterance moment information, and the user attribute information; and recognize the user's emotion by analyzing the voice information with the selected emotion recognition model.

The above means for solving the problems are merely exemplary and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, there may be additional embodiments described in the drawings and the detailed description of the invention.

According to any one of the above means for solving the problems, it is possible to provide a server, a method, and a program for recognizing a user's emotion based on speech uttered by the user while the user is engaged in an interaction such as a voice or video interaction.

It is also possible to provide a server, a method, and a program for recognizing the user's emotion based on composite context information that includes, in addition to the speech uttered by the user, video of the user's face, conversation content, user information, and past service usage history.

It is further possible to provide a server, a method, and a program for recognizing the user's emotion so that a counterpart interacting with the user can recognize that emotion and respond to it in advance.

FIG. 1 is a block diagram of an emotion recognition system according to an embodiment of the present invention.
FIG. 2 is a block diagram of an emotion recognition server according to an embodiment of the present invention.
FIGS. 3A and 3B are exemplary views illustrating a process of recognizing a user's emotion according to an embodiment of the present invention.
FIG. 4 is a flowchart of a method for recognizing emotion in an emotion recognition server according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. However, the present invention may be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the present invention clearly, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only being "directly connected" but also being "electrically connected" with another element in between. Also, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them, and it should be understood that the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof is not excluded in advance.

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by a single piece of hardware.

Some of the operations or functions described in this specification as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of an emotion recognition system according to an embodiment of the present invention. Referring to FIG. 1, the emotion recognition system 1 may include a user terminal 110, an emotion recognition server 120, and another user terminal 130. The user terminal 110, the emotion recognition server 120, and the other user terminal 130 are illustrative examples of components that can be controlled by the emotion recognition system 1.

The components of the emotion recognition system 1 of FIG. 1 are generally connected through a network. For example, as illustrated in FIG. 1, the emotion recognition server 120 may be connected to the user terminal 110 or the other user terminal 130 simultaneously or at time intervals.

A network refers to a connection structure in which nodes such as terminals and servers can exchange information with one another, and includes a local area network (LAN), a wide area network (WAN), the Internet (WWW: World Wide Web), wired and wireless data communication networks, telephone networks, and wired and wireless television communication networks. Examples of wireless data communication networks include, but are not limited to, 3G, 4G, 5G, 3GPP (3rd Generation Partnership Project), LTE (Long Term Evolution), WiMAX (Worldwide Interoperability for Microwave Access), Wi-Fi, Bluetooth communication, infrared communication, ultrasonic communication, visible light communication (VLC), and LiFi.

The user terminal 110 may carry out a call or other interaction, such as a voice or video call, with the other user terminal 130. For example, the user terminal 110 may receive speech uttered by the user 100 through a microphone and transmit the received speech to the emotion recognition server 120.

The user terminal 110 may transmit additional information generated during a video call to the emotion recognition server 120. Here, the additional information may be utterance moment information related to the moment at which the user 100 speaks. For example, the user terminal 110 may transmit an image of the user 100 captured by a camera to the emotion recognition server 120. As another example, the user terminal 110 may transmit the location of the user 100, measured by GPS, to the emotion recognition server 120.

The user terminal 110 may transmit identification information of the user 100 to the emotion recognition server 120. The identification information of the user 100 may include the line number of the user terminal 110, user account information (ID/PW), and the like.

The emotion recognition server 120 may classify conversation history data collected from a plurality of user terminals according to per-user attribute information and train individual learning models based on the classified attribute information.

The emotion recognition server 120 may receive, from the user terminal 110, input data including voice information and utterance moment information, together with identification information of the user 100. The emotion recognition server 120 may convert the voice information received from the user terminal 110 into text through STT (Speech-To-Text) conversion.
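As an illustration only, the input received from the user terminal could be modeled as follows. This is a minimal sketch: the class, field, and function names (`EmotionRequest`, `UtteranceMomentInfo`, `available_modalities`, and so on) are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class UtteranceMomentInfo:
    """Hypothetical container for data captured at the moment of speech."""
    image: Optional[bytes] = None                   # camera image of the user, if any
    location: Optional[Tuple[float, float]] = None  # (latitude, longitude), if any


@dataclass
class EmotionRequest:
    """Hypothetical shape of one message from the user terminal."""
    user_id: str   # identification information, e.g. line number or account ID
    voice: bytes   # voice information (raw audio payload)
    moment: UtteranceMomentInfo = field(default_factory=UtteranceMomentInfo)


def available_modalities(request: EmotionRequest) -> List[str]:
    """List which kinds of input are present in the request."""
    modalities = ["voice"]
    if request.moment.image is not None:
        modalities.append("image")
    if request.moment.location is not None:
        modalities.append("location")
    return modalities
```

Making the image and location optional reflects that the utterance moment information may carry either or both.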

The emotion recognition server 120 may extract user attribute information based on the received identification information of the user 100. Here, the user attribute information may include the gender, age, and region of residence of the user 100, past conversation history, past service usage history, preferred content service usage history, service usage time distribution information, and the like. The emotion of the user 100 can be recognized even when no user attribute information is available.

The emotion recognition server 120 may extract, from the utterance moment information, at least one of image information and location information for the moment at which the user 100 speaks. For example, the emotion recognition server 120 may analyze the face of the user 100 based on the extracted image information and select at least one emotion recognition model from among the plurality of pre-registered emotion recognition models based on the analyzed face information. As another example, the emotion recognition server 120 may identify the place where the user 100 is located based on the extracted location information and select at least one emotion recognition model from among the plurality of pre-registered emotion recognition models based on the identified place.

The emotion recognition server 120 may select at least one emotion recognition model from among the plurality of pre-registered emotion recognition models based on the voice information, the utterance moment information, and the user attribute information. Here, the pre-registered emotion recognition models may include gender-specific, age-specific, region-specific, and service-user-specific emotion recognition models.

The emotion recognition server 120 may recognize the emotion of the user 100 by analyzing the voice information with the selected emotion recognition model. In doing so, the emotion recognition server 120 may derive instantaneous emotion data (emotion data for a single instance) and may also derive continuous emotion data (a series of emotion data covering multiple instances).
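The difference between instantaneous and continuous emotion data can be sketched as follows. This is an illustration under stated assumptions: the model is treated as a plain callable from an audio segment to a label, and `toy_model` is a stand-in invented for the example, not an actual emotion recognition model.

```python
from typing import Callable, List, Sequence

# Assumption: a model is any callable mapping one audio segment to an emotion label.
Model = Callable[[bytes], str]


def instantaneous_emotion(model: Model, segment: bytes) -> str:
    """Emotion data for a single utterance segment."""
    return model(segment)


def continuous_emotion(model: Model, segments: Sequence[bytes]) -> List[str]:
    """A series of emotion data, one label per consecutive segment."""
    return [model(segment) for segment in segments]


def toy_model(segment: bytes) -> str:
    """Stand-in model for the example: long segments are labeled 'angry'."""
    return "angry" if len(segment) > 3 else "neutral"
```

For instance, `continuous_emotion(toy_model, [b"ab", b"abcdef"])` yields one label per segment in order, whereas `instantaneous_emotion` yields a single label.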

The emotion recognition server 120 may transmit the recognized emotion of the user 100 to the other user terminal 130.

The emotion recognition server 120 may be run by a computer program stored in a medium and including a sequence of instructions for recognizing emotion. The computer program may include a sequence of instructions that, when executed by a computing device, cause the computing device to receive, from the user terminal 110, input data including voice information and utterance moment information, together with identification information of the user 100; extract user attribute information based on the received identification information; select at least one emotion recognition model from among the plurality of pre-registered emotion recognition models based on the voice information, the utterance moment information, and the user attribute information; and recognize the emotion of the user 100 by analyzing the voice information with the selected emotion recognition model.

The other user terminal 130 may carry out a call or other interaction, such as a voice or video call, with the user terminal 110.

While carrying out a voice call or video call with the user terminal 110, the other user terminal 130 may receive the recognized emotion of the user 100 from the emotion recognition server 120.

FIG. 2 is a block diagram of an emotion recognition server according to an embodiment of the present invention. Referring to FIG. 2, the emotion recognition server 120 may include a learning unit 210, a receiving unit 220, an attribute information extraction unit 230, an utterance moment information extraction unit 240, a face analysis unit 250, a place identification unit 260, a selection unit 270, an emotion recognition unit 280, and a transmission unit 290.

The learning unit 210 may classify conversation history data collected from a plurality of user terminals according to per-user attribute information and train individual learning models based on the classified attribute information.

The learning unit 210 may train the individual learning models using a neural network learning technique. The learning method is not limited to neural networks, and any commonly practiced machine learning method may be used. When the learning unit 210 additionally incorporates user information into training, the accuracy of emotion recognition can be improved.

Rather than relying on a single learning result, the learning unit 210 may classify the collected training data according to user information and then train a separate individual learning model on each class. The individual learning models may include, for example, gender-specific, age-specific, region-specific, and service-user-specific emotion recognition models.

For example, the learning unit 210 may split training data in which male and female speakers are mixed into male training data and female training data and then train individual learning models, such as a male emotion recognition model and a female emotion recognition model. When the emotion of the user 100 is to be recognized, the female emotion recognition model may be used if the user is female, and the male emotion recognition model may be used if the user is male. This has the advantage of higher accuracy than a single model trained on mixed male and female data.
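The partition-then-train step described above can be sketched as follows. This is a minimal illustration, not the specification's implementation: the sample layout (a dict with an `attributes` entry), the function names, and the `train_fn` callback are all assumptions, and the actual training procedure is left abstract.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List

Sample = Dict[str, Any]  # assumed layout: {"attributes": {...}, "audio": ..., "label": ...}


def partition_by_attribute(samples: Iterable[Sample], attribute: str) -> Dict[Any, List[Sample]]:
    """Group training samples by the value of one user attribute (e.g. 'gender')."""
    groups: Dict[Any, List[Sample]] = defaultdict(list)
    for sample in samples:
        value = sample["attributes"].get(attribute)
        if value is not None:  # samples lacking the attribute are skipped
            groups[value].append(sample)
    return dict(groups)


def train_individual_models(
    samples: Iterable[Sample],
    attribute: str,
    train_fn: Callable[[List[Sample]], Any],
) -> Dict[Any, Any]:
    """Train one model per attribute value; train_fn stands in for real training."""
    groups = partition_by_attribute(samples, attribute)
    return {value: train_fn(group) for value, group in groups.items()}
```

Passing `attribute="gender"` would yield one model per gender value, and `attribute="region"` one model per region, matching the individual learning models listed above.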

As another example, the learning unit 210 may classify the training data by region according to user information and then train individual learning models such as region-specific emotion recognition models. For example, when recognizing the emotion of a user 100 who speaks a dialect, using three regional models, such as a Gyeongsang-do regional emotion recognition model, a Seoul regional emotion recognition model, and a Jeolla-do regional emotion recognition model, can yield higher accuracy than using a nationwide emotion recognition model. Using multiple emotion recognition models in this way additionally requires information on the region where the user 100 resides.

The reason for training region-specific emotion recognition models as individual learning models is that each user 100 is likely to use a manner of speaking, style of expression, intonation, and vocabulary that reflect regional character. Analyzing the emotion of the user 100 with the matching regional emotion recognition model therefore gives higher accuracy than analyzing it with another emotion recognition model.

Specifically, when the user 100 resides in Seoul, the learning unit 210 may tag the emotion for each segment of the user's voice information in which emotion is expressed and train the Seoul regional emotion recognition model on the tagged data. An emotion recognition model may be trained for each region in a similar manner. If the region of residence of the user 100 cannot be obtained, a learning model built for the whole population and trained on all of the constructed training data may be used. This nationwide model is somewhat less accurate than the regional emotion recognition models, but it gives the best available result when the user's residence information is not provided.
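Selection with a nationwide fallback could look like the sketch below. It is an assumption-laden illustration: the registry keyed by `(attribute name, attribute value)` pairs, the `FALLBACK` key, and the function name `select_models` are hypothetical and are not taken from the specification.

```python
from typing import List, Mapping, Optional, Tuple

ModelKey = Tuple[str, str]  # (attribute name, attribute value), e.g. ("region", "Seoul")

# Hypothetical key for the model trained on all of the training data.
FALLBACK: ModelKey = ("all", "nationwide")


def select_models(
    registry: Mapping[ModelKey, object],
    attributes: Mapping[str, Optional[str]],
) -> List[ModelKey]:
    """Return keys of registered models matching the known user attributes.

    If no attribute is known, or none has a registered model, the nationwide
    model is selected so that at least one model is always returned.
    """
    selected = [
        (name, value)
        for name, value in attributes.items()
        if value is not None and (name, value) in registry
    ]
    return selected or [FALLBACK]
```

With a registry containing Seoul, Gyeongsang-do, and Jeolla-do regional models plus the nationwide model, a user whose region is unknown would be routed to the nationwide model.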

The receiving unit 220 may receive input data including voice information and utterance moment information from the user terminal 110. The voice information may be obtained by extracting the voice portion from received packets. The voice information and utterance moment information received through the receiving unit 220 may be time-stamped in order of input for each user and stored in a database by a storage unit (not shown). The voice information may be converted into text-based conversation information through STT (Speech-To-Text) conversion by a conversion unit (not shown).

For resource management, the storage unit (not shown) may store only the input data falling within a certain period of time rather than all of the input data. That is, of all the input data, only the input data within a specific time window may be used for recognizing the emotion of the user 100. The voice information of the user 100 and the utterance moment information, which includes image information and location information, may be data collected at exactly the same point in time.
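One way to keep only the input data within a certain time window is sketched below. The class name `WindowedStore`, the use of timestamps in seconds, and the eviction-on-insert policy are assumptions made for the example; the specification does not state how the window is enforced.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class WindowedStore:
    """Keeps only entries whose timestamp lies within `window_seconds` of the newest one."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._items: Deque[Tuple[float, Any]] = deque()  # (timestamp, payload), oldest first

    def add(self, timestamp: float, payload: Any) -> None:
        """Store a new entry (timestamps assumed non-decreasing) and drop stale ones."""
        self._items.append((timestamp, payload))
        while self._items and timestamp - self._items[0][0] > self.window_seconds:
            self._items.popleft()

    def recent(self) -> List[Any]:
        """Payloads currently inside the window, oldest first."""
        return [payload for _, payload in self._items]
```

A deque is used so that dropping the oldest entries is cheap; a per-user store could hold one such window for each user.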

When the emotion of the user 100 needs to be checked continuously, as in a voice call with a call center agent, the input data stored in the storage unit (not shown) may be analyzed for the user's emotion whenever new utterance moment information is acquired. In contrast, when the speaker's emotion needs to be determined each time a conversation ends, as in a video call, the data may be analyzed whenever the conversation information is updated.

To use resources efficiently, the storage unit (not shown) may delete from the database the stored input data for which emotion recognition has been discontinued. The storage unit (not shown) may also delete unnecessary data, such as past data older than a certain time, from the database.

When the emotion recognition unit 280 completes recognition of the emotion of the user 100, the storage unit (not shown) may store and manage all related data, including the user information, voice information, utterance moment information, and recognized emotion, by time point and by user 100, so that it can later serve as data for training new models. The emotion recognition results in this data may later be corrected by an administrator, and the corrected data may be used to train the emotion recognition models. In this way, the emotion recognition function can be continuously improved, or new emotion recognition models can be trained.

The reception unit 220 may receive identification information of the user 100. The identification information may include a line number that identifies the user 100, user account information, and the like.

The attribute information extraction unit 230 may extract user attribute information based on the received identification information of the user 100. Here, the user attribute information may consist of user profile information, including the gender, age, and residence area of the user 100, and information related to the services the user 100 has used, including past conversation history, past service usage history, preferred content service usage history, and service usage time distribution information. The user attribute information is information related to the user 100; it may be pre-stored in a database, or may be extracted from the utterance moment information based on the identification information of the user 100, but is not limited thereto. Even when no user attribute information is available, the emotion of the user 100 may still be recognized using the voice information, image information, and the like.
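By way of example only, looking up pre-stored attribute information from identification information might be sketched as below. The dictionary `ATTRIBUTE_DB`, its keys, the sample line number, and the function `extract_user_attributes` are assumptions made for illustration; the specification does not describe the database layout.

```python
from typing import Any, Dict

# Hypothetical in-memory stand-in for the database of pre-stored
# user attribute information, keyed by line number or account ID.
ATTRIBUTE_DB: Dict[str, Dict[str, Any]] = {
    "010-0000-0000": {
        "gender": "female",
        "age": 27,
        "residence_area": "Seoul",
        "preferred_content": ["music"],
    },
}


def extract_user_attributes(identification: str) -> Dict[str, Any]:
    """Return a copy of the stored attributes for the given identifier.

    An empty dict is returned for unknown users, reflecting the case
    in which no user attribute information is available and emotion
    is recognized from voice or image information alone.
    """
    return dict(ATTRIBUTE_DB.get(identification, {}))
```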

The utterance moment information extraction unit 240 may extract, from the utterance moment information, at least one of image information and location information for the moment at which the user 100 spoke. Here, the image information may include both video in which the face is captured for several minutes or so, as in a video call, and partial face images detected momentarily.

The face analysis unit 250 may analyze the face of the user 100 based on the extracted image information. For example, when the user attribute information does not include gender or age, the gender, age, and the like of the user may be derived from the face information analyzed by the face analysis unit 250. The face analysis unit 250 may also analyze the facial expression of the user 100 based on the facial muscles around the eyes, mouth, and the like of the user 100.
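The fallback described above, in which gender or age missing from the user attribute information is derived from the analyzed face, could be sketched as follows. The function `fill_missing_attributes` and the dictionary shapes are hypothetical; the face analysis itself is outside the scope of this sketch and is represented only by its assumed output.

```python
from typing import Any, Dict

FACE_DERIVABLE_KEYS = ("gender", "age")


def fill_missing_attributes(attributes: Dict[str, Any],
                            face_estimate: Dict[str, Any]) -> Dict[str, Any]:
    """Fill gender and age from a face-analysis estimate when absent.

    Values already present in the stored attributes take precedence;
    the face estimate is used only for keys that are missing or None.
    """
    merged = dict(attributes)
    for key in FACE_DERIVABLE_KEYS:
        if merged.get(key) is None and face_estimate.get(key) is not None:
            merged[key] = face_estimate[key]
    return merged
```

Giving stored attributes precedence over the face estimate is a design choice of this sketch, not a requirement stated in the specification.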

The place identification unit 260 may identify the place where the user 100 is located based on the extracted location information. For example, the place identification unit 260 may identify the region in which the user 100 is located.

The selection unit 270 may select at least one emotion recognition model from among a plurality of pre-registered emotion recognition models based on the voice information, the utterance moment information, and the user attribute information. Here, the pre-registered emotion recognition models may include gender-specific emotion recognition models, age-specific emotion recognition models, region-specific emotion recognition models, emotion recognition models specific to users of a particular service, and the like.

The selection unit 270 may select at least one emotion recognition model from among the plurality of pre-registered emotion recognition models based on the analyzed face of the user 100. For example, the selection unit 270 may select the female emotion recognition model from among the pre-registered emotion recognition models based on the face of the user 100.

The selection unit 270 may select at least one emotion recognition model from among the plurality of pre-registered emotion recognition models based on the identified place where the user 100 is located. For example, when the place where the user 100 is located is 'Seoul', the selection unit 270 may select the Seoul region emotion recognition model from among the pre-registered emotion recognition models.

For example, when the user 100 is 'a man in his 20s living in Seoul', the selection unit 270 may select any one of the male emotion recognition model, the Seoul region emotion recognition model, and the 20s emotion recognition model from among the pre-registered models, or may select a combination of them.
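As one possible illustration of such a selection, the sketch below maps the available attributes to entries in a registry of pre-registered models and returns every match, leaving the choice of a single model or a combination to the caller. The registry `REGISTERED_MODELS`, the model names, and the functions `age_group` and `select_models` are hypothetical and are not taken from the specification.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical registry of pre-registered models, keyed by
# (attribute kind, attribute value). Values are placeholder names.
REGISTERED_MODELS: Dict[Tuple[str, str], str] = {
    ("gender", "male"): "male_model",
    ("gender", "female"): "female_model",
    ("region", "Seoul"): "seoul_model",
    ("region", "Jeolla-do"): "jeolla_model",
    ("age_group", "20s"): "twenties_model",
}


def age_group(age: int) -> str:
    """Map an age in years to a decade label such as '20s'."""
    return f"{(age // 10) * 10}s"


def select_models(gender: Optional[str] = None,
                  age: Optional[int] = None,
                  region: Optional[str] = None) -> List[str]:
    """Return every registered model matching the known attributes."""
    keys: List[Tuple[str, str]] = []
    if gender is not None:
        keys.append(("gender", gender))
    if region is not None:
        keys.append(("region", region))
    if age is not None:
        keys.append(("age_group", age_group(age)))
    return [REGISTERED_MODELS[k] for k in keys if k in REGISTERED_MODELS]
```

For a man in his 20s living in Seoul, `select_models("male", 25, "Seoul")` returns the male, Seoul region, and 20s entries of this assumed registry; how a combination of models would be fused is not specified in the description and is left out here.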

The emotion recognition unit 280 may recognize the emotion of the user 100 by analyzing the voice information with the selected emotion recognition model. Here, the recognized emotion is one of a set of emotion types that may number from as few as two to as many as several dozen. The emotion recognition unit 280 may derive a single emotion by analyzing the voice information, or may analyze the voice information in units of a few seconds to derive a near-real-time sequence of emotions. The emotions are predefined by a dictionary system (not shown) and may be subdivided into anywhere from two to several dozen types. For example, a set of two emotions, neutral and anger, may be subdivided into a set of six emotions: neutral, anger, sadness, surprise, fear, and joy.
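The analysis in units of a few seconds described above could, purely as a sketch, be realized by cutting the audio into fixed-length windows and classifying each window. The function `emotion_sequence`, the window length, and the energy-based `toy_classifier` are assumptions made for illustration; the toy classifier merely stands in for a trained emotion recognition model and is not the recognition technique of the invention.

```python
from typing import Callable, List, Sequence


def emotion_sequence(samples: Sequence[float], sample_rate: int,
                     window_seconds: float,
                     classify: Callable[[Sequence[float]], str]) -> List[str]:
    """Classify consecutive fixed-length windows of audio samples.

    The last window may be shorter than the others.
    """
    window = int(sample_rate * window_seconds)
    if window <= 0:
        raise ValueError("window must cover at least one sample")
    return [classify(samples[i:i + window])
            for i in range(0, len(samples), window)]


def toy_classifier(chunk: Sequence[float]) -> str:
    """Placeholder model: label loud windows 'angry', others 'neutral'."""
    energy = sum(x * x for x in chunk) / len(chunk)
    return "angry" if energy > 0.25 else "neutral"
```

Calling `emotion_sequence` with a window covering the entire input yields a single emotion, while a short window yields a sequence of emotions over time.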

Specifically, the emotion recognition unit 280 may derive instantaneous emotion data or continuous emotion data by analyzing the voice information with the emotion recognition model. For example, when the emotion of the user 100 is recognized using only the voice information of the user 100, the emotion recognition unit 280 may analyze the voice information with the emotion recognition model and derive instantaneous emotion data such as 'angry'. In contrast, when the emotion of the user 100 is recognized using the voice information together with image information of the face of the user 100 from the utterance moment information, the emotion recognition unit 280 may analyze the voice information with the emotion recognition model and derive continuous emotion data such as 'calm-angry-furious'.

The transmission unit 290 may transmit the recognized emotion of the user 100 to another user terminal 130. For example, when the recognized emotion information of the user 100 is transmitted to the other user terminal 130, the other user terminal 130 may provide the user 100 with a response appropriate to the recognized emotion. Here, the emotion recognition result may be the finally determined instantaneous emotion data, but it may also be transmitted to the other user terminal 130 in various other forms, such as a probability for each emotion or pairs of an emotion and the confidence for that emotion.
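The result forms mentioned above (a final label, per-emotion probabilities, and emotion-confidence pairs) can all be derived from one probability table, as the following sketch suggests. The function names `final_label` and `confidence_pairs` and the sample values are hypothetical; the specification does not prescribe a payload format.

```python
from typing import Dict, List, Tuple


def final_label(probabilities: Dict[str, float]) -> str:
    """Return the single emotion with the highest probability."""
    return max(probabilities, key=lambda emotion: probabilities[emotion])


def confidence_pairs(probabilities: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (emotion, confidence) pairs, most confident first."""
    return sorted(probabilities.items(),
                  key=lambda item: item[1], reverse=True)


# Illustrative values only; not measured results.
example = {"neutral": 0.2, "angry": 0.7, "sad": 0.1}
```

With the illustrative table `example`, `final_label(example)` gives the finally determined emotion, `example` itself is the per-emotion probability form, and `confidence_pairs(example)` gives the pair form.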

FIGS. 3A and 3B are exemplary diagrams for explaining a process of recognizing a user's emotion according to an embodiment of the present invention.

FIG. 3A is an exemplary diagram for explaining a process of recognizing a user's emotion through a voice call according to an embodiment of the present invention. Referring to FIG. 3A, the user 100 may conduct a voice call with a call center agent 300. Here, the other user terminal 130 may be the terminal of the call center agent 300.

The emotion recognition server 120 may receive input data including voice information and utterance moment information from the user terminal 110. For example, the voice information may include the voice of the user 100 saying, "The item is damaged." The utterance moment information may include the current location of the user 100.

The emotion recognition server 120 may receive user identification information from the user terminal 110. For example, the emotion recognition server 120 may receive the line number of the user terminal 110, or a user account from the user terminal 110.

The emotion recognition server 120 may extract user attribute information based on the identification information of the user 100. For example, the emotion recognition server 120 may determine from the user attribute information that the current residence of the user 100 is the 'Jeolla-do region'.

The emotion recognition server 120 may use the region-specific emotion recognition models among the plurality of pre-registered emotion recognition models. In this case, the emotion recognition server 120 may select, from the region-specific emotion recognition models, the Jeolla-do region emotion recognition model corresponding to the user's residence, analyze the voice information with the selected Jeolla-do region emotion recognition model to derive instantaneous emotion data, and recognize the emotion of the user 100 as 'dissatisfaction'.

The emotion recognition server 120 may transmit the recognized emotion of the user 100 to the terminal 130 of the call center agent 300.

Based on the recognized emotion of the user 100, the call center agent 300 may provide the user 100 with a response suited to a situation of dissatisfaction.

FIG. 3B is an exemplary diagram for explaining a process of recognizing a user's emotion through a video call according to an embodiment of the present invention. Referring to FIG. 3B, the user 100 may conduct a video call with a counterpart 310. Here, the other user terminal 130 may be the terminal of the counterpart 310 in the video call.

The emotion recognition server 120 may receive input data including voice information and utterance moment information from the user terminal 110. For example, the voice information may include the voice of the user 100 saying, "I had so much fun today. I want to go there again." The utterance moment information may include video of the face of the user 100 and the current location.

The emotion recognition server 120 may receive user identification information from the user terminal 110. For example, the emotion recognition server 120 may receive the line number of the user terminal 110.

The emotion recognition server 120 may extract user attribute information based on the identification information of the user 100. For example, the emotion recognition server 120 may determine from the user attribute information that the gender of the user 100 is 'female'.

The emotion recognition server 120 may use the gender-specific emotion recognition models among the plurality of pre-registered emotion recognition models. In this case, the emotion recognition server 120 may select, from the gender-specific emotion recognition models, the female emotion recognition model corresponding to the user's gender, analyze the voice information and the image information with the selected female learning model to derive continuous emotion data, and recognize the emotion of the user 100 as 'happy-excited'.

The emotion recognition server 120 may transmit the recognized emotion of the user 100 to the terminal 130 of the counterpart 310.

FIG. 4 is a flowchart of a method for recognizing emotion in an emotion recognition server according to an embodiment of the present invention. The method for recognizing emotion in the emotion recognition server 120 illustrated in FIG. 4 includes steps that are processed in time series according to the embodiments illustrated in FIGS. 1 to 3B. Therefore, even where details are omitted below, the description given for the embodiments illustrated in FIGS. 1 to 3B also applies to the method for recognizing emotion in the emotion recognition server 120.

In step S410, the emotion recognition server 120 may receive, from the user terminal 110, input data including voice information and utterance moment information, together with identification information of the user 100.

In step S420, the emotion recognition server 120 may extract user attribute information based on the received identification information of the user 100.

In step S430, the emotion recognition server 120 may select at least one emotion recognition model from among a plurality of pre-registered emotion recognition models based on the voice information, the utterance moment information, and the user attribute information.

In step S440, the emotion recognition server 120 may recognize the emotion of the user 100 by analyzing the voice information with the selected emotion recognition model.

In the above description, steps S410 to S440 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.
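Steps S410 to S440 can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the type `InputData`, the tables `ATTRIBUTES` and `MODELS`, the placeholder models that return fixed labels, and the function `recognize_emotion` are all hypothetical, and only a region-based selection is shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class InputData:
    """Hypothetical container for the data received in step S410."""
    voice: bytes
    utterance_moment: Dict[str, str] = field(default_factory=dict)


# Hypothetical stand-ins for the attribute database and model registry.
ATTRIBUTES: Dict[str, Dict[str, str]] = {
    "user-1": {"residence_area": "Jeolla-do"},
}
MODELS: Dict[str, Callable[[bytes], str]] = {
    "Jeolla-do": lambda voice: "dissatisfaction",  # placeholder model
    "default": lambda voice: "neutral",            # placeholder model
}


def recognize_emotion(data: InputData, identification: str) -> str:
    # S410: input data and identification arrive as arguments.
    # S420: extract user attribute information from the identification.
    attributes = ATTRIBUTES.get(identification, {})
    # S430: select a model from the attributes, falling back to the
    # location carried in the utterance moment information.
    region: Optional[str] = (attributes.get("residence_area")
                             or data.utterance_moment.get("location"))
    model = MODELS["default"] if region is None else MODELS.get(region, MODELS["default"])
    # S440: analyze the voice information with the selected model.
    return model(data.voice)
```

Falling back to a default model when no region is known mirrors the statement that emotion may be recognized even without user attribute information; the specific fallback order is an assumption of this sketch.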

The method for recognizing emotion in the emotion recognition server described with reference to FIGS. 1 to 4 may also be implemented in the form of a computer program stored in a medium and executed by a computer, or in the form of a recording medium containing instructions executable by a computer. In addition, the method for recognizing emotion in the emotion recognition server described with reference to FIGS. 1 to 4 may also be implemented in the form of a computer program stored in a medium and executed by a computer.

A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. The computer-readable medium may also include computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The foregoing description of the present invention is given by way of illustration, and a person of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing the technical spirit or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is defined by the claims that follow rather than by the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

110: user terminal
120: emotion recognition server
130: other user terminal
210: learning unit
220: reception unit
230: attribute information extraction unit
240: utterance moment information extraction unit
250: facial expression analysis unit
260: place identification unit
270: selection unit
280: emotion recognition unit
290: transmission unit

Claims

An emotion recognition server, comprising:
a reception unit that receives, from a user terminal, input data including voice information and utterance moment information, and identification information of a user;
an attribute information extraction unit that extracts user attribute information based on the received identification information of the user;
a selection unit that selects at least one emotion recognition model from among a plurality of pre-registered emotion recognition models based on the voice information, the utterance moment information, and the user attribute information; and
an emotion recognition unit that recognizes the user's emotion by analyzing the voice information with the selected emotion recognition model.

The emotion recognition server of claim 1,
wherein the user attribute information includes at least one of the user's gender, age, residence area, past conversation history, preferred content service usage history, and service usage time distribution information.

The emotion recognition server of claim 1,
further comprising an utterance moment information extraction unit that extracts, from the utterance moment information, at least one of image information and location information for the moment of the user's utterance.

The emotion recognition server of claim 3,
further comprising a face analysis unit that analyzes the user's face based on the extracted image information,
wherein the selection unit selects at least one emotion recognition model from among the plurality of pre-registered emotion recognition models based on the analyzed face information of the user, and
the emotion recognition unit analyzes the voice information with the selected emotion recognition model and recognizes the user's emotion based on the analyzed voice information.

The emotion recognition server of claim 3,
further comprising a place identification unit that identifies the place where the user is located based on the extracted location information,
wherein the selection unit selects at least one emotion recognition model from among the plurality of pre-registered emotion recognition models based on the identified place where the user is located, and
the emotion recognition unit recognizes the user's emotion by analyzing the voice information with the selected emotion recognition model.

The emotion recognition server of claim 1,
wherein the plurality of pre-registered emotion recognition models include at least one of a gender-specific emotion recognition model, an age-specific emotion recognition model, a region-specific emotion recognition model, and an emotion recognition model specific to users of a particular service.

The emotion recognition server of claim 1,
further comprising a transmission unit that transmits the recognized emotion of the user to another user terminal.

The emotion recognition server of claim 1,
further comprising a learning unit that classifies conversation history data collected from a plurality of user terminals according to user-specific attribute information, and trains individual learning models based on the classified user-specific attribute information.

The emotion recognition server of claim 1,
wherein the emotion recognition unit derives instantaneous emotion data by analyzing the voice information with the emotion recognition model.

The emotion recognition server of claim 1,
wherein the emotion recognition unit derives continuous emotion data by analyzing the voice information with the emotion recognition model.

A method for recognizing emotion in an emotion recognition server, comprising:
receiving, from a user terminal, input data including voice information and utterance moment information, and identification information of a user;
extracting user attribute information based on the received identification information of the user;
selecting at least one emotion recognition model from among a plurality of pre-registered emotion recognition models based on the voice information, the utterance moment information, and the user attribute information; and
recognizing the user's emotion by analyzing the voice information with the selected emotion recognition model.

The method of claim 11,
The user attribute information includes at least one of the user's gender, age, residence area, past conversation history, preferred content service usage history, and service usage time distribution information.

The method of claim 11,
further comprising extracting, from the utterance moment information, at least one of image information and location information for the moment of the user's utterance.

The method of claim 13, further comprising:
analyzing the user's face based on the extracted image information;
selecting at least one emotion recognition model from among the plurality of pre-registered emotion recognition models based on the analyzed face information of the user; and
analyzing the voice information with the selected emotion recognition model and recognizing the user's emotion based on the analyzed voice information.

The method of claim 13, further comprising:
identifying the place where the user is located based on the extracted location information;
selecting at least one emotion recognition model from among the plurality of pre-registered emotion recognition models based on the identified place where the user is located; and
recognizing the user's emotion by analyzing the voice information with the selected emotion recognition model.

The method of claim 11,
wherein the plurality of pre-registered emotion recognition models include at least one of a gender-specific emotion recognition model, an age-specific emotion recognition model, a region-specific emotion recognition model, and an emotion recognition model specific to users of a particular service.

The method of claim 11,
further comprising transmitting the recognized emotion of the user to another user terminal.

The method of claim 11,
further comprising classifying conversation history data collected from a plurality of user terminals according to user-specific attribute information, and training individual learning models based on the classified user-specific attribute information.

The method of claim 11,
wherein the recognizing of the emotion comprises
deriving either instantaneous emotion data or continuous emotion data by analyzing the voice information with the emotion recognition model.

A computer program stored on a computer-readable medium and comprising a sequence of instructions for recognizing emotion,
wherein the computer program, when executed by a computing device, comprises a sequence of instructions that cause the computing device to:
receive, from a user terminal, input data including voice information and utterance moment information, and identification information of a user;
extract user attribute information based on the received identification information of the user;
select at least one emotion recognition model from among a plurality of pre-registered emotion recognition models based on the voice information, the utterance moment information, and the user attribute information; and
recognize the user's emotion by analyzing the voice information with the selected emotion recognition model.