KR20190140801A

KR20190140801A - A multimodal system for simultaneous emotion, age and gender recognition

Info

Publication number: KR20190140801A
Application number: KR1020180071850A
Authority: KR
Inventors: 이수영; 신영훈; 김태호; 채명수; 김준우
Original assignee: 한국과학기술원
Priority date: 2018-05-23
Filing date: 2018-06-22
Publication date: 2019-12-20

Abstract

Disclosed is a method for recognizing the emotion, age, and gender of a user based on image, voice, and text information. A recognition method performed by a recognition system according to an embodiment comprises the steps of: receiving estimated values estimated from different types of data; training on the input estimated values through a learning model; recognizing multimodal user identification information as the estimated values are learned through the learning model; and transmitting the recognized user identification information as a recognition result.

Description

A method for recognizing user's emotion, age and gender based on video, voice and text information {A MULTIMODAL SYSTEM FOR SIMULTANEOUS EMOTION, AGE AND GENDER RECOGNITION}

The following description relates to a technique for recognizing identification information of a user using result values estimated from image, voice, and text data.

As smartphone-based services have become widespread, artificial intelligence conversational chatbots have begun to appear. Existing text-based chatbots can hold conversations at the question-and-answer level, but they cannot tailor their responses to the speaker's state, such as emotion, age, and gender.

Meanwhile, much research has been conducted on determining a user's emotion from only one of the image, voice, and text modalities. However, judging a speaker's emotion, age, or gender from a single modality is limited in performance. In addition, image-based recognition may be difficult in dark surroundings, and voice-based recognition may be difficult in noisy environments.

When these three kinds of data are to be used at the same time, each performs differently, so it is difficult to combine the individual judgments effectively into a final decision. Simply averaging the probability outputs can degrade the performance of the data that is recognized well.
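The averaging problem described above can be illustrated with a small worked example. The sketch below is not taken from the disclosure: the probability values, the two-class setup, and the fixed weights are all hypothetical and chosen only to show how a plain average can overturn the answer of the one modality that is reliable.

```python
def average_fusion(estimates):
    """Plain average of the per-modality probability vectors."""
    n = len(estimates)
    return [sum(column) / n for column in zip(*estimates)]


def weighted_fusion(estimates, weights):
    """Weighted sum of the per-modality probability vectors."""
    return [sum(w * p for w, p in zip(weights, column)) for column in zip(*estimates)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical estimates over two classes; class 0 is assumed to be correct.
image = [0.60, 0.40]   # reliable modality, correct
voice = [0.20, 0.80]   # noisy environment, confidently wrong
text = [0.45, 0.55]    # weakly wrong
estimates = [image, voice, text]

averaged = average_fusion(estimates)
weighted = weighted_fusion(estimates, [0.8, 0.1, 0.1])  # hypothetical weights

print("average picks class", argmax(averaged))
print("weighted picks class", argmax(weighted))
```

With these illustrative numbers the plain average follows the two unreliable modalities, while a weighting that favors the reliable one keeps its answer. How such weights might be obtained is what the learning model of the embodiments addresses.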

A multimodal recognition system and method may be provided that recognize identification information of a user, including the user's emotion, age, and gender, by comprehensively utilizing image data, voice data, and text data.

A recognition method performed by a recognition system may include: receiving estimated values estimated from different types of data; training on the input estimated values through a learning model; recognizing multimodal identification information of a user as the estimated values are learned through the learning model; and transmitting the recognized identification information of the user as a recognition result.

The recognizing of the multimodal identification information of the user as the estimated values are learned through the learning model may include obtaining a recognition result related to the identification information of the user through the inner product of each of the estimated values and each of the output values obtained by learning each of the estimated values in the learning model.

The training on the input estimated values through the learning model may include obtaining output values by passing each of the estimated values, as an input value, through a hidden layer configured in the learning model.

The training on the input estimated values through the learning model may include simultaneously inputting each of the estimated values into a learning model constructed based on an artificial neural network and training on them.

The receiving of the estimated values estimated from the different types of data may include estimating the estimated values, for recognizing the identification information of the user, based on at least one of image data, voice data, and text data associated with the user.

The recognizing of the multimodal identification information of the user as the estimated values are learned through the learning model may include obtaining identification information of the user that includes at least one of the user's emotion, age, and gender.

A recognition system may include: an input unit that receives estimated values estimated from different types of data; a learning unit that trains on the input estimated values through a learning model; a recognition unit that recognizes multimodal identification information of a user as the estimated values are learned through the learning model; and a transmission unit that transmits the recognized identification information of the user as a recognition result.

The recognition unit may obtain a recognition result related to the identification information of the user through the inner product of each of the estimated values and each of the output values obtained by learning each of the estimated values in the learning model.

The learning unit may obtain output values by passing each of the estimated values, as an input value, through a hidden layer configured in the learning model.

The learning unit may simultaneously input each of the estimated values into a learning model constructed based on an artificial neural network and train on them.

The input unit may estimate the estimated values, for recognizing the identification information of the user, based on at least one of image data, voice data, and text data associated with the user.

The recognition unit may obtain identification information of the user that includes at least one of the user's emotion, age, and gender.

A recognition system according to an embodiment may improve the recognition rate of a user's identification information by learning different types of data through an artificial neural network.

A recognition system according to an embodiment may recognize a user's emotion, age, and gender on a multimodal basis using estimated values estimated from image data, voice data, and text data.

FIG. 1 is a diagram illustrating an example of a network environment according to an embodiment.
FIG. 2 is a block diagram illustrating the configuration of a recognition system according to an embodiment.
FIG. 3 is a flowchart illustrating a method of recognizing a user's emotion, age, and gender based on image, voice, and text information in a recognition system according to an embodiment.
FIG. 4 is a diagram illustrating an artificial neural network structure for recognizing identification information of a user in a recognition system according to an embodiment.
FIG. 5 is a diagram illustrating an operation of recognizing a user's emotion, age, and gender in a recognition system according to an embodiment.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating an example of a network environment according to an embodiment.

The network environment of FIG. 1 shows an example that includes an electronic device 110, a recognition system 100, and a network 120. FIG. 1 is an example for describing the invention, and the number of electronic devices or systems is not limited to that shown in FIG. 1.

The electronic device 110 may be a fixed terminal or a mobile terminal implemented as a computer device. Examples of the electronic device 110 include a smartphone, a mobile phone, a navigation device, a computer, a notebook computer, a digital broadcasting terminal, a PDA (Personal Digital Assistant), a PMP (Portable Multimedia Player), and a tablet PC. For example, the electronic device 110 may communicate with other electronic devices and/or the server 100 through the network 120 using a wireless or wired communication scheme.

The communication scheme is not limited and may include not only schemes that use a communication network the network 120 may include (for example, a mobile communication network, the wired Internet, the wireless Internet, or a broadcasting network) but also short-range wireless communication between devices. For example, the network 120 may include any one or more of a PAN (personal area network), a LAN (local area network), a CAN (campus area network), a MAN (metropolitan area network), a WAN (wide area network), a BBN (broadband network), the Internet, and the like. In addition, the network 120 may include any one or more network topologies, including but not limited to a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network.

The recognition system 100 may be implemented as a computer device, or a plurality of computer devices, that communicates with the electronic device 110 through the network 120 to provide commands, code, files, content, services, and the like. For example, the recognition system 100 may provide a file for installing an application to the electronic device 110 connected through the network 120. In this case, the electronic device 110 may install the application using the file provided by the recognition system 100. In addition, the electronic device 110 may access the recognition system 100 under the control of an operating system (OS) included in the electronic device 110 and at least one program (for example, a browser or the installed application) and receive the services or content provided by the recognition system 100. For example, when the electronic device 110 transmits a service request message to the recognition system 100 through the network 120 under the control of the application, the recognition system 100 may transmit code corresponding to the service request message to the electronic device 110, and the electronic device 110 may provide content to the user by composing and displaying a screen according to the code under the control of the application. In addition, the recognition system 100 may be a server and may provide a service that recognizes a user's emotion, age, and gender based on image, voice, and text information.

FIG. 2 is a block diagram illustrating the configuration of a recognition system according to an embodiment, and FIG. 3 is a flowchart illustrating a method of recognizing a user's emotion, age, and gender based on image, voice, and text information in a recognition system according to an embodiment.

The processor of the recognition system 100 may include an input unit 210, a learning unit 220, a recognition unit 230, and a transmission unit 240. These components of the processor may be representations of different functions performed by the processor in accordance with control instructions provided by program code stored in the recognition system. The processor and its components may control the recognition system to perform steps 310 to 340 of the method of FIG. 3 for recognizing a user's emotion, age, and gender based on image, voice, and text information. Here, the processor and its components may be implemented to execute instructions according to the code of an operating system included in a memory and the code of at least one program.

The processor may load into the memory the program code stored in a file of a program for the method of recognizing a user's emotion, age, and gender based on image, voice, and text information. For example, when the program is executed in the recognition system, the processor may control the recognition system to load the program code from the program file into the memory under the control of the operating system. Here, the processor and each of the input unit 210, the learning unit 220, the recognition unit 230, and the transmission unit 240 included in the processor may be different functional representations of the processor that execute the instructions of the corresponding portion of the program code loaded in the memory in order to perform the subsequent steps 310 to 340.

In step 310, the input unit 210 may receive estimated values estimated from different types of data. The input unit 210 may estimate the estimated values, for recognizing the identification information of the user, based on at least one of image data, voice data, and text data associated with the user.

In step 320, the learning unit 220 may train on the input estimated values through the learning model. The learning unit 220 may obtain output values by passing each of the estimated values, as an input value, through the hidden layer configured in the learning model. The learning unit 220 may simultaneously input each of the estimated values into the learning model constructed based on an artificial neural network and train on them.

In step 330, the recognition unit 230 may recognize the multimodal identification information of the user as the estimated values are learned through the learning model. The recognition unit 230 may obtain a recognition result related to the identification information of the user through the inner product of each of the estimated values and each of the output values obtained by learning each of the estimated values in the learning model.

In step 340, the transmission unit 240 may transmit the recognized identification information of the user as a recognition result. The transmission unit 240 may obtain identification information of the user that includes at least one of the user's emotion, age, and gender.

FIG. 4 is a diagram illustrating an artificial neural network structure for recognizing identification information of a user in a recognition system according to an embodiment.

The recognition system may recognize the user's identification information through machine learning. Machine learning is a field of artificial intelligence: a process of improving the technique for obtaining results by learning new information, acquiring knowledge that ties the learned information to the ability to use it efficiently, and performing tasks repeatedly. For example, the recognition system may analyze and infer the user's identification information through deep learning, a machine learning technique built on artificial neural networks so that a computer can learn on its own from varied data much as a person does. Deep learning trains a machine to distinguish objects by imitating the way the human brain processes information, finding patterns in large amounts of data and then classifying objects. With deep learning, a computer can perceive, infer, and judge on its own without a person specifying every decision criterion. Accordingly, the recognition system may infer the user's identification information using an artificial neural network such as a CNN, RNN, or DNN. The recognition system may construct an artificial-neural-network-based learning model for learning the user's identification information.

For example, the structure of the artificial neural network 400 configured in the learning model may include at least one FC (fully connected) hidden layer 410 (hereinafter, the hidden layer). The recognition system may feed estimated values of the user's identification information, obtained from image information, voice information, and text information, into the artificial neural network 400 as input values. That is, the recognition system may receive estimated values of the user's identification information, including emotion, age, and gender, recognized from the image information, the voice information, and the text information. Each of the estimated values then passes through the hidden layer of the learning model, and the recognition system may obtain the corresponding output value for each. The recognition system may derive the final recognition result 420 through the inner product of the estimated values and the output values. Here, the recognition system may treat each output value as a weight and recognize the user's identification information through the inner product of the estimated values and the weights.
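The forward pass described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the network of FIG. 4: the disclosure does not specify how the estimated values are arranged, the layer size, or the activation, so the sketch assumes that the three per-modality probability vectors are concatenated, that a single fully connected layer produces one output value per modality, and that a softmax normalizes those outputs into weights. The names `fuse`, `W`, and `b` are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(estimates, W, b):
    """Weight each modality's estimate by an output of the hidden layer.

    estimates: one probability vector per modality (image, voice, text).
    W, b:      parameters of one fully connected layer (assumed shape:
               number of modalities x number of concatenated inputs).
    """
    x = [p for est in estimates for p in est]            # concatenated inputs
    z = [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]
    weights = softmax(z)                                  # one weight per modality
    n_classes = len(estimates[0])
    # Inner product of the weights with the estimates, class by class.
    result = [sum(w * est[k] for w, est in zip(weights, estimates))
              for k in range(n_classes)]
    return weights, result


# Hypothetical emotion estimates over (joy, sadness, anger).
image = [0.7, 0.2, 0.1]
voice = [0.3, 0.4, 0.3]
text = [0.5, 0.3, 0.2]
estimates = [image, voice, text]

# Untrained layer: all-zero parameters give equal weights (plain averaging).
W = [[0.0] * 9 for _ in range(3)]
b = [0.0, 0.0, 0.0]
weights, result = fuse(estimates, W, b)
print(weights, result)
```

Because the weights sum to one and each estimate is a probability vector, the fused result is again a probability vector. Training would move the weights away from the uniform starting point toward the modalities that prove reliable.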

FIG. 5 is a diagram illustrating an operation of recognizing a user's emotion, age, and gender in a recognition system according to an embodiment.

The recognition system may recognize the multimodal identification information of the user using estimated values estimated from at least one of image data, voice data, and text data. A recognition service that the user 500 operates through the electronic device 110 will now be described. The service provided by the recognition system may be executed and operated on the electronic device 110 in the form of a platform or an application, or it may be executed and operated through communication with a server. For example, a function for the service may be set up on the electronic device 110. The electronic device 110 may be equipped with a sensor and a camera capable of recognizing text information and image information such as the face and facial expression of the user 500, and with a microphone, a speaker, and the like capable of recognizing the user's voice information. Here, the electronic device 110 may be a device that allows input and output of several modalities at once during user-computer interaction and exchanges intent with the user through a combination of multiple modalities and integrated interpretation of the input signals.

In other words, the recognition system may collect text data, image data, voice data, and the like related to the user on a multimodal basis through the electronic device 110. For example, voice data may be acquired as the user converses with an artificial intelligence provided by the electronic device 110. For example, the text data may be data containing text, such as handwriting or text written by the user; the image data may be data containing the user's face, facial expression, and the like; and the voice data may be data containing the user's voice, intonation, tone, speaking rate, and the like.

The recognition system may receive estimated values estimated from different types of data. For example, the recognition system may receive each estimated value estimated from text data, image data, or voice data related to the user. The recognition system may perform an analysis for recognizing the user's identification information from the text data. Likewise, the recognition system may perform an analysis for recognizing the user's identification information from the image data, and an analysis for recognizing the user's identification information from the voice data. Here, the estimated values may be estimated by performing the analysis after a preprocessing step that removes noise contained in the image data and the voice data. For example, the recognition system may estimate the estimated values by comparing previously stored data with each of the text data, the image data, and the voice data.
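One way to read "comparing previously stored data" with incoming data is template matching on feature vectors. The sketch below assumes that reading; the disclosure does not state how the comparison is performed. Feature extraction and noise removal are assumed to have already happened, and the cosine similarity measure, the `temperature` parameter, the reference vectors, and the name `estimate` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def estimate(features, references, temperature=0.1):
    """Turn similarities to stored reference vectors into a probability estimate.

    features:   preprocessed feature vector for one modality (assumed given).
    references: mapping from class name to a stored reference vector.
    """
    names = list(references)
    scores = [cosine(features, references[name]) / temperature for name in names]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


# Hypothetical stored reference vectors for two emotions.
references = {"joy": [1.0, 0.0, 0.2], "sadness": [0.0, 1.0, 0.1]}
probs = estimate([0.9, 0.1, 0.2], references)
print(probs)
```

The same function could be applied once per modality, yielding the three estimated values that are later passed to the learning model.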

The recognition system may train on the input estimated values through the learning model. For example, the recognition system may feed into the artificial neural network the estimated values it obtains by analyzing each kind of data itself. Alternatively, the recognition system may feed into the artificial neural network estimated values obtained when a system other than the recognition system analyzes each kind of data. Here, the recognition system may construct a learning model based on an artificial neural network and may obtain output values by training on the estimated values with that model. For example, the learning model may consist of an artificial neural network with at least one hidden layer. The recognition system may obtain an output value by passing an estimated value through the hidden layer and, taking the obtained output value as the result value, hold both the estimated value and the output value. The recognition system may input the estimated values into the learning model simultaneously for training. Training on the estimated values in this way can improve not only the accuracy of the result values but also the performance of recognizing the user's identification information. The recognition system may obtain the final recognition result through the inner product of the estimated values and the output values. Here, the recognition system may treat the output values as weights and derive the final recognition result through the inner product of the estimated values and the weights.
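The training described above can be sketched end to end on synthetic data. Everything specific here is an assumption made for illustration: the disclosure does not give a loss function, an optimizer, or a dataset, so the sketch uses a cross-entropy loss on the fused result, full-batch gradient descent, a single softmax-normalized fully connected layer, and generated samples in which the image estimate is always right while the voice and text estimates guess at random.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def forward(estimates, W, b):
    """Return (weights, fused); weights holds one value per modality."""
    x = [p for est in estimates for p in est]
    z = [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]
    weights = softmax(z)
    fused = [sum(w * est[k] for w, est in zip(weights, estimates))
             for k in range(len(estimates[0]))]
    return weights, fused


def make_sample(rng):
    """Synthetic sample: image is always right, voice and text guess randomly."""
    label = rng.randrange(2)

    def peaked(cls):
        return [0.9 if k == cls else 0.1 for k in range(2)]

    return [peaked(label), peaked(rng.randrange(2)), peaked(rng.randrange(2))], label


def mean_loss(data, W, b):
    return sum(-math.log(forward(est, W, b)[1][y]) for est, y in data) / len(data)


def train(data, epochs=300, lr=0.2):
    n_mod, n_in = 3, 6
    W = [[0.0] * n_in for _ in range(n_mod)]
    b = [0.0] * n_mod
    for _ in range(epochs):
        gW = [[0.0] * n_in for _ in range(n_mod)]
        gb = [0.0] * n_mod
        for est, y in data:
            weights, fused = forward(est, W, b)
            x = [p for e in est for p in e]
            for j in range(n_mod):
                # Gradient of -log(fused[y]) with respect to the j-th logit.
                dz = weights[j] * (1.0 - est[j][y] / fused[y])
                gb[j] += dz
                for i in range(n_in):
                    gW[j][i] += dz * x[i]
        for j in range(n_mod):
            b[j] -= lr * gb[j] / len(data)
            for i in range(n_in):
                W[j][i] -= lr * gW[j][i] / len(data)
    return W, b


rng = random.Random(0)
data = [make_sample(rng) for _ in range(200)]
loss_before = mean_loss(data, [[0.0] * 6 for _ in range(3)], [0.0] * 3)
W, b = train(data)
loss_after = mean_loss(data, W, b)
mean_weights = [sum(forward(est, W, b)[0][j] for est, _ in data) / len(data)
                for j in range(3)]
print(loss_before, loss_after, mean_weights)
```

In this toy setting the layer starts from equal weights, which is equivalent to plain averaging, and training shifts weight toward the modality whose estimates agree with the labels.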

The recognition system may determine the user's identification information by comprehensively utilizing all of the user's text data, image data, and voice data. For example, the recognition system may determine the user's emotion, age, and gender. The recognition system may provide the final recognition result to the electronic device 110 for output. As the recognition system transmits the final recognition result to the electronic device 110, a recognition result including the user's age, gender, and emotion may be output on the electronic device 110. The recognition system may also display the accuracy of the determination together with the determined age, gender, and emotion. In addition, when the recognition system cannot determine the user's age precisely, it may provide the result as an age band within an estimated, predefined range. For the user's emotion, the recognition system may select one or more emotions from among emotions defined and classified in advance and provide them as the result. For example, when a plurality of emotions such as joy, sadness, rage, and anger are set in advance, one or more emotions that apply to the user may be selected and provided as the result. In addition, the recognition system may perform the analysis by date, by day of the week, or by time of day, and may provide recognition results analyzed on that basis. Alternatively, the recognition system may provide a recognition result obtained from data covering a predetermined period relative to the time at which the user requested recognition.
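The result handling described above (an age band when the age is uncertain, one or more emotions chosen from a predefined set, and an accuracy figure shown with the result) may be sketched as follows. The ten-year band width, the thresholds, the use of the top probability as the displayed accuracy figure, and the names `age_band`, `select_emotions`, and `build_result` are hypothetical choices made for illustration and are not specified in the disclosure.

```python
def age_band(age, width=10):
    """Map an age estimate to a band such as '20-29' (band width is assumed)."""
    low = (int(age) // width) * width
    return f"{low}-{low + width - 1}"


def select_emotions(probs, threshold=0.3):
    """Pick every predefined emotion at or above the threshold, best first.

    Falls back to the single most likely emotion if none reaches the threshold.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    chosen = [name for name, p in ranked if p >= threshold]
    return chosen or [ranked[0][0]]


def build_result(age, age_confidence, gender_probs, emotion_probs,
                 min_age_confidence=0.7):
    """Assemble the recognition result that would be sent to the device."""
    gender = max(gender_probs, key=gender_probs.get)
    precise = age_confidence >= min_age_confidence
    return {
        "age": int(age) if precise else age_band(age),
        "gender": gender,
        "gender_confidence": gender_probs[gender],
        "emotions": select_emotions(emotion_probs),
    }


result = build_result(
    age=27,
    age_confidence=0.4,
    gender_probs={"female": 0.8, "male": 0.2},
    emotion_probs={"joy": 0.5, "sadness": 0.35, "anger": 0.15},
)
print(result)
```

Returning a plain dictionary keeps the sketch independent of any particular transport between the recognition system and the electronic device 110.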

When the recognition system collects additional data related to the user, it may train on that data through the learning model and thereby update and provide the recognition result for the user's identification information. Furthermore, the recognition system may recognize identification information other than the user's age, gender, and emotion.

A recognition system according to an embodiment may provide a higher recognition rate. The recognition system may perform recognition by using the different types of data so that they complement one another.
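One way to picture this complementary use of modalities is the following minimal Python sketch. It is only one possible reading of the claimed steps (passing the per-modality estimated values through a hidden layer and combining the resulting output values with the estimated values by an inner product), not the disclosed implementation. The function names, the softmax normalization, and the single dense layer are assumptions, and the weights shown in the usage note are untrained placeholders; in practice they would be learned.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hidden_layer(inputs, weights, biases):
    """One dense layer; weights holds one row of coefficients per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def fuse(estimates, weights, biases):
    """Fuse per-modality class estimates into a single estimate.

    estimates: list of M vectors (e.g. image, voice, text), each of length C.
    weights, biases: parameters of an assumed hidden layer with M output units.
    """
    # All estimated values enter the layer at the same time.
    flat = [value for est in estimates for value in est]
    # Assumed gating step: one normalized weight per modality.
    gate = softmax(hidden_layer(flat, weights, biases))
    n_classes = len(estimates[0])
    # For each class, take the inner product of the gate vector with that
    # class's estimates across the modalities.
    return [
        sum(g * est[c] for g, est in zip(gate, estimates))
        for c in range(n_classes)
    ]
```

With all-zero placeholder weights the gate is uniform and the fused estimate is the plain average of the three modalities; a layer that has learned to distrust a modality (for example, video in a dark scene) would shift weight toward the others.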

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that execute on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that a computer can execute using an interpreter or the like.

Although the embodiments have been described with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art may make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the claims that follow.

Claims

A recognition method performed by a recognition system, the method comprising:
receiving estimated values estimated from different types of data;
training on the received estimated values through a learning model;
recognizing multimodal-based identification information of a user as the estimated values are learned through the learning model; and
delivering the recognized identification information of the user as a recognition result.

The method of claim 1,
wherein recognizing the multimodal-based identification information of the user as the estimated values are learned through the learning model comprises:
acquiring a recognition result related to the identification information of the user through the inner product of each of the estimated values and the corresponding output value obtained by training on that estimated value in the learning model.

The method of claim 1,
wherein training on the received estimated values through the learning model comprises:
acquiring an output value by passing each of the estimated values, as an input value, through a hidden layer configured in the learning model.

The method of claim 3,
wherein training on the received estimated values through the learning model comprises:
training by inputting each of the estimated values simultaneously into a learning model constructed based on an artificial neural network.

The method of claim 1,
wherein receiving the estimated values estimated from the different types of data comprises:
estimating an estimated value for recognizing the identification information of the user based on at least one of image data, voice data, and text data associated with the user.

The method of claim 2,
wherein recognizing the multimodal-based identification information of the user as the estimated values are learned through the learning model comprises:
obtaining identification information of the user that includes at least one of the user's emotion, age, or gender.

A recognition system comprising:
an input unit that receives estimated values estimated from different types of data;
a learning unit that trains on the received estimated values through a learning model;
a recognition unit that recognizes multimodal-based identification information of a user as the estimated values are learned through the learning model; and
a delivery unit that delivers the recognized identification information of the user as a recognition result.

The recognition system of claim 7,
wherein the recognition unit
acquires a recognition result related to the identification information of the user through the inner product of each of the estimated values and the corresponding output value obtained by training on that estimated value in the learning model.

The recognition system of claim 7,
wherein the learning unit
acquires an output value by passing each of the estimated values, as an input value, through a hidden layer configured in the learning model.

The recognition system of claim 9,
wherein the learning unit
trains by inputting each of the estimated values simultaneously into a learning model constructed based on an artificial neural network.

The recognition system of claim 7,
wherein the input unit
estimates an estimated value for recognizing the identification information of the user based on at least one of image data, voice data, and text data associated with the user.

The recognition system of claim 8,
wherein the recognition unit
acquires identification information of the user that includes at least one of the user's emotion, age, or gender.