KR20200119368A

KR20200119368A - Electronic apparatus based on recurrent neural network of attention using multimodal data and operating method thereof

Info

Publication number: KR20200119368A
Application number: KR1020190032654A
Authority: KR
Inventors: 이수영; 신영훈; 김태호; 최신국; 김태훈
Original assignee: 한국과학기술원
Priority date: 2019-03-22
Filing date: 2019-03-22
Publication date: 2020-10-20
Also published as: WO2020196976A1; KR102183280B1

Abstract

The present invention relates to an electronic device based on an attention recurrent neural network that uses multimodal data, and to an operating method thereof. According to various embodiments of the present invention, the method is performed on the basis of an attention recurrent neural network using multimodal data and comprises: detecting multimodal data related to at least two of a user's image, voice, and text; calculating a first attention variable based on the multimodal data; calculating a second attention variable based on the multimodal data and the first attention variable; and inferring a result value based on the second attention variable. The electronic device can therefore recognize the user's emotion more accurately and stably.

Description

Electronic device based on an attention recurrent neural network using multimodal data and operating method thereof {ELECTRONIC APPARATUS BASED ON RECURRENT NEURAL NETWORK OF ATTENTION USING MULTIMODAL DATA AND OPERATING METHOD THEREOF}

Various embodiments relate to an electronic device based on an attention recurrent neural network using multimodal data, and to a method of operating the same.

Electronic devices recently released on the market, such as artificial intelligence speakers, support only question-and-answer level conversation. However, as demand for emotional dialogue grows and deep learning technology advances, deep learning techniques that estimate a user's emotion from the user's image (face), voice, or text are being developed.

However, an electronic device that estimates a user's emotion from a single input value, as described above, is less accurate than one that estimates the emotion from multiple input values. This limits the ability of the electronic device to provide a stable estimate of the user's emotion. Accordingly, there is a need for a way for the electronic device to estimate the user's emotion more accurately and stably.

An operating method of an electronic device according to various embodiments is performed on the basis of an attention recurrent neural network using multimodal data, and may include: detecting multimodal data related to at least two of a user's image, voice, or text; calculating a first attention variable based on the multimodal data; calculating a second attention variable based on the multimodal data and the first attention variable; and inferring a result value based on the second attention variable.

An electronic device according to various embodiments operates on the basis of an attention recurrent neural network using multimodal data, and may include an input module and a processor connected to the input module. According to various embodiments, the processor may be configured to detect, through the input module, multimodal data related to at least two of a user's image, voice, or text, calculate a first attention variable based on the multimodal data, calculate a second attention variable based on the multimodal data and the first attention variable, and infer a result value based on the second attention variable.

According to various embodiments, the electronic device may recognize the user's emotion from at least two, or all three, of the user's voice, image, and text. Because the electronic device repeatedly performs a specific operation based on the attention recurrent neural network, the influence of each of the user's voice, image, and text can be taken into account in recognizing the user's emotion. For example, when at least one of the user's voice, image, or text is not received or is heavily corrupted by noise, the electronic device may reduce the influence of that input in recognizing the user's emotion. Through this, the electronic device may recognize the user's emotion more accurately and stably.

FIG. 1 is a diagram illustrating an electronic device according to various embodiments.
FIG. 2 is a diagram illustrating the processor of FIG. 1.
FIG. 3 is a diagram illustrating a method of operating an electronic device according to various embodiments.
FIG. 4 is a diagram illustrating the operation of calculating the second attention variable of FIG. 3.
FIG. 5 is a diagram for describing a method of operating an electronic device according to various embodiments.

Hereinafter, various embodiments of the present document will be described with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating an electronic device 100 according to various embodiments.

Referring to FIG. 1, the electronic device 100 according to various embodiments recognizes a user's emotion based on an attention recurrent neural network using multimodal data, and may include at least one of an input module 110, an output module 120, a memory 130, or a processor 140. In some embodiments, at least one of the components of the electronic device 100 may be omitted, or one or more other components may be added to the electronic device 100.

The input module 110 may receive, from outside the electronic device 100, commands or data to be used by components of the electronic device 100. The input module 110 may include at least one of an input device configured to let the user enter commands or data directly into the electronic device 100, or a communication device configured to receive commands or data by communicating with an external electronic device by wire or wirelessly. For example, the input device may include at least one of a microphone, a mouse, a keyboard, or a camera. For example, the communication device may include at least one of a wired communication device or a wireless communication device, and the wireless communication device may include at least one of a short-range communication device or a long-range communication device.

The output module 120 may provide information to the outside of the electronic device 100. The output module 120 may include at least one of an audio output device configured to output information audibly, a display device configured to output information visually, or a communication device configured to transmit information by communicating with an external electronic device by wire or wirelessly. For example, the communication device may include at least one of a wired communication device or a wireless communication device, and the wireless communication device may include at least one of a short-range communication device or a long-range communication device.

The memory 130 may store data used by components of the electronic device 100. The data may include input data or output data for a program or for commands related to the program. For example, the memory 130 may include at least one of a volatile memory or a nonvolatile memory.

The processor 140 may execute a program in the memory 130 to control components of the electronic device 100 and to perform data processing or computation. The processor 140 may detect a user's multimodal data through the input module 110. The multimodal data may be related to at least two of the user's voice, image, or text. According to one embodiment, the processor 140 may receive the multimodal data, as image data related to the user, from an external electronic device (not shown) through the communication device. According to another embodiment, the processor 140 may collect the multimodal data, as image data related to the user, through the input device. The processor 140 may then recognize the user's emotion based on the multimodal data. That is, the processor 140 may recognize the user's emotion from at least two of the user's voice, image, or text. In doing so, the processor 140 may recognize the user's emotion based on the attention recurrent neural network. Through this, the processor 140 may recognize the user's emotion more accurately and stably. The processor 140 may provide at least one of the result of recognizing the user's emotion or a service related to that result. Here, the processor 140 may output the recognition result through the output module 120.

FIG. 2 is a diagram illustrating the processor 140 of FIG. 1.

Referring to FIG. 2, the processor 140 may include at least one of a preprocessor 210, a single-modal input unit 220, an attention unit 230, a single-modal hidden unit 240, a single-modal output unit 250, a modal combining unit 260, an emotion recognition unit 270, or a base unit 280.

The preprocessor 210 may detect feature points from the multimodal data. The preprocessor 210 may include an audio processing unit 211, an image processing unit 213, and a text processing unit 215. The audio processing unit 211 may detect at least one feature point related to the user's voice from the multimodal data input from the input module 110. For example, the audio processing unit 211 may detect the feature point in the form of a log mel spectrogram. The image processing unit 213 may detect at least one feature point related to the user's image from the multimodal data. For example, the image processing unit 213 may crop the user's face image. The text processing unit 215 may detect at least one feature point related to the user's text from the multimodal data. For example, the text processing unit 215 may convert the user's voice into text and represent the text as a vector using a sentence embedding vector.

The single-modal input unit 220 may input the feature points as single-modal data. The single-modal data may include first single-modal data, second single-modal data, and third single-modal data. The single-modal input unit 220 may include a first single input unit 221, a second single input unit 223, and a third single input unit 225. The first single input unit 221 may acquire the feature point related to the user's voice as the first single-modal data. The second single input unit 223 may acquire the feature point related to the user's image as the second single-modal data. The third single input unit 225 may acquire the feature point related to the user's text as the third single-modal data.
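As an illustration of the three single-modal inputs, the following is a minimal Python sketch of a container for them. The class name `SingleModalData`, its field names, and the use of `None` for a modality that was not received are assumptions made for this example; the document does not prescribe a data layout. The check in `validate` reflects the requirement that the multimodal data relate to at least two of voice, image, and text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SingleModalData:
    """Feature vectors per modality; None marks a modality that was not received.

    Hypothetical container: names and layout are assumptions for illustration.
    """
    voice: Optional[List[float]] = None  # first single-modal data
    image: Optional[List[float]] = None  # second single-modal data
    text: Optional[List[float]] = None   # third single-modal data

    def available(self) -> Dict[str, List[float]]:
        """Return only the modalities that are present."""
        fields = {"voice": self.voice, "image": self.image, "text": self.text}
        return {name: feats for name, feats in fields.items() if feats is not None}

    def validate(self) -> None:
        """Raise ValueError unless at least two modalities are present."""
        if len(self.available()) < 2:
            raise ValueError("at least two modalities are required")
```

Calling `SingleModalData(voice=[0.1, 0.2], text=[0.3]).validate()` passes, while an instance holding only one modality raises `ValueError`.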

The attention unit 230 may calculate attention variables corresponding to the single-modal data. The attention unit 230 may calculate the first attention variable based on the single-modal data. The attention unit 230 may then calculate the second attention variable based on the single-modal data and weights. Here, the attention unit 230 may calculate the second attention variable a predetermined number of times. The attention unit 230 may include a first attention unit 231, a second attention unit 233, and a third attention unit 235. The first attention unit 231 may calculate the first attention variable or the second attention variable corresponding to the first single-modal data. The second attention unit 233 may calculate the first attention variable or the second attention variable corresponding to the second single-modal data. The third attention unit 235 may calculate the first attention variable or the second attention variable corresponding to the third single-modal data.

The single-modal hidden unit 240 may obtain, through the learning algorithm of the recurrent neural network and based on the attention variable, an emotion inference value corresponding to the single-modal data. The single-modal hidden unit 240 may obtain a first emotion inference value based on the first attention variable. The single-modal hidden unit 240 may then obtain a second emotion inference value based on the second attention variable. The single-modal hidden unit 240 may include a first hidden unit 241, a second hidden unit 243, and a third hidden unit 245. The first hidden unit 241 may obtain the first emotion inference value or the second emotion inference value corresponding to the first single-modal data. The second hidden unit 243 may obtain the first emotion inference value or the second emotion inference value corresponding to the second single-modal data. The third hidden unit 245 may obtain the first emotion inference value or the second emotion inference value corresponding to the third single-modal data.

The single-modal output unit 250 may output the emotion inference value corresponding to the single-modal data. The single-modal output unit 250 may output the first emotion inference value or the second emotion inference value. The single-modal output unit 250 may include a first single output unit 251, a second single output unit 253, and a third single output unit 255. The first single output unit 251 may output, to the modal combining unit 260, the first emotion inference value or the second emotion inference value corresponding to the first single-modal data. The second single output unit 253 may output, to the modal combining unit 260, the first emotion inference value or the second emotion inference value corresponding to the second single-modal data. The third single output unit 255 may output, to the modal combining unit 260, the first emotion inference value or the second emotion inference value corresponding to the third single-modal data.

The modal combining unit 260 may combine the emotion inference values corresponding to the single-modal data. The modal combining unit 260 may combine the first emotion inference values corresponding to the first single-modal data, the second single-modal data, and the third single-modal data, respectively. The modal combining unit 260 may likewise combine the second emotion inference values corresponding to the first single-modal data, the second single-modal data, and the third single-modal data, respectively.
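The document does not state how the per-modality emotion inference values are combined. As one possible reading, the sketch below averages them element by element. The function name `combine_inferences` and the choice of a plain mean are assumptions for illustration only; a learned fusion layer or a concatenation would fit the description equally well.

```python
from typing import Dict, List


def combine_inferences(per_modality: Dict[str, List[float]]) -> List[float]:
    """Average per-modality emotion inference vectors element by element.

    Assumption: the combination is a plain mean; the source does not specify it.
    """
    vectors = list(per_modality.values())
    if not vectors:
        raise ValueError("no inference values to combine")
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all inference vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(length)]
```

For example, combining `[0.6, 0.4]` for voice with `[0.2, 0.8]` for image gives a vector close to `[0.4, 0.6]`.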

The emotion recognition unit 270 may recognize the user's emotion based on the combined emotion inference value. The emotion recognition unit 270 may transfer the combined first emotion inference value to the base unit 280. The emotion recognition unit 270 may also transfer the combined second emotion inference value to the base unit 280. Here, the emotion recognition unit 270 may transfer the combined second emotion inference value to the base unit 280 a predetermined number of times.

The base unit 280 may obtain weights corresponding to the single-modal data from the combined emotion inference value. The base unit 280 may determine the influence of each single-modal data on the combined emotion inference value and assign a weight to each single-modal data according to that influence. For example, the smaller the influence, the lower the assigned weight. That is, the base unit 280 may obtain, from the combined first emotion inference value, weights respectively related to the user's voice, image, or data. The base unit 280 may likewise obtain, from the combined second emotion inference value, weights respectively related to the user's voice, image, or data. The base unit 280 may then provide the weights to the attention unit 230. The base unit 280 may include a first base unit 281, a second base unit 283, and a third base unit 285. The first base unit 281 may provide the weight corresponding to the first single-modal data to the first attention unit 231. The second base unit 283 may provide the weight corresponding to the second single-modal data to the second attention unit 233. The third base unit 285 may provide the weight corresponding to the third single-modal data to the third attention unit 235.
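The rule that a smaller influence yields a lower weight can be sketched as follows. The influence measure used here (the dot product between a modality's emotion inference vector and the combined vector) and the softmax normalization are assumptions chosen for illustration; the document only says that weights follow influence. The function name `modality_weights` is likewise hypothetical.

```python
import math
from typing import Dict, List


def modality_weights(per_modality: Dict[str, List[float]],
                     combined: List[float]) -> Dict[str, float]:
    """Return one weight per modality; the weights sum to 1.

    Assumption: influence is the dot product between a modality's inference
    vector and the combined vector, normalized with a softmax.
    """
    influence = {
        name: sum(b * c for b, c in zip(vec, combined))
        for name, vec in per_modality.items()
    }
    peak = max(influence.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(value - peak) for name, value in influence.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}
```

With this measure, a modality whose inference disagrees with the combined value receives a lower weight than one that agrees with it, which matches the behavior described for a missing or noisy input.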

The electronic device 100 according to various embodiments operates on the basis of an attention recurrent neural network using multimodal data, and may include an input module 110 and a processor 140 connected to the input module 110.

According to various embodiments, the processor 140 may be configured to detect, through the input module 110, multimodal data related to at least two of a user's image, voice, or text, calculate a first attention variable based on the multimodal data, calculate a second attention variable based on the multimodal data and the first attention variable, and infer a result value based on the second attention variable.

According to various embodiments, the processor 140 may be configured to obtain a first emotion inference value based on the first attention variable, and to calculate the second attention variable based on the multimodal data and the first emotion inference value.

According to various embodiments, the processor 140 may be configured to recognize the user's emotion as the result value based on the second attention variable.

According to various embodiments, the processor 140 may be configured to obtain a second emotion inference value based on the second attention variable, and to recalculate the second attention variable based on the multimodal data and the second emotion inference value.

According to various embodiments, after recalculating the second attention variable, the processor 140 may return to the operation of obtaining the second emotion inference value.

According to various embodiments, the processor 140 may be configured to obtain a second emotion inference value based on the second attention variable, and to recognize the emotion through the second emotion inference value.

According to various embodiments, the processor 140 may be configured to analyze the multimodal data to obtain first single-modal data related to the voice, second single-modal data related to the image, and third single-modal data related to the text, and to calculate the first attention variable corresponding to each of the first single-modal data, the second single-modal data, and the third single-modal data.

According to various embodiments, the processor 140 may be configured to obtain, from the first emotion inference value, weights respectively related to the voice, image, and data, and to calculate the second attention variable corresponding to each of the first single-modal data, the second single-modal data, and the third single-modal data, based on those single-modal data and the weights.

According to various embodiments, the processor 140 may be configured to obtain, from the second emotion inference value, weights respectively related to the voice, image, and data, and to recalculate the second attention variable corresponding to each of the first single-modal data, the second single-modal data, and the third single-modal data, based on those single-modal data and the weights.

FIG. 3 is a diagram illustrating a method of operating the electronic device 100 according to various embodiments. FIG. 5 is a diagram for describing a method of operating an electronic device according to various embodiments.

Referring to FIG. 3, the electronic device 100 may detect multimodal data (X) in operation 310. The processor 140 may detect the user's multimodal data (X) through the input module 110. The multimodal data (X) may be related to at least two of the user's voice, image, or text. According to one embodiment, the processor 140 may receive the multimodal data (X), as image data related to the user, from an external electronic device (not shown) through the communication device. According to another embodiment, the processor 140 may collect the multimodal data (X), as image data related to the user, through the input device.

In operation 320, the electronic device 100 may calculate the first attention variable (A₀) based on the multimodal data (X). The processor 140 may analyze the multimodal data (X) to obtain first single-modal data related to the voice, second single-modal data related to the image, and third single-modal data related to the text. The processor 140 may then calculate the first attention variable (A₀) corresponding to each of the first single-modal data, the second single-modal data, and the third single-modal data.

In operation 330, the electronic device 100 may calculate the second attention variables (A₁, …, A_k) based on the multimodal data (X). The processor 140 may calculate the second attention variables (A₁, …, A_k) a predetermined number of times (k). Here, the processor 140 may calculate the second attention variables (A₁, …, A_k) corresponding to each of the first single-modal data, the second single-modal data, and the third single-modal data. The processor 140 may first calculate the second attention variable (A₁) based on the multimodal data (X). Here, the processor 140 may obtain the first emotion inference value (B₀-C₀) based on the first attention variable (A₀), and calculate the second attention variable (A₁) based on the multimodal data (X) and the first emotion inference value (B₀-C₀). The processor 140 may then recalculate the second attention variables (A₂, …, A_k) based on the multimodal data (X) and the second attention variable (A₁). Here, the processor 140 may obtain the second emotion inference value (B₁-C₁) based on the second attention variable (A₁), and recalculate the second attention variables (A₂, …, A_k) based on the multimodal data (X) and the second emotion inference values (B₁-C₁, …, B_k-C_k). FIG. 4 is a diagram illustrating the operation of calculating the second attention variables (A₁, …, A_k) of FIG. 3.
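The control flow of operation 330 (A₀, then B₀-C₀, then A₁, and so on up to A_k) can be sketched as a short loop. This is only an outline under stated assumptions: `first_attention`, `infer`, and `update` are hypothetical callables standing in for the attention, emotion inference, and recalculation steps, whose internals the document does not fully specify, and the function name `refine_attention` is invented for this example.

```python
from typing import Callable, List, TypeVar

X = TypeVar("X")  # multimodal data
A = TypeVar("A")  # attention variable
B = TypeVar("B")  # emotion inference value (B_n-C_n)


def refine_attention(
    x: X,
    first_attention: Callable[[X], A],
    infer: Callable[[A], B],
    update: Callable[[X, B], A],
    k: int,
) -> List[A]:
    """Return [A_0, A_1, ..., A_k] for a predetermined number of iterations k."""
    attention = first_attention(x)         # A_0 from X
    history = [attention]
    for _ in range(k):                     # n = 1 .. k
        inference = infer(attention)       # B_{n-1}-C_{n-1} from A_{n-1}
        attention = update(x, inference)   # A_n from X and B_{n-1}-C_{n-1}
        history.append(attention)
    return history
```

Passing the three steps as callables keeps the loop independent of how each step is realized; with toy integer stand-ins such as `infer = lambda a: a + 1` and `update = lambda x, b: x * b`, an input of 2 and k = 2 yields the sequence 2, 6, 14.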

Referring to FIG. 4, in operation 410, the electronic device 100 may obtain a first emotion inference value (B₀-C₀). The processor 140 may obtain the first emotion inference value (B₀-C₀) based on the first attention variable (A₀). Here, the processor 140 may obtain a first emotion inference value (B₀) corresponding to each of the first single modal data, the second single modal data, and the third single modal data. The processor 140 may then combine these per-modality first emotion inference values (B₀) to obtain a combined first emotion inference value (C₀).
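
Operation 410 can be illustrated with a minimal sketch. The description does not specify how the per-modality values (B₀) are produced or how they are combined into C₀, so the code below assumes that each modality yields raw emotion scores, that a softmax turns them into a distribution, and that the combination is a plain average. The names `softmax`, `combine`, and the example scores are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution (assumed form of B0)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine(per_modality):
    """Average the per-modality distributions (assumed fusion rule for C0)."""
    count = len(per_modality)
    dims = len(next(iter(per_modality.values())))
    return [sum(p[i] for p in per_modality.values()) / count for i in range(dims)]

# Hypothetical emotion scores derived from the first attention variable A0.
scores_from_A0 = {
    "voice": [2.0, 0.5, 0.1],
    "image": [1.5, 1.0, 0.2],
    "text": [0.3, 0.2, 2.5],
}
B0 = {name: softmax(s) for name, s in scores_from_A0.items()}  # per-modality values
C0 = combine(B0)                                               # combined value
```

Averaging is only one possible fusion rule; a learned fusion layer would fit the same interface.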

In operation 420, the electronic device 100 may set the number of calculations (n) for the second attention variable (A_n) to 1. In operation 430, the electronic device 100 may calculate the second attention variable (A₁) based on the multimodal data (X) and the first emotion inference value (B₀-C₀). The processor 140 may calculate the second attention variable (A₁) for each of the first single modal data, the second single modal data, and the third single modal data. To this end, the processor 140 may obtain weights (D₀) related to the user's voice, image, and data, respectively, from the first emotion inference value (B₀-C₀). Here, the processor 140 may determine the influence of each of the first single modal data, the second single modal data, and the third single modal data on the combined first emotion inference value (C₀), and may assign the weights (D₀) to the first single modal data, the second single modal data, and the third single modal data according to that influence. The processor 140 may then calculate the second attention variable (A₁) by multiplying the first single modal data, the second single modal data, and the third single modal data by their corresponding weights (D₀).
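
The weighting step of operation 430 can be sketched as follows. The description says only that weights (D₀) reflect each modality's influence on the combined value (C₀) and are multiplied with the single modal data. The specific influence measure used here (the dot product between a modality's distribution and C₀, normalized to sum to 1) is an assumption made for illustration, as are the function names and the example numbers.

```python
def modality_weights(B, C):
    """Assumed influence measure: agreement between each modality and C."""
    agreement = {name: sum(b * c for b, c in zip(dist, C)) for name, dist in B.items()}
    total = sum(agreement.values())
    return {name: value / total for name, value in agreement.items()}

def apply_weights(X, D):
    """Multiply each modality's features by its weight to form the attention variable."""
    return {name: [D[name] * v for v in feats] for name, feats in X.items()}

# Hypothetical per-modality values B0 and their average C0.
B0 = {
    "voice": [0.7, 0.2, 0.1],
    "image": [0.6, 0.3, 0.1],
    "text": [0.1, 0.2, 0.7],
}
C0 = [sum(dist[i] for dist in B0.values()) / len(B0) for i in range(3)]

# Hypothetical single modal feature vectors making up the multimodal data X.
X = {"voice": [1.0, 2.0], "image": [0.5, 0.5], "text": [3.0, 1.0]}

D0 = modality_weights(B0, C0)  # one weight per modality
A1 = apply_weights(X, D0)      # second attention variable A1
```

In this example the text modality disagrees with the other two, so it receives the smallest weight and its features are scaled down the most.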

In operation 440, the electronic device 100 may obtain second emotion inference values (B₁-C₁, …, B_k-C_k) based on the second attention variables (A₁, …, A_k). Here, the processor 140 may obtain a second emotion inference value (B₁, …, B_k) corresponding to each of the first single modal data, the second single modal data, and the third single modal data, and may combine them into a combined second emotion inference value (C₁, …, C_k).

In operation 450, the electronic device 100 may determine whether the number of calculations (n) for the second attention variable (A_n) has reached the predetermined number (k). If it is determined in operation 450 that the number of calculations (n) has reached the predetermined number (k), the electronic device 100 may return to FIG. 3.

Meanwhile, if it is determined in operation 450 that the number of calculations (n) for the second attention variable (A_n) has not reached the predetermined number (k), the electronic device 100 may increase the number of calculations (n) by 1 in operation 460. In operation 470, the electronic device 100 may calculate the second attention variables (A₂, …, A_k) based on the multimodal data (X) and the second emotion inference values (B₁-C₁, …, B_{k-1}-C_{k-1}). The processor 140 may recalculate the second attention variables (A₂, …, A_k) for each of the first single modal data, the second single modal data, and the third single modal data. To this end, the processor 140 may obtain weights (D₁, …, D_{k-1}) related to the user's voice, image, and data, respectively, from the second emotion inference values (B₁-C₁, …, B_{k-1}-C_{k-1}). Here, the processor 140 may determine the influence of each of the first single modal data, the second single modal data, and the third single modal data on the combined second emotion inference value (C₁, …, C_{k-1}), and may assign the weights (D₁, …, D_{k-1}) to the first single modal data, the second single modal data, and the third single modal data according to that influence. The processor 140 may then recalculate the second attention variables (A₂, …, A_k) by multiplying the first single modal data, the second single modal data, and the third single modal data by their corresponding weights (D₁, …, D_{k-1}). Thereafter, the electronic device 100 may return to operation 440.
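
The control flow of operations 410 to 470 can be sketched as a loop. To keep the focus on the flow of FIG. 4, the inference step and the reweighting step are passed in as callables; `refine_attention`, `infer`, and `reweight` are hypothetical names, and the toy callables in the usage lines are stand-ins rather than the networks of the embodiment.

```python
def refine_attention(X, A0, k, infer, reweight):
    """Follow operations 410-470 and return (A_k, C_k).

    infer(A)       -> emotion inference value obtained from attention variable A
    reweight(X, C) -> attention variable computed from data X and inference value C
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    C = infer(A0)            # operation 410: first emotion inference value
    n = 1                    # operation 420: set calculation count to 1
    A = reweight(X, C)       # operation 430: second attention variable A_1
    while True:
        C = infer(A)         # operation 440: second emotion inference value
        if n >= k:           # operation 450: has n reached k?
            return A, C
        n += 1               # operation 460: increase n by 1
        A = reweight(X, C)   # operation 470: recalculate A_n

# Toy stand-ins: inference is the mean, reweighting scales X by that mean.
toy_infer = lambda A: sum(A) / len(A)
toy_reweight = lambda X, C: [x * C for x in X]

A_k, C_k = refine_attention([1.0, 2.0], [1.0, 2.0], 2, toy_infer, toy_reweight)
```

With k = 2 the attention variable is calculated twice (operations 430 and 470), and the inference value returned is the one obtained from the last attention variable.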

In operation 340, the electronic device 100 may infer a result value. Here, the electronic device 100 may recognize the user's emotion as the result value. The electronic device 100 may recognize the user's emotion based on the second attention variable (A_k). Specifically, the processor 140 may recognize the user's emotion through the second emotion inference value (B_k-C_k) obtained from the finally calculated second attention variable (A_k).
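
As a minimal sketch of operation 340, the combined inference value C_k can be read as one score per emotion class, with the highest-scoring class taken as the recognized emotion. The description does not list the emotion classes or state the decision rule, so the label set and the argmax rule below are assumptions.

```python
# Hypothetical label set; the description does not enumerate emotion classes.
EMOTIONS = ("neutral", "happy", "sad", "angry")

def recognize_emotion(C_k, labels=EMOTIONS):
    """Return the label whose score in C_k is highest (assumed decision rule)."""
    if len(C_k) != len(labels):
        raise ValueError("C_k must have one score per emotion label")
    best = max(range(len(C_k)), key=lambda i: C_k[i])
    return labels[best]

emotion = recognize_emotion([0.1, 0.6, 0.2, 0.1])
```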

A method of operating the electronic device 100 according to various embodiments is performed based on a recurrent neural network with attention using multimodal data, and may include: detecting multimodal data related to at least two of a user's image, voice, or text; calculating a first attention variable based on the multimodal data; calculating a second attention variable based on the multimodal data and the first attention variable; and inferring a result value based on the second attention variable.

According to various embodiments, calculating the second attention variable may include obtaining a first emotion inference value based on the first attention variable, and calculating the second attention variable based on the multimodal data and the first emotion inference value.

According to various embodiments, inferring the result value may include recognizing the user's emotion based on the second attention variable.

According to various embodiments, calculating the second attention variable may further include obtaining a second emotion inference value based on the second attention variable, and recalculating the second attention variable based on the multimodal data and the second emotion inference value.

According to various embodiments, after the second attention variable is recalculated, the method may return to obtaining the second emotion inference value.

According to various embodiments, recalculating the second attention variable may be repeated a predetermined number of times.

According to various embodiments, recognizing the emotion may include obtaining the second emotion inference value based on the second attention variable, and recognizing the emotion through the second emotion inference value.

According to various embodiments, calculating the first attention variable may include analyzing the multimodal data to obtain first single modal data related to the voice, second single modal data related to the image, and third single modal data related to the text, and calculating the first attention variable for each of the first single modal data, the second single modal data, and the third single modal data.
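
One way to picture the split of multimodal data into single modal data is a small container type. The sketch below is an assumption about representation only: `MultimodalSample` and its fields are hypothetical, each modality is taken to be an already extracted feature vector, and the check mirrors the requirement that at least two of the three modalities be detected.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class MultimodalSample:
    """Hypothetical container for multimodal data X; None marks an undetected modality."""
    voice: Optional[Sequence[float]] = None
    image: Optional[Sequence[float]] = None
    text: Optional[Sequence[float]] = None

    def single_modal(self) -> dict:
        """Return the detected single modal data, keyed by modality name."""
        parts = {"voice": self.voice, "image": self.image, "text": self.text}
        present = {name: list(feats) for name, feats in parts.items() if feats is not None}
        if len(present) < 2:
            raise ValueError("multimodal data needs at least two modalities")
        return present

sample = MultimodalSample(voice=[0.2, 0.9], text=[0.4, 0.1])
single = sample.single_modal()  # voice and text only; image was not detected
```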

According to various embodiments, calculating the second attention variable may include obtaining weights respectively related to the voice, the image, and the data from the first emotion inference value, and calculating the second attention variable for each of the first single modal data, the second single modal data, and the third single modal data based on the first single modal data, the second single modal data, the third single modal data, and the weights.

According to various embodiments, recalculating the second attention variable may include obtaining weights respectively related to the voice, the image, and the data from the second emotion inference value, and calculating the second attention variable for each of the first single modal data, the second single modal data, and the third single modal data based on the first single modal data, the second single modal data, the third single modal data, and the weights.

According to various embodiments, the electronic device 100 may recognize a user's emotion from at least two of the user's voice, image, or text. Because the electronic device 100 repeatedly performs a specific operation based on the recurrent neural network with attention, the influence of each of the user's voice, image, and text can be taken into account in recognizing the user's emotion. For example, when at least one of the user's voice, image, or text is not received or is severely noisy, the electronic device 100 may reduce the influence of that voice, image, or text on the recognition of the user's emotion. As a result, the electronic device 100 may recognize the user's emotion more accurately and stably.
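
The effect described above, where a modality that was not received contributes less to the result, can be illustrated with a deliberately simple rule. In the embodiment the reduction comes out of the attention computation itself; the hard masking below (weight set to zero, remaining weights rescaled) is a simplifying assumption, and `mask_unavailable` and the example weights are hypothetical.

```python
def mask_unavailable(weights, available):
    """Zero the weight of modalities that were not received and rescale the rest."""
    kept = {name: (w if name in available else 0.0) for name, w in weights.items()}
    total = sum(kept.values())
    if total == 0.0:
        raise ValueError("no available modality carries any weight")
    return {name: w / total for name, w in kept.items()}

# Hypothetical modality weights; the image was not received in this example.
D = {"voice": 0.5, "image": 0.3, "text": 0.2}
D_masked = mask_unavailable(D, available={"voice", "text"})
```

A noisy but present modality would call for a soft reduction rather than a hard zero; the same function shape could accept per-modality reliability factors instead of a set.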

The various embodiments of this document and the terms used herein are not intended to limit the technology described in this document to specific embodiments, and should be understood to include various modifications, equivalents, and/or alternatives of the embodiments. In the description of the drawings, similar reference numerals may be used for similar components. A singular expression may include a plural expression unless the context clearly indicates otherwise. In this document, expressions such as "A or B", "at least one of A and/or B", "A, B, or C", or "at least one of A, B, and/or C" may include all possible combinations of the items listed together. Expressions such as "1st", "2nd", "first", or "second" may modify the corresponding components regardless of order or importance, and are used only to distinguish one component from another without limiting the components. When a certain (e.g., first) component is described as being "(functionally or communicatively) connected" or "coupled" to another (e.g., second) component, the certain component may be connected to the other component directly or through yet another component (e.g., a third component).

The term "module" as used in this document includes a unit composed of hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit. A module may be an integrally formed component, or a minimum unit or part thereof that performs one or more functions. For example, a module may be implemented as an application-specific integrated circuit (ASIC).

Various embodiments of this document may be implemented as software including one or more instructions stored in a storage medium (e.g., the memory 130) readable by a machine (e.g., the electronic device 100). For example, a processor of the machine (e.g., the processor 140) may call at least one of the one or more instructions stored in the storage medium and execute it. This enables the machine to be operated to perform at least one function according to the at least one instruction called. The one or more instructions may include code generated by a compiler or code executable by an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, "non-transitory" means only that the storage medium is a tangible device and does not include a signal (e.g., an electromagnetic wave); the term does not distinguish between data being stored semi-permanently in the storage medium and data being stored temporarily.

According to various embodiments, each of the described components (e.g., a module or a program) may include a single entity or a plurality of entities. According to various embodiments, one or more of the above-described components or operations may be omitted, or one or more other components or operations may be added. Alternatively or additionally, a plurality of components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each of the plurality of components in the same or a similar manner as they were performed by the corresponding component before the integration. According to various embodiments, operations performed by a module, a program, or another component may be executed sequentially, in parallel, repeatedly, or heuristically; one or more of the operations may be executed in a different order or omitted; or one or more other operations may be added.

Claims

A method of operating an electronic device based on a recurrent neural network with attention using multimodal data, the method comprising:
detecting multimodal data related to at least two of a user's image, voice, or text;
calculating a first attention variable based on the multimodal data;
calculating a second attention variable based on the multimodal data and the first attention variable; and
inferring a result value based on the second attention variable.

The method of claim 1, wherein calculating the second attention variable comprises:
obtaining a first emotion inference value based on the first attention variable; and
calculating the second attention variable based on the multimodal data and the first emotion inference value.

The method of claim 2, wherein inferring the result value comprises:
recognizing an emotion of the user based on the second attention variable.

The method of claim 3, wherein calculating the second attention variable further comprises:
obtaining a second emotion inference value based on the second attention variable; and
recalculating the second attention variable based on the multimodal data and the second emotion inference value.

The method of claim 4,
wherein, after the second attention variable is recalculated, the method returns to obtaining the second emotion inference value.

The method of claim 5, wherein recalculating the second attention variable
is repeated a predetermined number of times.

The method of claim 4, wherein recognizing the emotion comprises:
obtaining the second emotion inference value based on the second attention variable; and
recognizing the emotion through the second emotion inference value.

The method of claim 2, wherein calculating the first attention variable comprises:
analyzing the multimodal data to obtain first single modal data related to the voice, second single modal data related to the image, and third single modal data related to the text; and
calculating the first attention variable for each of the first single modal data, the second single modal data, and the third single modal data.

The method of claim 8, wherein calculating the second attention variable comprises:
obtaining weights respectively related to the voice, the image, and the data from the first emotion inference value; and
calculating the second attention variable for each of the first single modal data, the second single modal data, and the third single modal data, based on the first single modal data, the second single modal data, the third single modal data, and the weights.

The method of claim 8, wherein recalculating the second attention variable comprises:
obtaining weights respectively related to the voice, the image, and the data from the second emotion inference value; and
calculating the second attention variable for each of the first single modal data, the second single modal data, and the third single modal data, based on the first single modal data, the second single modal data, the third single modal data, and the weights.

An electronic device based on a recurrent neural network with attention using multimodal data, the electronic device comprising:
an input module; and
a processor connected to the input module,
wherein the processor is configured to:
detect, through the input module, multimodal data related to at least two of a user's image, voice, or text;
calculate a first attention variable based on the multimodal data;
calculate a second attention variable based on the multimodal data and the first attention variable; and
infer a result value based on the second attention variable.

The electronic device of claim 11, wherein the processor is configured to:
obtain a first emotion inference value based on the first attention variable; and
calculate the second attention variable based on the multimodal data and the first emotion inference value.

The electronic device of claim 12, wherein the processor is configured to
recognize an emotion of the user based on the second attention variable.

The electronic device of claim 13, wherein the processor is configured to:
obtain a second emotion inference value based on the second attention variable; and
recalculate the second attention variable based on the multimodal data and the second emotion inference value.

The electronic device of claim 14, wherein the processor is configured to
return to obtaining the second emotion inference value after recalculating the second attention variable.

The electronic device of claim 14, wherein the processor is configured to:
obtain the second emotion inference value based on the second attention variable; and
recognize the emotion through the second emotion inference value.