KR102573465B1 - Method and system for providing emotion correction during video chat

Info

Publication number: KR102573465B1
Application number: KR1020217035142A
Authority: KR
Inventors: 유에 닝 쿠; 유안 마; 이티안 우; 레이 양
Original assignee: 후아웨이 테크놀러지 컴퍼니 리미티드
Priority date: 2019-04-05
Filing date: 2019-04-05
Publication date: 2023-08-31
Also published as: JP2022528691A; CN113646838A; WO2020204948A1; EP3942552A1; JP7185072B2; KR20210146372A; CN113646838B

Abstract

Described herein are methods and subsystems that alter the video and audio signals of a person participating in a video chat to produce altered video and audio signals in which there is increased alignment between one or more of the person's perceived emotions and the person's semantic emotion. Such a method includes obtaining a video signal and an audio signal of a first person participating in a video chat with a second person, determining one or more types of perceived emotions of the first person based on the video signal, and determining a semantic emotion of the first person based on the audio signal. The method also includes altering the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.

Description

Method and system for providing emotion correction during video chat

The present disclosure generally relates to methods and systems for use during a video chat and, in specific embodiments, to methods and systems that alter the video and audio signals of a person participating in a video chat to produce altered video and audio signals in which there is increased alignment between one or more of the person's perceived emotions and the person's semantic emotion.

Drivers of automobiles and other types of vehicles often use their smartphones or other mobile computing devices to chat with other people while driving. Such a chat can be a voice chat or a video chat. For the purpose of this discussion, a voice chat refers to a communication that is audio only, meaning that the two people participating in the voice chat can hear one another but cannot see one another. By contrast, a video chat refers to a communication that includes both audio and video of the two people participating in the video chat, meaning that the two people can both hear one another and see one another. Videotelephony technologies, which provide for the reception and transmission of audio and video signals, can be used to perform video chats. Exemplary videotelephony products include FaceTime available from Apple Inc., Google Duo and Google Hangouts both available from Google LLC, Skype available from Microsoft Corp., and WeChat available from Tencent Corp., just to name a few. Indeed, one survey found that 10% of drivers indicated that they have used their smartphones to video chat while driving. That percentage is likely to increase in the future, especially as semi-autonomous and fully-autonomous vehicles become more common.

Road rage, which is aggressive or angry behavior exhibited by a driver of a vehicle, is very common. Indeed, surveys have found that a majority of drivers expressed significant anger while driving during the past year. Road rage can lead to numerous types of direct adverse effects. For example, for the driver of a vehicle and the passengers, road rage can lead to altercations, assaults, and collisions that result in serious physical injury or even death. Road rage can also lead to certain indirect adverse effects. For example, if a first person driving a first vehicle is participating in a video chat with a second person driving a second vehicle, and the first driver experiences road rage during the chat, the anger of the first person may be passed on to the second person and/or otherwise distract the second person, which may increase the likelihood that the second person is involved in a collision. As another example, if a first person driving a first vehicle is participating in a business-related video chat with one or more other persons, and the first driver experiences road rage during the chat, the business relationship between the first person and the one or more other persons may be ruined or otherwise adversely affected.

According to one aspect of the present disclosure, a method includes obtaining a video signal and an audio signal of a first person participating in a video chat with a second person, determining one or more types of perceived emotions of the first person based on the video signal, and determining a semantic emotion of the first person based on the audio signal. The method also includes altering the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.

Optionally, in any of the preceding aspects, determining the one or more types of perceived emotions of the first person based on the video signal includes detecting at least one of a facial expression or a body pose of the first person based on the video signal, and determining at least one of a facial expression perceived emotion or a body pose perceived emotion of the first person based on the at least one of the facial expression or the body pose of the first person.

Optionally, in any of the preceding aspects, determining the one or more types of perceived emotions of the first person is also based on the audio signal, and includes performing audio signal processing of the audio signal to determine at least one of a pitch, a vibrato, or an inflection of the first person's speech, and determining a speech perceived emotion of the first person based on results of the audio signal processing of the audio signal. Such a method can further include altering the audio signal to increase alignment between the speech perceived emotion of the first person and the semantic emotion of the first person.
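As an illustration of the kind of audio signal processing that could recover pitch from speech samples, the following is a minimal sketch only. The disclosure does not say how pitch is determined; the autocorrelation approach, the function name `estimate_pitch_hz`, and the 75 to 400 Hz search range are assumptions made for this example.

```python
import math


def estimate_pitch_hz(samples, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate a fundamental frequency by picking the autocorrelation peak.

    Hypothetical helper: the search range [fmin, fmax] is an assumed
    range for speech, not a value taken from the disclosure.
    Returns 0.0 when no periodicity is found.
    """
    if len(samples) < 2:
        return 0.0
    # Remove any constant offset so it does not dominate the correlation.
    mean = sum(samples) / len(samples)
    x = [s - mean for s in samples]

    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(x) - 1)

    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        score = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


if __name__ == "__main__":
    # Synthetic 200 Hz tone, 0.1 s at 8 kHz, standing in for a voiced frame.
    rate = 8000
    tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]
    print(estimate_pitch_hz(tone, rate))
```

A per-frame pitch track of this kind could feed a speech perceived emotion estimate; vibrato and inflection would need further analysis of how the pitch varies over time, which this sketch does not attempt.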

Optionally, in any of the preceding aspects, altering the video signal to produce an altered video signal includes modifying image data of the video signal corresponding to at least one of the facial expression or the body pose, and altering the audio signal to produce an altered audio signal includes modifying audio data of the audio signal corresponding to at least one of the pitch, the vibrato, or the inflection.

Optionally, in any of the preceding aspects, the method further includes providing the altered video signal and the altered audio signal to a subsystem associated with the second person participating in the video chat, thereby enabling the second person to see and hear images and audio of the first person having increased alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.

Optionally, in any of the preceding aspects, determining the semantic emotion of the first person based on the audio signal includes performing natural language processing of the audio signal, and determining the semantic emotion of the first person based on results of the natural language processing of the audio signal.
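To make the idea of deriving a semantic emotion from spoken words concrete, the following is a toy sketch only. It assumes the audio has already been transcribed to text by some speech recognizer, and it scores the words against a small lexicon. The lexicon, its values, and the function name `semantic_emotion` are hypothetical; the disclosure does not specify which natural language processing technique is used. The output is a pair of positiveness and activeness scores, the two quantities the disclosure uses elsewhere when quantifying language.

```python
# Hypothetical toy lexicon: word -> (positiveness, activeness), each in [-1, 1].
# The entries are illustrative values, not data from the disclosure.
LEXICON = {
    "great": (0.8, 0.5),
    "wonderful": (0.9, 0.4),
    "fine": (0.4, -0.2),
    "terrible": (-0.8, 0.5),
    "angry": (-0.7, 0.8),
    "tired": (-0.3, -0.7),
}


def semantic_emotion(transcript):
    """Average the lexicon scores of the known words in a transcript.

    Returns (positiveness, activeness); (0.0, 0.0) when no word is known.
    """
    words = [w.strip(".,!?;:\"'").lower() for w in transcript.split()]
    hits = [LEXICON[w] for w in words if w in LEXICON]
    if not hits:
        return (0.0, 0.0)
    n = len(hits)
    return (sum(p for p, _ in hits) / n, sum(a for _, a in hits) / n)


if __name__ == "__main__":
    print(semantic_emotion("I am having a wonderful day!"))
```

A word-averaging lexicon ignores negation and context ("not great" would score as positive), which is one reason a practical system would likely use a trained language model instead.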

Optionally, in any of the preceding aspects, determining the one or more types of perceived emotions of the first person based on the video signal includes at least one of using a facial circumplex model to quantify a positiveness and an activeness of a facial expression of the first person based on the video signal, or using a pose circumplex model to quantify a positiveness and an activeness of a body pose of the first person based on the video signal. Further, determining the semantic emotion of the first person based on the audio signal includes using a language circumplex model to quantify a positiveness and an activeness of the language of the first person based on the audio signal. Additionally, altering the video signal to produce an altered video signal includes at least one of altering image data of the video signal to reduce a distance between the positiveness and activeness of the facial expression of the first person and the positiveness and activeness of the language of the first person, or altering image data of the video signal to reduce a distance between the positiveness and activeness of the body pose of the first person and the positiveness and activeness of the language of the first person.
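The distance-reduction idea can be sketched numerically, under the assumption that each circumplex model yields a point with a positiveness coordinate and an activeness coordinate, and that "distance" means Euclidean distance between two such points. The class name, the [-1, 1] ranges, and the `fraction` parameter are illustrative assumptions; the sketch computes a target emotion only and does not perform any image editing.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CircumplexPoint:
    positiveness: float  # assumed range [-1, 1], negative = unpleasant
    activeness: float    # assumed range [-1, 1], negative = passive


def distance(a, b):
    """Euclidean distance between two points on the circumplex plane."""
    return math.hypot(a.positiveness - b.positiveness,
                      a.activeness - b.activeness)


def move_toward(perceived, semantic, fraction):
    """Return a target point `fraction` of the way from perceived to semantic.

    fraction=0.0 leaves the perceived emotion unchanged; fraction=1.0
    makes it coincide with the semantic emotion.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return CircumplexPoint(
        perceived.positiveness
        + fraction * (semantic.positiveness - perceived.positiveness),
        perceived.activeness
        + fraction * (semantic.activeness - perceived.activeness),
    )


if __name__ == "__main__":
    face = CircumplexPoint(-0.6, 0.8)     # e.g. an angry-looking expression
    language = CircumplexPoint(0.6, 0.0)  # e.g. pleasant, calm wording
    target = move_toward(face, language, 0.5)
    print(distance(face, language), distance(target, language))
```

The target point would then guide whatever image-modification step changes the rendered facial expression or body pose; moving halfway, as in the usage above, halves the distance to the language point.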

Optionally, in any of the preceding aspects, determining the one or more types of perceived emotions of the first person is also based on the audio signal, and includes using a speech circumplex model to quantify a positiveness and an activeness of the speech of the first person based on the audio signal. The method can further include altering audio data of the audio signal to produce an altered audio signal, in order to reduce a distance between the positiveness and activeness of the speech of the first person and the positiveness and activeness of the language of the first person.

Optionally, in any of the preceding aspects, the method further includes providing the altered video signal and the altered audio signal to a subsystem associated with the second person participating in the video chat, thereby enabling the second person to see and hear images and audio of the first person having increased alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.

According to another aspect of the present disclosure, a subsystem includes one or more interfaces and one or more processors. The one or more interfaces are configured to receive a video signal and an audio signal of a first person participating in a video chat with a second person. The one or more processors are communicatively coupled to the one or more interfaces and are configured to determine one or more types of perceived emotions of the first person based on the video signal, and to determine a semantic emotion of the first person based on the audio signal. The one or more processors are also configured to alter the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person. The subsystem can also include one or more cameras configured to obtain the video signal and one or more microphones configured to obtain the audio signal.

Optionally, in any of the preceding aspects, the one or more processors implement one or more neural networks configured to determine the perceived emotions of the first person based on the video signal and to determine the semantic emotion of the first person based on the audio signal.

Optionally, in any of the preceding aspects, the one or more processors implement one or more neural networks configured to alter the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.

Optionally, in any of the preceding aspects, in order to determine the one or more types of perceived emotions of the first person based on the video signal, the one or more processors are configured to detect at least one of a facial expression or a body pose of the first person based on the video signal, and to determine at least one of a facial expression perceived emotion or a body pose perceived emotion of the first person based on the at least one of the facial expression or the body pose of the first person.

Optionally, in any of the preceding aspects, the one or more processors are also configured to perform audio signal processing of the audio signal to determine at least one of a pitch, a vibrato, or an inflection of the first person's speech, to determine a speech perceived emotion of the first person based on results of the audio signal processing, and to alter the audio signal to increase alignment between the speech perceived emotion of the first person and the semantic emotion of the first person.

Optionally, in any of the preceding aspects, the one or more processors are configured to modify image data of the video signal corresponding to at least one of the facial expression or the body pose, to thereby alter the video signal to produce an altered video signal, and to modify audio data of the audio signal corresponding to at least one of the pitch, the vibrato, or the inflection, to thereby alter the audio signal to produce an altered audio signal.

Optionally, in any of the preceding aspects, the one or more processors are configured to perform natural language processing of the audio signal, and to determine the semantic emotion of the first person based on results of the natural language processing of the audio signal.

Optionally, in any of the preceding aspects, the one or more processors are configured to use a facial circumplex model to quantify a positiveness and an activeness of a facial expression of the first person based on the video signal, to use a pose circumplex model to quantify a positiveness and an activeness of a body pose of the first person based on the video signal, and to use a language circumplex model to quantify a positiveness and an activeness of the language of the first person based on the audio signal. Additionally, the one or more processors are configured to alter image data of the video signal to reduce a distance between the positiveness and activeness of the facial expression of the first person and the positiveness and activeness of the language of the first person, and to reduce a distance between the positiveness and activeness of the body pose of the first person and the positiveness and activeness of the language of the first person.

Optionally, in any of the preceding aspects, the one or more processors are also configured to use a speech circumplex model to quantify a positiveness and an activeness of the speech of the first person based on the audio signal, and to alter audio data of the audio signal to reduce a distance between the positiveness and activeness of the speech of the first person and the positiveness and activeness of the language of the first person.

Optionally, in any of the preceding aspects, the subsystem includes a transmitter configured to transmit the altered video signal and the altered audio signal to a subsystem associated with the second person participating in the video chat, thereby enabling the second person to see and hear video and audio of the first person having increased alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.

According to another aspect of the present disclosure, a non-transitory computer-readable medium stores computer instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: obtaining a video signal and an audio signal of a first person participating in a video chat with a second person; determining one or more types of perceived emotions of the first person based on the video signal; determining a semantic emotion of the first person based on the audio signal; and altering the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person. The non-transitory computer-readable medium can also store computer instructions that, when executed by the one or more processors, cause the one or more processors to perform additional steps of the methods summarized above and described in further detail below.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the Background.

Aspects of the present disclosure are illustrated by way of example and are not limited by the accompanying figures, in which like references indicate like elements.
FIG. 1 shows an exemplary system that enables a first person and a second person to participate in a video chat.
FIGS. 2, 3, and 4 show systems, according to various embodiments of the present technology, that enable first and second persons to participate in a video chat and that also modify the audio and video of at least the first person, so that the audio and video of the first person heard and seen by the second person differ from the actual audio and video of the first person.
FIG. 5 shows a modification subsystem, according to an embodiment of the present technology, that can be used to modify the audio and video signals of a person participating in a video chat.
FIG. 6 shows additional details of an emotion detector and an emotion modifier of the modification subsystem introduced in FIG. 5.
FIG. 7A shows a general circumplex model.
FIG. 7B shows a facial circumplex model.
FIG. 7C shows a pose circumplex model.
FIG. 7D shows a speech circumplex model.
FIG. 8 shows how different types of perceived emotions and a semantic emotion can be mapped to a circumplex model, how distances between the perceived emotions and the semantic emotion can be determined, and how such distances can be reduced to increase alignment between the different types of perceived emotions and the semantic emotion.
FIG. 9 shows a high-level flow diagram that explains how the distances between perceived emotions and a semantic emotion can be used to determine whether to modify certain features of the video and audio signals in order to increase alignment between the perceived emotions and the semantic emotion.
FIG. 10 shows a high-level flow diagram that is used to summarize methods according to certain embodiments of the present technology.
FIG. 11 shows exemplary components of an exemplary mobile computing device with which embodiments of the present technology can be used.

Certain embodiments of the present technology alter the video and audio signals of a first person participating in a video chat with a second person, such that when the altered signals are played for the second person, what the second person sees and hears differs from the originally captured video and audio signals. Where the first person and the second person are participating in a video chat while both are driving vehicles, such embodiments of the present technology can prevent the anger of the first person from being passed on to the second person if the first person experiences road rage during the video chat. Where a first person driving a vehicle is participating in a business-related video chat with one or more other persons, such embodiments of the present technology can prevent the anger of the first person from being witnessed by the other person(s), thereby keeping the business relationship between the first person and the one or more other persons from being ruined or otherwise adversely affected. According to certain embodiments, described in more detail below, one or more types of perceived emotions of the first person can be determined based on a video signal (and potentially also an audio signal) of the first person, and a semantic emotion of the first person can be determined based on the audio signal of the first person. The video and audio signals of the first person can thereafter be modified (also referred to as altered) such that the resulting modified video and audio of the first person are more aligned with the semantic emotion of the first person than with the perceived emotion(s) of the first person. More specifically, the video and audio signals are modified to reduce differences between the one or more types of perceived emotions of the person and the semantic emotion of the person.
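One way to picture "reducing differences" across several types of perceived emotion is to measure how far each one lies from the semantic emotion and flag only those that are far enough apart to warrant modification. The sketch below assumes each emotion is expressed as a (positiveness, activeness) pair, as in the circumplex models the disclosure describes; the channel names, the Euclidean distance, and the 0.5 threshold are hypothetical choices made for illustration.

```python
import math


def misaligned_channels(perceived, semantic, threshold=0.5):
    """Return the perceived-emotion channels that lie farther than
    `threshold` from the semantic emotion.

    `perceived` maps a channel name to (positiveness, activeness);
    `semantic` is a single (positiveness, activeness) pair.
    The threshold value is an assumption, not taken from the disclosure.
    """
    flagged = []
    for name, (pos, act) in perceived.items():
        gap = math.hypot(pos - semantic[0], act - semantic[1])
        if gap > threshold:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # A driver who says pleasant things while looking and sitting tensely.
    perceived = {
        "face": (-0.6, 0.7),
        "pose": (-0.4, 0.3),
        "speech": (0.5, 0.2),
    }
    semantic = (0.7, 0.3)
    print(misaligned_channels(perceived, semantic))
```

With these illustrative numbers the face and pose channels are flagged while the speech channel is left alone, so only the video signal would be altered in that case.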

Perceived emotion, as the term is used herein, generally relates to a first person's emotional state that a second person becomes aware of using the second person's senses, e.g., using the second person's sight and hearing. By contrast, semantic emotion, as the term is used herein, generally relates to a first person's emotional state that a second person becomes aware of using the second person's understanding of the verbal language spoken by the first person (also referred to as spoken language, or more succinctly as language). There are many times when perceived emotion is substantially aligned with semantic emotion, e.g., if the first person is smiling and has positive body language while saying, during a conversation with the second person, that he is having a great day. However, there are other times when perceived emotion and semantic emotion are significantly unaligned, e.g., if the first person is frowning and has negative body language (e.g., looking down and crossing his arms) while saying, during a conversation with the second person, that he is having a great day. According to certain embodiments of the present technology, if the first person is frowning and has negative body language (e.g., looking down and crossing his arms) while saying, during a video chat with the second person, that he is having a great day, the video of the first person is altered such that, when the video is played for the second person, the first person's body language has been changed from negative body language to positive body language, so that the first person's body language is more aligned with the positive spoken language he used. Additionally, the audio of the first person may also be altered, e.g., to change the pitch, vibrato, and/or inflection of the first person's voice to be more aligned with the positive spoken language he used.

FIG. 1 shows an exemplary system that enables first and second persons to participate in a video chat. In FIG. 1, blocks 110A and 110B represent first and second persons who are participating in a video chat using respective client computing devices, which are also referred to herein more generally as audio-video (A-V) subsystems 120A and 120B. The A-V subsystems 120A and 120B may be referred to collectively as the A-V subsystems 120, or individually as an A-V subsystem 120. The first and second persons 110A and 110B may be referred to collectively as the persons 110, or individually as a person 110. The A-V subsystem 120A is capable of obtaining a video signal and an audio signal of the first person 110A, and the A-V subsystem 120B is capable of obtaining a video signal and an audio signal of the second person 110B. Accordingly, each of the A-V subsystems 120 may include at least one microphone that is used to obtain an audio signal, and at least one camera that is used to obtain a video signal. The at least one camera may be an RGB/NIR (near infrared) camera that includes an image sensor (e.g., a CMOS image sensor) that can be used to capture multiple two-dimensional RGB/NIR images per second (e.g., 30 images per second). At least one further camera may be a depth camera that produces depth images, rather than RGB/NIR images, e.g., using structured light and/or time-of-flight (TOF) sensors to recreate a 3D structure on a point cloud, or the like.

Additionally, the A-V subsystem 120A may be capable of playing video and audio of the second person (e.g., 110B) for the first person 110A, and the A-V subsystem 120B may be capable of playing video and audio of the first person (e.g., 110A) for the second person 110B. Accordingly, each of the A-V subsystems 120 may include at least one audio speaker that is used to output audible sounds, and at least one display that is used to display video images. One or both of the A-V subsystems 120A and 120B may be an in-cabin computer system or a mobile computing device, such as, but not limited to, a smartphone, a tablet computer, a notebook computer, a laptop computer, or the like. It is also possible that one or both of the A-V subsystems 120A and 120B, or portions thereof, include a microphone, a camera, an audio speaker, and/or a display that is built into a vehicle, e.g., as part of the vehicle's entertainment system.

When the first and second persons 110A and 110B are participating in a video chat using their respective A-V subsystems 120A and 120B, the at least one microphone of the A-V subsystem 120A obtains an audio signal of the first person 110A, and the at least one camera of the A-V subsystem 120A obtains a video signal of the first person 110A. Similarly, the at least one microphone of the A-V subsystem 120B obtains an audio signal of the second person 110B, and the at least one camera of the A-V subsystem 120B obtains a video signal of the second person 110B. The audio and video signals of the first person 110A, obtained by the A-V subsystem 120A, are sent via one or more communication networks 130 to the A-V subsystem 120B. Similarly, the audio and video signals of the second person 110B, obtained by the A-V subsystem 120B, are sent via the one or more communication networks 130 to the A-V subsystem 120A.

The communication network(s) 130 can be any wired or wireless local area network (LAN) and/or wide area network (WAN), such as, but not limited to, an intranet, an extranet, or the Internet, or a combination thereof. It is sufficient that the communication network(s) 130 provide communication capability between the A-V subsystems 120 and, optionally, other devices and systems. In some implementations, the communication network(s) 130 use the Hypertext Transfer Protocol (HTTP) to transport information over the Transmission Control Protocol/Internet Protocol (TCP/IP). HTTP permits the A-V subsystems 120 to access various resources available via the communication network(s) 130. However, the various implementations described herein are not limited to the use of any particular protocol.

The at least one audio speaker of the A-V subsystem 120A uses the audio signal of the second person 110B to output audible sounds of the second person 110B (e.g., words spoken by the second person) that the first person 110A can hear. The at least one display of the A-V subsystem 120A uses the video signal of the second person 110B to display video images of the second person 110B that the first person 110A can see. Similarly, the at least one audio speaker of the A-V subsystem 120B uses the audio signal of the first person 110A to output audible sounds of the first person 110A (e.g., words spoken by the first person) that the second person 110B can hear. The at least one display of the A-V subsystem 120B uses the video signal of the first person 110A to display video images of the first person 110A that the second person 110B can see.

Conventionally, unmodified versions of the audio and video signals of the first person 110A (obtained by the A-V subsystem 120A) are used to output and display the audio and video of the first person 110A to the second person 110B (using the A-V subsystem 120B that is in proximity to the second person 110B). Thus, if the first person 110A has an angry facial expression (e.g., a furrowed brow), an angry body pose (e.g., a clenched, raised fist), and an angry (e.g., high) tone of voice while participating in a video chat with the second person 110B, the second person 110B will see the angry facial expression and the angry body pose of the first person 110A and will hear the angry tone of the first person 110A. Note that the term body pose, as used herein, also encompasses hand pose.

According to certain embodiments of the present technology, the audio and video signals of the first person 110A are modified before they are provided to the A-V subsystem 120B, with the result that the audio and video of the first person 110A that the second person 110B hears and sees differ from how the first person 110A actually looks and sounds. Such modifications to the audio and video signals of the first person 110A can be performed by the same A-V subsystem that obtains the audio and video signals. More specifically, as shown in FIG. 2, an A-V and modification subsystem 220A can obtain the audio and video signals of the first person and modify such signals before providing them to the communication network(s) 130, which provide the modified audio and video signals to the A-V subsystem 120B that is in proximity to the second person 110B. Alternatively, such modifications to the audio and video signals of the first person 110A can be performed by a further subsystem that is separate from the A-V subsystem 120A that obtains the audio and video signals of the first person 110A. For example, as shown in FIG. 3, a modification subsystem 320A can receive the audio and video signals of the first person 110A and modify such signals before providing them to the communication network(s) 130, which provide the modified audio and video signals to the A-V subsystem 120B that is in proximity to the second person 110B. Another option is for the A-V subsystem 120A (which obtains the audio and video signals of the first person 110A) to provide the audio and video signals of the first person 110A via the one or more communication networks 130 to a modification subsystem 420A; after the modification subsystem 420A modifies such signals, it can provide the modified audio and video signals of the first person 110A via the communication network(s) 130 to the A-V subsystem 120B that is in proximity to the second person 110B. Other variations are also possible and within the scope of the embodiments described herein. Although not shown in FIGS. 1 to 4, the video and audio signals of the second person 110B can also be provided to a similar modification subsystem that modifies such signals so that the perceived emotion of the second person 110B appears more aligned with the semantic emotion of the second person 110B.

The audio and video signals of the first person 110A, captured or otherwise obtained by the A-V subsystem 120A (in FIGS. 1, 3 and 4) or by the A-V and modification subsystem 220A (in FIG. 2), can also be referred to as the captured audio and video signals of the first person 110A. FIG. 5 shows a modification subsystem 520 that either receives the captured audio and video signals from the A-V subsystem 120A or is part of the A-V and modification subsystem 220A. As shown in FIG. 5, the modification subsystem 520 includes an emotion detection block 530 (which can also be referred to as an emotion detector 530) and an emotion modification block 540 (which can also be referred to as an emotion modifier 540). The emotion detector 530 can detect, for example, negative, positive, and/or neutral emotions of the first person 110A. Exemplary negative emotions include, but are not limited to, anger, nervousness, distraction, and disappointment. The emotion modifier 540 can, for example, modify the audio and video signals so that one or more types of perceived emotion of the first person 110A in the modified audio and video signals are neutral or positive emotions. Exemplary neutral or positive emotions include, but are not limited to, happiness, calmness, alertness, and pleasure. Additional details of the emotion detector 530 and the emotion modifier 540, according to specific embodiments of the present technology, are discussed below with reference to FIG. 6.

Referring to FIG. 6, the emotion detector 530 is shown as including a face detection block 610 (also referred to as a face detector 610) and a facial expression recognition block 612 (also referred to as a facial expression recognizer 612). The emotion detector 530 is also shown as including a skeleton detection block 614 (also referred to as a skeleton detector 614) and a pose recognition block 616 (also referred to as a pose recognizer 616). As shown in FIG. 6, the face detector 610 and the skeleton detector 614 receive a video signal 602. The video signal 602 can be, e.g., a video signal of the first person 110A captured by the A-V subsystem 120A, and more specifically, by one or more cameras thereof. Still referring to FIG. 6, the emotion detector 530 is also shown as including an audio signal processing block 624 (also referred to as an audio signal processor 624 or an audio signal analyzer 624) and a natural language processing block 626 (also referred to as a natural language processor 626 or a natural language analyzer 626). As shown in FIG. 6, the audio signal analyzer 624 and the natural language analyzer 626 receive an audio signal 622. The audio signal 622 can be, e.g., an audio signal of the first person 110A captured by the A-V subsystem 120A, or more specifically, by a microphone thereof. The video signal 602 and the audio signal 622 are assumed to be digital signals, unless specifically stated otherwise. Interfaces 603 and 623 can receive the video signal 602 and the audio signal 622, e.g., from a camera and a microphone, respectively, or from one or more other subsystems.

According to certain embodiments, the face detector 610 can detect a person's face within an image, and can also detect facial features within the image. Already developed (or future developed) computer vision techniques can be used by the face detector 610 to detect such facial features. For example, the HSV (Hue-Saturation-Value) color model or some other computer vision technique can be used to detect a face within an image. A feature detection model or some other computer vision technique can be used to identify facial features, such as, but not limited to, the eyes, nose, lips, chin, cheeks, eyebrows, forehead, and/or the like. Feature detection can also be used to detect wrinkles in specific facial regions, such as on the forehead, at the sides of the mouth, and/or around the eyes. In certain embodiments, a person's face and facial features can be identified using bounding boxes. Some features to be identified can be contained within other features, such as the eyes on a user's face, in which case successive bounding boxes can be used to first identify the containing feature (e.g., the face) and then identify the contained feature (e.g., each eye of a pair of eyes). In other embodiments, a single bounding box can be used to identify each distinct feature. In certain embodiments, one or more algorithm libraries, such as the OpenCV (http://opencv.willowgarage.com/wiki/) computer vision library and/or the Dlib algorithm library (http://dlib.net/), can be used to identify these facial features and to generate the bounding boxes. In certain embodiments, a bounding box need not be rectangular, but rather can be another shape, such as, but not limited to, an ellipse. In certain embodiments, a machine learning technique, such as boosting, can be used to increase the confidence level in the detection of facial features (e.g., eyes, nose, lips, etc.). More generally, data sets can be used to train a deep neural network (DNN) and/or other computer model to detect facial features from images, and the trained DNN and/or other computer model can thereafter be used for facial feature recognition.
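The successive bounding-box idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation described in this document: the `Box` type, the `nest` helper, the feature labels, and the pixel coordinates are all hypothetical, and in practice the boxes would come from a detector such as those provided by OpenCV or Dlib.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    """Axis-aligned bounding box for one detected feature (hypothetical type)."""
    label: str
    x: int  # left edge, in pixels
    y: int  # top edge, in pixels
    w: int  # width, in pixels
    h: int  # height, in pixels
    children: List["Box"] = field(default_factory=list)

    def contains(self, other: "Box") -> bool:
        """Return True if `other` lies entirely inside this box."""
        return (self.x <= other.x and self.y <= other.y
                and other.x + other.w <= self.x + self.w
                and other.y + other.h <= self.y + self.h)


def nest(containing: Box, candidates: List[Box]) -> Box:
    """Attach to `containing` only the candidate boxes that fall inside it."""
    containing.children = [c for c in candidates if containing.contains(c)]
    return containing


# Made-up detections: the face is identified first, then the features inside it.
face = Box("face", 100, 80, 200, 240)
detections = [
    Box("left_eye", 140, 150, 40, 20),
    Box("right_eye", 220, 150, 40, 20),
    Box("mouth", 170, 250, 60, 30),
    Box("background_blob", 10, 10, 30, 30),  # outside the face, so discarded
]
face = nest(face, detections)
feature_labels = [child.label for child in face.children]
```

Restricting the search for contained features to the containing box is one plausible reason to use successive boxes: a candidate that falls outside the face can be rejected without further analysis.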

Once facial features have been identified (also referred to as detected) by the face detector 610, the facial expression recognizer 612 can determine the person's facial expression. In general, a human face consists of different parts, such as the chin, mouth, eyes and nose, as noted above. The shape, structure and size of those facial features can vary with different facial expressions. Additionally, with certain facial expressions, wrinkles at specific facial locations may change. For example, the shapes of a person's eyes and mouth can be used to distinguish between different facial expressions, as can wrinkles on the person's forehead, and/or the like. Based at least in part on the detected facial expression of a person, one or more types of perceived emotion of the person can be determined by the perceived emotion detector 632 in FIG. 6. Exemplary perceived emotions that may be detected based at least in part on detected facial expressions include, but are not limited to, happiness, calmness, alertness, and pleasure, as well as anger, nervousness, distraction, and disappointment. Certain techniques for quantifying perceived emotions are described below.

The skeleton detector 614 can use a skeletal detection model or some other computer vision technique to identify human body parts and joints, such as, but not limited to, the arms, hands, elbows, wrists, and/or the like. The pose recognizer 616 can detect specific poses, such as whether a person is holding the steering wheel of a vehicle with both hands while driving the vehicle, or whether the person has raised one arm with the hand clenched into a fist while driving the vehicle. Data sets can be used to train a deep neural network (DNN) and/or other computer model to detect human poses from images, and the trained DNN and/or other computer model can thereafter be used for pose recognition.

Once human body parts have been detected within an image by the skeleton detector 614, the pose recognizer 616 can determine the person's pose. In general, a human body consists of different parts, such as the head, neck, torso, upper arms, elbows, forearms, wrists, hands, and so on. With different poses, the overall and relative positions and orientations of such body parts may change. For example, while a person is driving a vehicle, the person will often have both hands on the steering wheel, but may raise one arm and make a fist if the person becomes angry, e.g., because the driver of another vehicle caused the person to stop short, swerve, and/or the like. As can be appreciated from FIG. 6, a detected pose can also be used to determine a perceived emotion of the person, as represented by the line from the pose recognizer 616 to the perceived emotion detector 632.

As noted above, in FIG. 6 the audio signal analyzer 624 and the natural language analyzer 626 receive the audio signal 622. The audio signal 622 can be, e.g., an audio signal of the first person 110A captured by the A-V subsystem 120A. The audio signal analyzer 624 can analyze the audio signal 622 to detect various features of the audio signal 622 that may vary with the emotional state of the person. Examples of such audio features include pitch, vibrato and inflection. Pitch relates to the frequency of a signal, and thus can be quantified as a frequency. Changes in the pitch of a person's voice are often correlated with the person's arousal state, or more generally, emotional state. For example, an increase in pitch often correlates with highly aroused states such as anger, joy or fear, while a decrease in pitch often correlates with low arousal states such as sadness or calmness. Vibrato is a periodic modulation of the pitch (e.g., the fundamental frequency) of a person's voice, occurring with a given rate and depth. Vibrato is also related to jitter, and it often correlates with changes in emotion. Increased fluctuation in pitch, and thus in vibrato, can be indicative of, e.g., an increase in happiness, distress or fear. Inflection is a rapid modification of the pitch at the start of each utterance, which overshoots its target by several semitones but quickly decays to the normal value. The use of inflection leads to increased variation in pitch, which is associated with high emotional intensity and positive valence. As can be appreciated from FIG. 6, the results of the audio signal analysis performed by the audio signal analyzer 624 can also be used to determine a perceived emotion of the person, as represented by the line from the audio signal analyzer 624 to the perceived emotion detector 632. As can be appreciated from the above discussion, a given change in a specific audio feature can be indicative of an increase in either a positive emotion (e.g., happiness) or a negative emotion (e.g., anger). For example, an increase in happiness and an increase in fear can both cause an increase in pitch. However, by analyzing multiple vocal features, alone or in combination with facial expressions and/or body poses, it is possible to determine a relatively accurate perceived emotion of the person.
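Because pitch can be quantified as a frequency, a short Python sketch may help make that concrete. The document does not say how the audio signal analyzer 624 measures pitch; the autocorrelation method below is a common textbook approach substituted here for illustration, and the function name, search range, and synthetic test tone are assumptions.

```python
import math
from typing import List


def estimate_pitch_hz(samples: List[float], sample_rate: int,
                      fmin: float = 50.0, fmax: float = 500.0) -> float:
    """Estimate the fundamental frequency from the autocorrelation peak.

    Only lags that correspond to frequencies between fmin and fmax are tried.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the signal with itself at this lag.
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic stand-in for a voiced frame: a pure 200 Hz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(1024)]
pitch = estimate_pitch_hz(tone, SAMPLE_RATE)
```

Tracking such an estimate frame by frame would give a pitch contour from which rises, falls, and periodic modulation could be read. Real speech is noisier than a pure tone, so a practical analyzer would likely need windowing, voicing detection, and a more robust estimator.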

The natural language analyzer 626 performs natural language processing (NLP) of the audio signal 622, the results of which are used to determine the semantic emotion of the person, as represented by the line from the natural language analyzer 626 to the semantic emotion detector 634. The NLP performed by the natural language analyzer 626 can include speech recognition, which provides a textual representation of the person's speech. In natural speech there are hardly any pauses between successive words, and thus speech segmentation, which involves separating a sound clip of a person into multiple words, can be a subtask of speech recognition. The natural language analyzer 626 can be configured to recognize a single language, or multiple different languages, such as English, Chinese, Spanish, French, and German, to name just a few. Where the natural language analyzer 626 is capable of performing NLP for multiple different languages, the output of the natural language analyzer 626 can include an indication of the specific language that the person is speaking.

The perceived emotion detector 632 can use one or more look-up tables (LUTs) to determine one or more types of perceived emotion associated with a person based on the outputs of the facial expression recognizer 612, the pose recognizer 616, and the audio signal analyzer 624. The output of the facial expression recognizer 612 can specify one or more facial expression features of the person determined based on the video signal 602 of the person, the output of the pose recognizer 616 can specify one or more body poses of the person determined based on the video signal 602 of the person, and the output of the audio signal analyzer 624 can specify one or more audio features determined based on the audio signal 622. Instead of, or in addition to, using LUTs, the perceived emotion detector 632 can be implemented by one or more DNNs and/or one or more other computer models that are trained based on perceived emotion training data, which can include facial expression training data, body pose training data, speech training data, and/or other perceived emotion training data.
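A look-up table of the kind mentioned for the perceived emotion detector 632 could be sketched as below. The table keys, the feature vocabulary, and the emotion labels are invented for illustration; the document does not specify the contents or the structure of its LUTs.

```python
from typing import Dict, List, Tuple

# Hypothetical LUT keyed by (facial expression, body pose, audio feature).
# Each entry lists the perceived emotion types assumed for that combination.
PERCEIVED_EMOTION_LUT: Dict[Tuple[str, str, str], List[str]] = {
    ("smile", "open_arms", "raised_pitch"): ["happy", "pleased"],
    ("frown", "crossed_arms", "lowered_pitch"): ["disappointed"],
    ("furrowed_brow", "raised_fist", "raised_pitch"): ["angry"],
    ("neutral", "relaxed", "steady_pitch"): ["calm"],
}


def perceived_emotions(expression: str, pose: str, audio: str) -> List[str]:
    """Look up perceived emotion types; return ['unknown'] for unseen keys."""
    return PERCEIVED_EMOTION_LUT.get((expression, pose, audio), ["unknown"])


result = perceived_emotions("furrowed_brow", "raised_fist", "raised_pitch")
```

Note that "raised_pitch" appears in both a positive and a negative entry, which mirrors the point that a single audio feature is ambiguous until it is combined with facial expression and pose. An exact-match table like this one cannot generalize to unseen combinations, which is one motivation for the trained-model alternative.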

The semantic emotion detector 634 can use one or more look-up tables (LUTs) to determine a semantic emotion associated with a person based on the output of the natural language analyzer 626. The output of the natural language analyzer 626 can specify the words and sentences spoken by the person, as determined based on the audio signal 622, and can also indicate the language being spoken. Instead of, or in addition to, using LUTs, the semantic emotion detector 634 can be implemented by one or more DNNs and/or other computer models that are trained based on semantic emotion training data.
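For the semantic side, a word-level look-up table gives a rough idea of how recognized text could be mapped to an emotion. This is a deliberately naive sketch: the lexicon, the valence values, the thresholds, and the three output labels are all assumptions, and a realistic detector would have to handle negation, context, and multiple languages.

```python
from typing import Dict

# Hypothetical word-level LUT: word -> valence in the range [-1.0, 1.0].
SEMANTIC_LUT: Dict[str, float] = {
    "wonderful": 1.0, "great": 0.8, "happy": 0.8,
    "terrible": -1.0, "angry": -0.8, "sad": -0.6,
}


def semantic_emotion(transcript: str) -> str:
    """Classify a transcript as positive, negative, or neutral.

    Averages the valence of the words found in the LUT; words that are not
    in the LUT are ignored.
    """
    words = [w.strip(".,!?").lower() for w in transcript.split()]
    scores = [SEMANTIC_LUT[w] for w in words if w in SEMANTIC_LUT]
    if not scores:
        return "neutral"
    mean = sum(scores) / len(scores)
    if mean > 0.2:
        return "positive"
    if mean < -0.2:
        return "negative"
    return "neutral"


label = semantic_emotion("I am having a wonderful day!")
```

The transcript here stands in for the textual output of the natural language analyzer 626.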

Still referring to FIG. 6, the outputs of the perceived emotion detector 632 and the semantic emotion detector 634 are shown as being provided to an emotion modification block 540 (which can also be referred to as the emotion modifier 540). The emotion modifier 540 is also shown as receiving the captured video signal 602 and the captured audio signal 622. The emotion modifier 540 is shown as including a facial expression modification block 642, a pose modification block 646, and an audio modification block 648, which can also be referred to, respectively, as the facial expression modifier 642, the pose modifier 646, and the audio modifier 648. As noted above, the perceived emotion detector 632 can determine one or more types of perceived emotion of a person based on a detected facial expression and a detected body pose (both determined from the video signal 602) and on detected audio features such as pitch, vibrato, and inflection (determined from the audio signal 622). Also as noted above, the semantic emotion detector 634 determines the person's semantic emotion from their spoken language using NLP.

According to certain embodiments of the present technology, the facial expression modifier 642 modifies the facial expression image data of the video signal 602 to increase the alignment between the person's facial expression perceived emotion (as determined by the perceived emotion detector 632) and the person's semantic emotion (as determined by the semantic emotion detector 634). According to certain embodiments of the present technology, the pose modifier 646 modifies the image data of the video signal 602 to increase the alignment between the person's body pose perceived emotion (as determined by the perceived emotion detector 632) and the person's semantic emotion (as determined by the semantic emotion detector 634). According to certain embodiments of the present technology, the audio modifier 648 modifies the audio data of the audio signal 622 to increase the alignment between the person's speech perceived emotion (as determined by the perceived emotion detector 632) and the person's semantic emotion (as determined by the semantic emotion detector 634). The emotion modifier 540 is shown as outputting a modified video signal 652 and a modified audio signal 662.

Certain embodiments of the present technology rely on the assumption that a person's emotions that respond to and/or are caused by environmental factors are recognizably different, in a certain feature space, from the person's emotions that respond to and/or are caused by the context of a conversation, and that the difference between these two kinds of emotion can be quantified. According to certain embodiments, the feature space used to quantify the difference between perceived emotion and semantic emotion is the feature space defined by the arousal/valence circumplex model, which was originally developed by James Russell and published in the article titled "A circumplex model of affect" in Journal of Personality and Social Psychology, Vol. 39(6), Dec. 1980, pages 1161-1178. The arousal/valence circumplex model (which can also be referred to more concisely as the circumplex model) proposes that emotions are distributed in a two-dimensional circular space containing arousal and valence dimensions. Arousal corresponds to the vertical axis and valence corresponds to the horizontal axis, while the center of the circle corresponds to neutral valence and a medium level of arousal. In this model, emotional states can be represented at any level of valence and arousal, or at a neutral level of one or both of these factors. James Russell and Lisa Feldman Barrett later developed a modified arousal/valence circumplex model, which they published in the article titled "Core affect, prototypical emotional episodes, and other things called emotion: dissecting the elephant" in Journal of Personality and Social Psychology, Vol. 76(5), May 1999, pages 805-819.

According to certain embodiments of the present technology, the perceived emotion detector 632 uses one or more arousal/valence circumplex models to determine three types of perceived emotion, based respectively on facial expression, body pose, and speech. More specifically, in certain embodiments a facial circumplex model is used to determine the arousal and valence associated with the person's facial expression, a pose circumplex model is used to determine the arousal and valence associated with the person's body pose, and a speech circumplex model is used to determine the arousal and valence associated with the person's speech. The valence dimension is represented on the horizontal axis and ranges between positive and negative valence. Positive and negative valence (along the horizontal axis) are also known, respectively, as pleasant and unpleasant emotion, or more generally as positiveness. The arousal dimension is represented on the vertical axis, which intersects the horizontal "valence" axis, and ranges between activated and deactivated. Activated and deactivated arousal (along the vertical axis) are also known, respectively, as intense and non-intense arousal, or more generally as activeness. A general circumplex model is illustrated in FIG. 7A, a facial circumplex model is illustrated in FIG. 7B, a pose circumplex model is illustrated in FIG. 7C, and a speech circumplex model is illustrated in FIG. 7D.

According to certain embodiments of the present technology, feature vectors generated by facial expression detection, pose detection, and speech detection algorithms are input to a DNN. Facial expression detection can be performed by the face detector 610 and the facial expression analyzer 612, discussed above with reference to FIG. 6. The result of facial expression detection can be one or more facial feature vectors. Pose detection can be performed by the skeleton detector 614 and the pose analyzer 616. The result of pose detection can be one or more pose feature vectors. Speech detection can be performed by the audio signal analyzer 624. The result of speech detection can be one or more speech feature vectors. According to certain embodiments, these feature vectors are concatenated together and fed to the DNN. Such a DNN can be used to implement the perceived emotion detector 632 in FIG. 6.

According to certain embodiments of the present technology, the output of the DNN that implements the perceived emotion detector 632 is six values denoted {aro_f, val_f, aro_p, val_p, aro_s, val_s}, where "aro" refers to arousal, "val" refers to valence, and the subscripts f, p, and s refer to face, pose, and speech, respectively. Thus, there is an arousal value and a valence value representing the person's facial expression, an arousal value and a valence value representing the person's body pose, and an arousal value and a valence value representing the person's speech. According to certain embodiments, as described in further detail below, these values are used to modify the person's facial expression, body pose, and/or speech. The terms modify and alter are used interchangeably herein.
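The data flow described above, in which per-modality feature vectors are concatenated and the detector's six outputs are read back as three (arousal, valence) pairs, can be sketched as follows. This is a minimal illustration only: the names `Affect`, `concat_features`, and `unpack_perceived` are hypothetical, the DNN itself is omitted, and the output ordering simply follows the order in which the text lists the six values.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Affect:
    """One point on a circumplex model (hypothetical container)."""
    arousal: float
    valence: float


def concat_features(face: Sequence[float],
                    pose: Sequence[float],
                    speech: Sequence[float]) -> List[float]:
    """Concatenate the face, pose, and speech feature vectors into one DNN input."""
    return [*face, *pose, *speech]


def unpack_perceived(outputs: Sequence[float]) -> Dict[str, Affect]:
    """Interpret six outputs as {aro_f, val_f, aro_p, val_p, aro_s, val_s}."""
    if len(outputs) != 6:
        raise ValueError("expected exactly six output values")
    aro_f, val_f, aro_p, val_p, aro_s, val_s = outputs
    return {
        "face": Affect(aro_f, val_f),
        "pose": Affect(aro_p, val_p),
        "speech": Affect(aro_s, val_s),
    }
```

Keeping each modality as its own (arousal, valence) pair makes it straightforward to compare every perceived emotion against the semantic emotion separately.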

According to certain embodiments of the present technology, a deep-learning-based natural language processing (NLP) algorithm is applied to quantify the person's semantic emotion. The main idea is to determine the context-dependence of the emotion from the recognized speech. In natural language processing, a textual instance is often represented as a vector in a feature space. The number of features can often be as large as hundreds of thousands, and traditionally these features have known meanings: for example, whether the instance contains a particular word previously observed in the training data, whether a word is listed as a positive or negative term in a sentiment lexicon, and so on. Using an NLP algorithm, the person's semantic emotion can be estimated. According to certain embodiments, the output of the DNN that implements the semantic emotion detector 634 is two values denoted {aro_sem, val_sem}, which can be collectively represented as Emo_sem. According to certain embodiments, as described in further detail below, the person's semantic emotion Emo_sem is used to modify the image data and audio data corresponding to the person's facial expression, body pose, and/or speech.

Each perceived emotion Emo_perc^i denotes a point on a circumplex model, where i = {face, pose, speech}. This allows the several types of perceived emotion to be mapped onto a circumplex model, as shown in FIG. 8. The semantic emotion Emo_sem also denotes a point on a circumplex model, and it can be mapped onto the same circumplex model, as also shown in FIG. 8. Referring to FIG. 8, the "X" labeled 802 corresponds to the person's perceived facial emotion, the "X" labeled 804 corresponds to the person's perceived pose emotion, and the "X" labeled 806 corresponds to the person's perceived speech emotion. The positions of the Xs 802, 804, and 806 are defined by the six values discussed above, denoted {aro_f, val_f, aro_p, val_p, aro_s, val_s}. More specifically, the position of the "X" labeled 802 is defined by the values aro_f and val_f, the position of the "X" labeled 804 is defined by the values aro_p and val_p, and the position of the "X" labeled 806 is defined by the values aro_s and val_s. Still referring to FIG. 8, the dot labeled 808 corresponds to the person's semantic emotion Emo_sem. The position of the dot labeled 808 is defined by the values aro_sem and val_sem.

As noted above, activeness is a measure of arousal, and positiveness is a measure of valence. The distance between any one of the perceived emotions Emo_perc^i and the semantic emotion Emo_sem can be calculated using an equation (the equation is not reproduced in this text).
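Because the circumplex model places each emotion at an (arousal, valence) point, one natural reading of this distance is the Euclidean distance between the two points. That reading is an assumption made here for illustration, not a formula taken from the text, and the symbol d_i is introduced only for this sketch:

```latex
d_i = \sqrt{\left(aro_i - aro_{sem}\right)^2 + \left(val_i - val_{sem}\right)^2},
\qquad i \in \{f,\, p,\, s\}
```

Under this assumption, d_f, d_p, and d_s measure how far the face, pose, and speech points 802, 804, and 806 lie from the semantic emotion point 808 in FIG. 8.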

The distance between a perceived emotion Emo_perc^i and the semantic emotion Emo_sem indicates how closely the perceived emotion and the semantic emotion are aligned. For example, if the distance between a particular perceived emotion (e.g., body pose) and the semantic emotion is relatively small, this indicates that the perceived emotion is substantially aligned with the semantic emotion. Conversely, if the distance between a particular perceived emotion (e.g., body pose) and the semantic emotion is relatively large, this indicates that the perceived emotion is substantially unaligned with the semantic emotion. According to certain embodiments, the distance between the perceived emotion and the semantic emotion is determined for each type of determined perceived emotion. This results in three distance values being determined, one each for facial expression, body pose, and speech. Where a determined distance exceeds a specified distance threshold, it is determined that the perceived emotion is substantially unaligned with the semantic emotion, and in response to that determination the respective feature (e.g., facial expression, body pose, or speech) is modified to increase the alignment between the perceived emotion and the semantic emotion. As a further example, if the determined distance between the facial perceived emotion (represented by the "X" labeled 802 in FIG. 8) and the semantic emotion (represented by the dot labeled 808 in FIG. 8) is greater than the specified distance threshold, the facial image data of the video signal is modified to produce a modified video signal in which the facial perceived emotion is more closely aligned with the semantic emotion. Conversely, if the determined distance between the facial perceived emotion and the semantic emotion is less than the specified distance threshold (also referred to as being within the distance threshold), the facial image data of the video signal is not modified. The same determination of a respective distance, and comparison of that distance to the distance threshold, is also performed for body pose and for speech. The results of those comparisons are used to decide whether to modify the body pose data of the video signal and/or the speech data of the audio signal.

The distance determination and comparison described above are summarized in the flowchart of FIG. 9. Referring to FIG. 9, at step 902 the distance between one of the perceived emotions (face, pose, or speech) and the semantic emotion is determined, or more specifically calculated, e.g., using the equation discussed above. At step 904 the calculated distance is compared to the distance threshold, and at step 906 there is a determination of whether the calculated distance is within (i.e., less than) the distance threshold. If the calculated distance is not within the distance threshold (i.e., the answer to the determination at step 906 is No), flow goes to step 908, where the relevant signal or portion thereof is modified before flow proceeds to step 910. If the calculated distance is within the distance threshold (i.e., the answer to the determination at step 906 is Yes), flow goes to step 910 without any modification to the relevant signal or portion thereof. The steps summarized above can be performed for each of the different types of perceived emotion, including face, pose, and speech.
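The flow of FIG. 9 can be sketched as a small loop over the three modalities. This is only an illustration under stated assumptions: the distance is taken to be Euclidean (the text does not spell out the equation), a distance exactly equal to the threshold is treated as not within it, and the function names and the threshold value are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (arousal, valence) on the circumplex model


def distance(perceived: Point, semantic: Point) -> float:
    """Step 902: distance between a perceived and the semantic emotion.

    Assumption: Euclidean distance in the arousal/valence plane.
    """
    return math.hypot(perceived[0] - semantic[0], perceived[1] - semantic[1])


def modalities_to_modify(perceived: Dict[str, Point],
                         semantic: Point,
                         threshold: float) -> List[str]:
    """Return the modalities whose signal data should be modified (step 908)."""
    flagged: List[str] = []
    for name, point in perceived.items():
        d = distance(point, semantic)   # step 902
        within = d < threshold          # steps 904 and 906
        if not within:
            flagged.append(name)        # step 908: mark this modality for modification
        # step 910: continue with the next modality
    return flagged


perceived = {"face": (0.9, -0.8), "pose": (0.5, 0.6), "speech": (0.1, 0.1)}
print(modalities_to_modify(perceived, semantic=(0.5, 0.5), threshold=0.5))
```

With these made-up values, only the pose point lies within the threshold, so the face and speech data would be handed to their respective modifiers.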

According to certain embodiments of the present technology, the video and audio of the first person are modified by generating synthetic images and/or audio to replace their original versions. More specifically, the originally obtained video and audio signals of the first person are modified to produce modified video and audio signals that, when viewed or listened to by the second person (or by several other persons), have a perceived emotion that is more closely aligned with the first person's semantic emotion. According to certain embodiments, the perceived emotion of the generated images and/or audio should approach the semantic emotion as closely as possible.

Referring back to FIG. 6, the emotion modifier 540, which can also be referred to more specifically as the perceived emotion modifier 540, is shown as including the facial expression modifier 642, the pose modifier 646, and the audio modifier 648. Each of the modifiers 642, 646, and 648, which can also be referred to as modules, uses an algorithm to generate synthetic images or synthetic audio by modifying specific data within the captured audio and video signals. As a simple example, assume it is determined that the first person's semantic emotion is happy, but that their facial perceived emotion is nervous, their pose perceived emotion is flustered, and their speech perceived emotion is tense. Using embodiments of the present technology, the facial image data is modified so that the person's facial expression is happy (rather than nervous), the pose image data is modified so that the person's body pose is happy (rather than flustered), and the audio data is modified so that the person's speech is happy (rather than tense). Such modifications should be made in real time or near real time so that there is no noticeable lag in the video chat.

As noted above, one or more DNNs and/or other computer models can be used to modify the captured video and audio signals to produce the modified video and audio signals. According to specific embodiments, a generative adversarial network (GAN) is used to perform such modifications. A GAN is a deep neural network architecture that includes two neural networks, namely a generative neural network and a discriminative neural network, that are pitted one against the other in a contest (hence the use of the term "adversarial"). The generative neural network and the discriminative neural network can therefore be considered subnetworks of the GAN. The generative neural network generates candidates, while the discriminative neural network evaluates them. The contest operates in terms of data distributions. The generative neural network can learn to map from a latent space to a data distribution of interest, while the discriminative neural network can distinguish candidates produced by the generative neural network from the true data distribution. The training objective of the generative neural network can be to increase the error rate of the discriminative neural network, i.e., to "fool" the discriminator by producing novel candidates that the discriminator believes are not synthesized (that is, are part of the true data distribution). A known dataset serves as the initial training data for the discriminative neural network. Training the discriminative neural network can involve presenting it with samples from the training dataset until it achieves acceptable accuracy. The generative neural network can be trained based on whether it succeeds in fooling the discriminative neural network. The generative neural network can be seeded with randomized input sampled from a predefined latent space (e.g., a multivariate normal distribution). Thereafter, candidates synthesized by the generative neural network can be evaluated by the discriminative neural network. Backpropagation can be applied in both networks so that the generative neural network produces better images while the discriminative neural network becomes more skilled at flagging synthetic images. The generative neural network can be, e.g., a deconvolutional neural network, and the discriminative neural network can be, e.g., a convolutional neural network. The GAN should be trained before it is used to modify signals during a video chat.
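The contest between the two subnetworks is commonly written as a minimax objective. The form below is the standard GAN formulation from the machine learning literature; it is not stated in this document and is shown only to make the description concrete. Here G is the generative network, D is the discriminative network, p_data is the true data distribution, and p_z is the latent-space distribution from which the generator's input z is sampled:

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}}\!\left[\log\!\left(1 - D(G(z))\right)\right]
```

The discriminator is trained to increase V by scoring true samples high and generated samples low, while the generator is trained to decrease V, which corresponds to increasing the discriminator's error rate as described above.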

Referring back to FIG. 6, the facial expression modifier 642 can be implemented by a GAN. More specifically, a GAN can be used to modify the image data within the video signal to produce a modified video signal that can be used to display a realistic image of the person, where the image has been modified to bring the person's facial and pose perceived emotions into closer alignment with the person's semantic emotion. A GAN can also be used to modify the audio signal so that the person's speech perceived emotion is more closely aligned with the person's semantic emotion. In specific embodiments, StarGAN can be used to perform the image and/or audio modifications. The article titled "StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation," by Y. Choi et al., CVPR, 2018, discusses how StarGAN has been used to modify people's facial expressions in a realistic manner. The use of additional and/or alternative types of neural networks and/or other types of computer models is also within the scope of the embodiments described herein.

Still referring to FIG. 6, a GAN can also be used to implement the pose modifier 646. Alternatively, a pretrained visual generator model can be used to implement the pose modifier 646. As shown in FIG. 6, the original video signal 602 is provided to the skeleton detector 614. The original video signal 602 can also be referred to as the original image stream 602. The skeleton detector 614 extracts skeleton information from the original image stream. The skeleton information can be represented as a vector X, which stores all of the joint positions within the skeleton. According to an embodiment, the vector X is combined with a semantic emotion signal, which is represented by a vector e. These two vectors can be concatenated into a single combined vector and used as the input to the pretrained visual generator model. The pretrained visual generator model can be implemented with, e.g., convolution layers, max-pooling layers, deconvolution layers, and batch normalization layers, but is not limited thereto. The output of the pretrained visual generator model can be used to generate the modified video signal 652, which includes a modified body pose that is more closely aligned with the semantic emotion.
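The way the skeleton vector X and the semantic emotion vector e are joined before being passed to the generator can be sketched as follows. The representation is assumed for illustration: joints are taken to be 2D (x, y) image coordinates, e is taken to be the pair (aro_sem, val_sem), and the function names are hypothetical. The pretrained visual generator model itself is not shown.

```python
from typing import List, Sequence, Tuple


def flatten_joints(joints: Sequence[Tuple[float, float]]) -> List[float]:
    """Build the vector X by storing every joint position in a flat list."""
    return [coord for joint in joints for coord in joint]


def generator_input(joints: Sequence[Tuple[float, float]],
                    emotion: Sequence[float]) -> List[float]:
    """Concatenate the skeleton vector X with the semantic emotion vector e."""
    return flatten_joints(joints) + list(emotion)


# Two joints and a semantic emotion of (arousal, valence) = (0.5, 0.7).
print(generator_input([(0.1, 0.2), (0.3, 0.4)], (0.5, 0.7)))
```

Appending e to X lets a single generator be conditioned on the target emotion, rather than requiring one model per emotion.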

Still referring to FIG. 6, as already noted above, the original audio signal 622 is shown as being provided to the audio signal analyzer 624. Features of the audio signal that can be modified to bring the perceived emotion corresponding to a person's speech into closer alignment with their semantic emotion include, but are not limited to, pitch, vibrato, and inflection. More specifically, the pitch can be shifted, where pitch shifting means multiplying the pitch of the original voice signal by a factor α. Increased pitch (α > 1) often correlates with highly aroused states such as happiness, whereas decreased pitch (α < 1) correlates with low valence, such as sadness. Vibrato is a periodic modulation of the pitch (fundamental frequency) of the voice, occurring with a given rate and depth. Vibrato, which is also related to jitter, is frequently reported as a correlate of high arousal and is an important marker of emotion even in a single vowel. Vibrato can be modified to alter the perceived emotion corresponding to speech. Inflection is a rapid modification (e.g., ~500 ms) of the pitch at the start of each utterance, which overshoots its target by several semitones but quickly decays to the normal value. The use of inflection leads to increased variation in pitch, which is associated with high emotional intensity and positive valence. Inflection can also be modified to alter the perceived emotion corresponding to speech. The audio signal can also be filtered to alter the perceived emotion corresponding to speech, where filtering refers to the process of emphasizing or attenuating the energy contribution of certain regions of the frequency spectrum. For example, high-arousal emotions tend to be associated with increased high-frequency energy, which makes the voice sound sharper and clearer. Where a person's semantic emotion corresponds to less activated arousal than the person's perceived emotion corresponding to speech, the high-frequency energy in the audio signal can be attenuated using filtering to bring the perceived emotion into closer alignment with the semantic emotion. The emotional tone of the modified audio signal should be recognizable, and the voice should sound natural and should not be perceived as synthetic. As noted above, the terms modify and alter are used interchangeably herein.
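Two of the modifications described above, pitch shifting by a factor α and vibrato as a periodic modulation of pitch, can be illustrated on a fundamental-frequency (f0) contour. This is a simplified sketch under stated assumptions: it operates on a list of per-frame f0 values rather than on the audio waveform, the vibrato is modeled as a sinusoid whose depth is given in cents, and all function and parameter names are hypothetical. Resynthesizing natural-sounding speech from the modified contour is outside the scope of the sketch.

```python
import math
from typing import List


def shift_pitch(f0_hz: List[float], alpha: float) -> List[float]:
    """Multiply every pitch value by alpha (alpha > 1 raises, alpha < 1 lowers)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return [f * alpha for f in f0_hz]


def add_vibrato(f0_hz: List[float], frame_rate_hz: float,
                rate_hz: float, depth_cents: float) -> List[float]:
    """Apply a sinusoidal pitch modulation with the given rate and depth."""
    out: List[float] = []
    for n, f in enumerate(f0_hz):
        t = n / frame_rate_hz                                   # time of frame n in seconds
        cents = depth_cents * math.sin(2 * math.pi * rate_hz * t)
        out.append(f * 2 ** (cents / 1200))                     # 1200 cents = one octave
    return out


contour = [200.0] * 5
raised = shift_pitch(contour, 1.5)                  # every frame becomes 300.0 Hz
wobbly = add_vibrato(contour, frame_rate_hz=100.0, rate_hz=5.0, depth_cents=50.0)
```

Expressing vibrato depth in cents keeps the modulation proportional to the underlying pitch, so the same setting behaves similarly for low and high voices.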

One or more processors can be used to implement the neural networks described above. Where multiple processors are used, they can be collocated, widely distributed, or a combination thereof.

The high-level flow diagram of FIG. 10 will now be used to summarize methods according to certain embodiments of the present technology. Referring to FIG. 10, step 1002 involves obtaining a video signal and an audio signal of a first person participating in a video chat with a second person. Referring back to FIGS. 1-4, step 1002 can be performed by an A-V subsystem (e.g., 120A), or more specifically, by one or more cameras and one or more microphones of the A-V subsystem, or by some other subsystem or system.

Referring again to FIG. 10, step 1004 involves determining one or more types of perceived emotion of the first person based on the video signal. Step 1006 involves determining a semantic emotion of the first person based on the audio signal. The types of perceived emotion that can be determined at step 1004 include facial expression perceived emotion, body pose perceived emotion, and speech perceived emotion, as described above.

Referring briefly back to FIGS. 5 and 6, the various types of perceived emotion can be determined, e.g., by the emotion detector 530, or more specifically, by its perceived emotion detector 632. More specifically, a facial expression and a body pose of the first person can be determined based on the video signal obtained at step 1002, and a facial expression perceived emotion and a body pose perceived emotion of the first person can be determined based thereon. Additionally, audio signal processing of the audio signal obtained at step 1002 can be performed to determine at least one of the pitch, vibrato, or inflection of the first person's speech, and a speech perceived emotion of the first person can be determined based on results of the audio signal processing. Additional and/or alternative variations are also possible and are within the scope of the embodiments described herein.

In accordance with specific embodiments, at step 1004, a facial circumplex model is used to quantify the positiveness and activeness of the facial expression of the first person based on the video signal, a pose circumplex model is used to quantify the positiveness and activeness of the body pose of the first person based on the video signal, and a speech circumplex model is used to quantify the positiveness and activeness of the speech of the first person based on the audio signal.

The semantic emotion determined at step 1006 can be determined, e.g., by the emotion detector 530, or more specifically, by its semantic emotion detector 634. As explained in further detail above, step 1006 can involve performing natural language processing of the audio signal and determining the semantic emotion of the first person based on results of the natural language processing. In accordance with specific embodiments, at step 1006, the semantic emotion of the first person is determined based on the audio signal using a language circumplex model, which quantifies the positiveness and activeness of the first person's language based on the audio signal.
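As a rough illustration of what quantifying the positiveness and activeness of language could look like, the Python sketch below averages per-word scores over a transcript. It is a deliberately simple stand-in and not the language model the disclosure refers to: it assumes a transcript has already been produced from the audio signal by speech recognition, and the lexicon `TOY_LEXICON`, its score values, and the function name `semantic_emotion` are hypothetical, invented purely for illustration.

```python
# Hypothetical toy lexicon: word -> (positiveness, activeness), each in [-1, 1].
# The values are invented for illustration and carry no empirical weight.
TOY_LEXICON = {
    "great": (0.8, 0.6),
    "happy": (0.9, 0.5),
    "calm": (0.4, -0.6),
    "sad": (-0.7, -0.4),
    "angry": (-0.8, 0.8),
}


def semantic_emotion(transcript):
    """Return the mean (positiveness, activeness) of the known words.

    Words missing from the lexicon are ignored; a transcript with no known
    words maps to the neutral point (0.0, 0.0).
    """
    words = [w.strip(".,!?").lower() for w in transcript.split()]
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not hits:
        return (0.0, 0.0)
    positiveness = sum(h[0] for h in hits) / len(hits)
    activeness = sum(h[1] for h in hits) / len(hits)
    return (positiveness, activeness)


point = semantic_emotion("I am happy and calm today.")
```

A word-averaging lexicon ignores negation and context, so it only shows the shape of the output (a point with two coordinates), not a realistic way to compute it.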

Referring again to FIG. 10, step 1008 involves altering the video signal and the audio signal to increase alignment between at least one of the one or more types of perceived emotion of the first person and the semantic emotion of the first person. As described in detail above, one or more computer-implemented neural networks can be used to perform step 1008. Other types of computer-implemented models can alternatively or additionally be used to perform step 1008. Step 1008 can involve modifying the facial expression and body pose in image data included in the video signal, as well as modifying at least one of the pitch, vibrato, or inflection of audio data included in the audio signal.

In accordance with specific embodiments, step 1008 involves altering image data included in the video signal to reduce a distance between the positiveness and activeness of the facial expression of the first person and the positiveness and activeness of the language of the first person. Step 1008 can also involve altering image data included in the video signal to reduce a distance between the positiveness and activeness of the body pose of the first person and the positiveness and activeness of the language of the first person. Further, step 1008 can also involve altering audio data included in the audio signal to reduce a distance between the positiveness and activeness of the speech of the first person and the positiveness and activeness of the language of the first person.
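The notion of reducing a distance between two emotions can be made concrete by treating each emotion as a point whose two coordinates are the quantified positiveness and activeness. The Python sketch below is one possible reading under stated assumptions: the Euclidean metric, the `strength` parameter, and the function names are not specified by the disclosure. It computes only a target point for the altered signal; producing image or audio data that actually expresses that target is the job of the models described above.

```python
import math


def emotion_distance(perceived, semantic):
    """Euclidean distance between two (positiveness, activeness) points."""
    return math.hypot(perceived[0] - semantic[0], perceived[1] - semantic[1])


def move_toward(perceived, semantic, strength):
    """Return a target point between the perceived and semantic emotions.

    strength = 0 leaves the perceived emotion unchanged; strength = 1 moves
    it all the way to the semantic emotion.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return tuple(p + strength * (s - p) for p, s in zip(perceived, semantic))


# Example: a frowning face while the words spoken are positive.
face = (-0.4, 0.2)      # facial expression: (positiveness, activeness)
language = (0.6, 0.5)   # language: (positiveness, activeness)
target = move_toward(face, language, 0.5)
```

With `strength` below 1 the alteration is partial, which may help keep the altered face or voice close enough to the original to look and sound natural.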

Still referring to FIG. 10, step 1010 involves providing (e.g., transmitting) the altered video signal and the altered audio signal to a subsystem (e.g., a device) associated with (e.g., in proximity to) the second person participating in the video chat, thereby enabling the second person to see and hear modified images and audio of the first person that have increased alignment between at least one of the one or more types of perceived emotion of the first person and the semantic emotion of the first person.
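The flow of steps 1002 through 1010 can be summarized as a small pipeline in which each step is a pluggable function. The Python sketch below shows only the ordering and data flow; the `Frame` type and the parameter names `perceive`, `understand`, `alter`, and `send` are hypothetical placeholders, and the detectors and models that would stand behind them are not implemented here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

Emotion = Tuple[float, float]  # (positiveness, activeness)


@dataclass
class Frame:
    """One unit of captured data from the first person (step 1002)."""
    video: Any
    audio: Any


def process_frame(
    frame: Frame,
    perceive: Callable[[Frame], Dict[str, Emotion]],
    understand: Callable[[Any], Emotion],
    alter: Callable[[Frame, Dict[str, Emotion], Emotion], Frame],
    send: Callable[[Frame], None],
) -> Frame:
    """Run steps 1004 to 1010 on a frame obtained at step 1002."""
    perceived = perceive(frame)                   # step 1004: perceived emotions by type
    semantic = understand(frame.audio)            # step 1006: semantic emotion
    altered = alter(frame, perceived, semantic)   # step 1008: increase alignment
    send(altered)                                 # step 1010: provide to the second person
    return altered


# Example with stub functions standing in for the real detectors and models.
outbox = []
result = process_frame(
    Frame(video="raw-video", audio="raw-audio"),
    perceive=lambda f: {"face": (-0.4, 0.2)},
    understand=lambda audio: (0.6, 0.5),
    alter=lambda f, p, s: Frame(video="altered-video", audio="altered-audio"),
    send=outbox.append,
)
```

Passing the steps in as functions keeps the ordering fixed while leaving open whether each step runs on the device or on a server.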

The methods described above, e.g., with reference to FIGS. 9 and 10, can be performed, at least in part, by an in-cabin computer system or a mobile computing device, such as, but not limited to, a smartphone, a tablet computer, a notebook computer, a laptop computer, or the like. The steps of such a method can be performed solely by a mobile computing device, or by a mobile computing device that communicates with one or more servers via one or more communication networks. FIG. 11 illustrates exemplary components of an exemplary mobile computing device with which embodiments of the present technology can be used. Such a mobile computing device can be used, e.g., to implement an A-V subsystem (e.g., 120A or 220A in FIGS. 1-4), but is not limited thereto.

FIG. 11 illustrates an exemplary mobile computing device 1102 with which embodiments of the present technology described herein can be used. The mobile computing device 1102 can be a smartphone, such as, but not limited to, an iPhone™, a Blackberry™, an Android™-based or a Windows™-based smartphone. The mobile computing device 1102 can alternatively be a tablet computing device, such as, but not limited to, an iPad™, an Android™-based or a Windows™-based tablet. For another example, the mobile computing device 1102 can be an iPod Touch™, or the like.

Referring to the block diagram of FIG. 11, the mobile computing device 1102 is shown as including a camera 1104, an accelerometer 1106, a magnetometer 1108, a gyroscope 1110, a microphone 1112, a display 1114 (which may or may not be a touch screen display), a processor 1116, memory 1118, a transceiver 1120, a speaker 1122 and a drive unit 1124. Each of these elements is shown as being connected to a bus 1128, which enables the various components to communicate with one another and transfer data from one element to another. It is also possible that some of the elements can communicate with one another without using the bus 1128.

The camera 1104 can be used to obtain a video signal that includes images of a person using the mobile computing device 1102. The microphone 1112 can be used to produce an audio signal indicative of what is said by the person using the mobile computing device 1102.

The accelerometer 1106 can be used to measure linear acceleration relative to a frame of reference, and thus can be used to detect motion of the mobile computing device 1102 as well as to detect an angle of the mobile computing device 1102 relative to the horizon or ground. The magnetometer 1108 can be used as a compass to determine a direction of magnetic north and bearings relative to magnetic north. The gyroscope 1110 can be used to detect both vertical and horizontal orientation of the mobile computing device 1102, and together with the accelerometer 1106 and the magnetometer 1108 can be used to obtain very accurate information about the orientation of the mobile computing device 1102. It is also possible that the mobile computing device 1102 includes additional sensor elements, such as, but not limited to, an ambient light sensor and/or a proximity sensor.
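For example, an angle relative to the horizontal can be estimated from a single accelerometer reading when the device is at rest, because the reading then reflects gravity alone. The Python sketch below is a generic illustration and not part of the disclosure; it assumes the z axis is perpendicular to the screen and that the reading is in consistent units on all three axes. The function name `tilt_from_gravity_deg` is hypothetical.

```python
import math


def tilt_from_gravity_deg(ax, ay, az):
    """Angle in degrees between the screen normal (z axis) and the measured gravity vector.

    Returns 0 for a device lying flat with the screen up and 90 for a device
    held upright. Valid only when the device is not otherwise accelerating.
    """
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    if norm == 0.0:
        raise ValueError("accelerometer reading has zero magnitude")
    cos_tilt = max(-1.0, min(1.0, az / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_tilt))


flat = tilt_from_gravity_deg(0.0, 0.0, 9.81)
upright = tilt_from_gravity_deg(0.0, 9.81, 0.0)
```

Because the result depends only on the direction of the reading, the same function works whether the sensor reports in m/s² or in units of g.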

The display 1114, which may or may not be a touch screen type of display, can be used as a user interface to visually display items (e.g., images, options, instructions, etc.) to a user and to accept inputs from the user. The display 1114 can also be used to enable the user of the mobile computing device 1102 to participate in a video chat. Further, the mobile computing device 1102 can include additional elements that accept inputs from a user, such as keys, buttons, a track-pad, a trackball, or the like.

The memory 1118 can be used to store software and/or firmware that controls the mobile computing device 1102, as well as to store images captured using the camera 1104, but is not limited thereto. Various different types of memory, including non-volatile and volatile memory, can be included in the mobile computing device 1102. The drive unit 1124, e.g., a hard drive, but not limited thereto, can also be used to store software that controls the mobile computing device 1102, as well as to store images captured using the camera 1104, but is not limited thereto. The memory 1118 and the drive unit 1124 can include a machine-readable medium on which are stored one or more sets of executable instructions (e.g., apps) embodying one or more of the methodologies and/or functions described herein. In place of the drive unit 1124, or in addition to the drive unit, the mobile computing device can include a solid-state storage device, such as flash memory or any form of non-volatile memory. The term "machine-readable medium" as used herein should be taken to include all forms of storage media, whether a single medium or multiple media, in all forms: e.g., a centralized or distributed database and/or associated caches and servers; one or more storage devices, such as storage drives (including, e.g., magnetic and optical drives and storage mechanisms); and one or more instances of memory devices or modules (whether main memory, cache storage either internal or external to a processor, or buffers). The term "machine-readable medium" or "computer-readable medium" shall be taken to include any tangible non-transitory medium that is capable of storing or encoding a sequence of instructions for execution by the machine and that causes the machine to perform any one of the methodologies. The term "non-transitory medium" expressly includes all forms of storage drives (optical, magnetic, etc.) and all forms of memory devices (e.g., DRAM, Flash (of all storage designs), SRAM, MRAM, phase change, etc.), as well as all other structures designed to store information of any type for later retrieval.

The transceiver 1120, which is connected to an antenna 1126, can be used to transmit and receive data wirelessly using, e.g., Wi-Fi, cellular communications or mobile satellite communications. The mobile computing device 1102 may also be able to perform wireless communications using Bluetooth and/or other wireless technologies. It is also possible that the mobile computing device 1102 includes multiple types of transceivers and/or multiple types of antennas. The transceiver 1120 can include a transmitter and a receiver.

The speaker 1122 can be used to provide auditory instructions, feedback and/or indicators to a user and to play back recordings (e.g., musical recordings), as well as to enable the mobile computing device 1102 to operate as a mobile phone. The speaker 1122 can also be used to enable the user of the mobile computing device 1102 to participate in a video chat.

The processor 1116 can be used to control the various other elements of the mobile computing device 1102, e.g., under control of software and/or firmware stored in the memory 1118 and/or the drive unit 1124. It is also possible that there are multiple processors 1116, e.g., a central processing unit (CPU) and a graphics processing unit (GPU). The processor(s) 1116 can execute computer instructions (stored in a non-transitory computer-readable medium) to cause the processor(s) to perform steps used to implement the embodiments of the present technology described herein.

Certain embodiments of the present technology described herein can be implemented using hardware, software, or a combination of both hardware and software. The software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein. The processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer readable storage media and communication media. Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. A computer readable medium or media does not include propagated, modulated or transitory signals.

Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism, and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.

In alternative embodiments, some or all of the software can be replaced by dedicated hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc. In one embodiment, software (stored on a storage device) implementing one or more embodiments is used to program one or more processors. The one or more processors can be in communication with one or more computer readable media/storage devices, peripherals and/or communication interfaces.

It is understood that the present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this subject matter will be thorough and complete and will fully convey the disclosure to those skilled in the art. Indeed, the subject matter is intended to cover alternatives, modifications and equivalents of these embodiments, which are included within the scope and spirit of the subject matter as defined by the appended claims. Furthermore, in the following detailed description of the present subject matter, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. However, it will be clear to those of ordinary skill in the art that the present subject matter may be practiced without such specific details.

Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or to limit the disclosure to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.

The disclosure has been described in conjunction with various embodiments. However, other variations and modifications to the disclosed embodiments can be understood and effected from a study of the drawings, the disclosure, and the appended claims, and such variations and modifications are to be interpreted as being encompassed by the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality.

For purposes of this document, it should be noted that the dimensions of the various features depicted in the figures may not necessarily be drawn to scale.

For purposes of this document, reference in the specification to "an embodiment," "one embodiment," "some embodiments," or "another embodiment" may be used to describe different embodiments or the same embodiment.

For purposes of this document, a connection may be a direct connection or an indirect connection (e.g., via one or more other parts). In some cases, when an element is referred to as being connected or coupled to another element, the element may be directly connected to the other element or indirectly connected to the other element via intervening elements. When an element is referred to as being directly connected to another element, there are no intervening elements between the element and the other element. Two devices are "in communication" if they are directly or indirectly connected so that they can communicate electronic signals between them.

For purposes of this document, the term "based on" may be read as "based at least in part on."

For purposes of this document, without additional context, use of numerical terms such as a "first" object, a "second" object, and a "third" object may not imply an ordering of objects, but may instead be used for identification purposes to identify different objects.

The foregoing detailed description has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the subject matter claimed herein to the precise form(s) disclosed. Many modifications and variations are possible in light of the above teachings. The described embodiments were chosen in order to best explain the principles of the disclosed technology and its practical application, and thereby to enable others skilled in the art to best utilize the technology in various embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope be defined by the claims appended hereto.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

A method comprising:
obtaining a video signal and an audio signal of a first person participating in a video chat with a second person;
determining one or more types of perceived emotion of the first person based on the video signal;
determining a semantic emotion of the first person based on the audio signal; and
modifying the video signal to increase alignment between at least one of the one or more types of perceived emotion of the first person and the semantic emotion of the first person.

The method of claim 1, wherein determining the one or more types of perceived emotion of the first person based on the video signal comprises:
detecting a facial expression of the first person based on the video signal, and determining a facial expression perceived emotion of the first person based on the facial expression of the first person; or
detecting a body pose of the first person based on the video signal, and determining a body pose perceived emotion of the first person based on the body pose of the first person; or
detecting a facial expression and a body pose of the first person based on the video signal, and determining a facial expression perceived emotion and a body pose perceived emotion of the first person based on the facial expression and the body pose of the first person.

The method of claim 2, wherein:
determining the one or more types of perceived emotion of the first person is also based on the audio signal and comprises performing audio signal processing of the audio signal to determine at least one of a pitch, a vibrato, or an inflection of speech of the first person, and determining a speech perceived emotion of the first person based on a result of the audio signal processing of the audio signal; and
the method further comprises modifying the audio signal to increase alignment between the speech perceived emotion of the first person and the semantic emotion of the first person.

The method of claim 3, wherein:
modifying the audio signal to produce a modified audio signal comprises modifying audio data of the audio signal corresponding to at least one of the pitch, the vibrato, or the inflection; and
the method further comprises providing the modified audio signal to a subsystem associated with the second person participating in the video chat.

The method of claim 1, wherein:
modifying the video signal to produce a modified video signal comprises modifying image data of the video signal corresponding to at least one of a facial expression or a body pose; and
the method further comprises providing the modified video signal to a subsystem associated with the second person participating in the video chat.

The method of claim 1, wherein determining the semantic emotion of the first person based on the audio signal comprises:
performing natural language processing of the audio signal; and
determining the semantic emotion of the first person based on a result of the natural language processing of the audio signal.
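
As a rough illustration of determining a semantic emotion from language, the sketch below averages per-word scores over a transcript. It assumes that speech recognition has already turned the audio signal into text, and it stands in a tiny hand-made lexicon for real natural language processing; `TOY_LEXICON`, its scores, and `semantic_emotion` are hypothetical and do not come from the patent.

```python
from typing import Dict, Tuple

# Hypothetical lexicon: word -> (positiveness, activeness), each in [-1, 1].
TOY_LEXICON: Dict[str, Tuple[float, float]] = {
    "great": (0.8, 0.5),
    "happy": (0.9, 0.4),
    "fine": (0.3, -0.2),
    "terrible": (-0.8, 0.5),
    "tired": (-0.4, -0.7),
}


def semantic_emotion(transcript: str) -> Tuple[float, float]:
    """Average the lexicon scores of known words; (0.0, 0.0) if none match."""
    # Strip common punctuation and lowercase each whitespace-separated token.
    words = [w.strip(".,!?;:").lower() for w in transcript.split()]
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not scores:
        return (0.0, 0.0)
    n = len(scores)
    return (
        sum(p for p, _ in scores) / n,
        sum(a for _, a in scores) / n,
    )
```

A practical system would likely replace the lexicon with a trained language model; the averaging here is only meant to show how a transcript could be reduced to a single emotion estimate.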

The method of claim 1, wherein:
determining the one or more types of perceived emotion of the first person based on the video signal comprises at least one of using a facial circumplex model to quantify a positiveness and an activeness of a facial expression of the first person based on the video signal, or using a pose circumplex model to quantify a positiveness and an activeness of a body pose of the first person based on the video signal;
determining the semantic emotion of the first person based on the audio signal comprises using a language circumplex model to quantify a positiveness and an activeness of language of the first person based on the audio signal; and
modifying the video signal to produce a modified video signal comprises at least one of:
modifying image data of the video signal to reduce a distance between the positiveness and the activeness of the facial expression of the first person and the positiveness and the activeness of the language of the first person; or
modifying image data of the video signal to reduce a distance between the positiveness and the activeness of the body pose of the first person and the positiveness and the activeness of the language of the first person.
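
The distance reduction recited in this claim can be pictured by treating each circumplex model output as a point with two coordinates, positiveness and activeness. The following minimal sketch moves the facial-expression point part of the way toward the language point. The `CircumplexPoint` type, the `step` parameter, and the choice of Euclidean distance are assumptions made for illustration; the claim does not prescribe a particular distance measure or update rule, and mapping the target point back into image data is outside the scope of this sketch.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class CircumplexPoint:
    """A point in a circumplex model: (positiveness, activeness)."""
    positiveness: float
    activeness: float


def distance(a: CircumplexPoint, b: CircumplexPoint) -> float:
    """Euclidean distance between two circumplex points (an assumed metric)."""
    return math.hypot(a.positiveness - b.positiveness,
                      a.activeness - b.activeness)


def move_toward(perceived: CircumplexPoint,
                semantic: CircumplexPoint,
                step: float = 0.5) -> CircumplexPoint:
    """Return a target point lying between perceived and semantic.

    step=0.0 leaves the perceived emotion unchanged; step=1.0 matches the
    semantic emotion exactly. Values in between shrink the distance by
    a factor of (1 - step).
    """
    if not 0.0 <= step <= 1.0:
        raise ValueError("step must be in [0, 1]")
    return CircumplexPoint(
        perceived.positiveness
        + step * (semantic.positiveness - perceived.positiveness),
        perceived.activeness
        + step * (semantic.activeness - perceived.activeness),
    )
```

For example, with a facial point of (-0.6, 0.2) and a language point of (0.6, 0.2), `move_toward(face, language, step=0.5)` yields a target whose distance to the language point is half the original distance.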

The method of claim 7, wherein:
determining the one or more types of perceived emotion of the first person is also based on the audio signal and comprises using a speech circumplex model to quantify a positiveness and an activeness of speech of the first person based on the audio signal; and
the method further comprises modifying audio data of the audio signal, to produce a modified audio signal, to reduce a distance between the positiveness and the activeness of the speech of the first person and the positiveness and the activeness of the language of the first person.

The method of claim 8, further comprising providing the modified video signal and the modified audio signal to a subsystem associated with the second person participating in the video chat.

A subsystem comprising:
one or more interfaces configured to receive a video signal and an audio signal of a first person participating in a video chat with a second person; and
one or more processors communicatively coupled to the one or more interfaces and configured to:
determine one or more types of perceived emotion of the first person based on the video signal;
determine a semantic emotion of the first person based on the audio signal; and
modify the video signal to increase alignment between at least one of the one or more types of perceived emotion of the first person and the semantic emotion of the first person.

The subsystem of claim 10, further comprising:
one or more cameras configured to obtain the video signal; and
one or more microphones configured to obtain the audio signal.

The subsystem of claim 10, wherein the one or more processors implement one or more neural networks configured to determine the one or more types of perceived emotion of the first person based on the video signal and the semantic emotion of the first person based on the audio signal.

The subsystem of claim 10, wherein the one or more processors implement one or more neural networks configured to modify the video signal to increase alignment between at least one of the one or more types of perceived emotion of the first person and the semantic emotion of the first person.

The subsystem of claim 10, wherein, to determine the one or more types of perceived emotion of the first person based on the video signal, the one or more processors are configured to:
detect a facial expression of the first person based on the video signal, and determine a facial expression perceived emotion of the first person based on the facial expression of the first person; or
detect a body pose of the first person based on the video signal, and determine a body pose perceived emotion of the first person based on the body pose of the first person; or
detect a facial expression and a body pose of the first person based on the video signal, and determine a facial expression perceived emotion and a body pose perceived emotion of the first person based on the facial expression and the body pose of the first person.

The subsystem of claim 14, wherein the one or more processors are also configured to:
perform audio signal processing of the audio signal to determine at least one of a pitch, a vibrato, or an inflection of speech of the first person, and determine a speech perceived emotion of the first person based on a result of the audio signal processing; and
modify the audio signal to increase alignment between the speech perceived emotion of the first person and the semantic emotion of the first person.

The subsystem of claim 15, wherein the one or more processors are configured to modify the audio signal to produce a modified audio signal by modifying audio data of the audio signal corresponding to at least one of the pitch, the vibrato, or the inflection.

The subsystem of claim 10, wherein the one or more processors are configured to modify the video signal to produce a modified video signal by modifying image data of the video signal corresponding to at least one of a facial expression or a body pose.

The subsystem of claim 10, wherein the one or more processors are configured to:
perform natural language processing of the audio signal; and
determine the semantic emotion of the first person based on a result of the natural language processing of the audio signal.

The subsystem of claim 10, wherein the one or more processors are configured to:
use a facial circumplex model to quantify a positiveness and an activeness of a facial expression of the first person based on the video signal, or use a pose circumplex model to quantify a positiveness and an activeness of a body pose of the first person based on the video signal;
use a language circumplex model to quantify a positiveness and an activeness of language of the first person based on the audio signal; and
modify image data of the video signal to reduce at least one of a distance between the positiveness and the activeness of the facial expression of the first person and the positiveness and the activeness of the language of the first person, or a distance between the positiveness and the activeness of the body pose of the first person and the positiveness and the activeness of the language of the first person.

The subsystem of claim 19, wherein the one or more processors are also configured to:
use a speech circumplex model to quantify a positiveness and an activeness of speech of the first person based on the audio signal; and
modify audio data of the audio signal to reduce a distance between the positiveness and the activeness of the speech of the first person and the positiveness and the activeness of the language of the first person.

The subsystem of claim 20, further comprising a transmitter configured to transmit a modified video signal and a modified audio signal to a subsystem associated with the second person participating in the video chat.

A computer readable storage medium storing computer instructions which, when executed by one or more processors, cause the one or more processors to perform a method according to any one of claims 1 to 9.

(Deleted)