EP3942552A1 - Methods and systems that provide emotion modifications during video chats - Google Patents

Methods and systems that provide emotion modifications during video chats

Info

Publication number
EP3942552A1
Authority
EP
European Patent Office
Prior art keywords
person
audio signal
emotion
video signal
perceived
Prior art date
Legal status
Pending
Application number
EP19718999.6A
Other languages
German (de)
French (fr)
Inventor
Yueh Ning KU
Yuan Ma
Yitian WU
Lei Yang
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of EP3942552A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L 25/90 Pitch determination of speech signals
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B 27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B 27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/19 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B 27/28 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording

Definitions

  • The disclosure generally relates to methods and systems for use during video chats, and in specific embodiments, relates to methods and systems that alter video and audio signals of a person participating in a video chat to produce altered video and audio signals in which there is an increase in alignment between one or more of the person’s perceived emotions and the person’s semantic emotion.
  • A chat can be a voice chat or a video chat.
  • a voice chat refers to a communication that is solely audio, meaning that the two people participating in the voice chat can hear one another, but cannot see each other.
  • a video chat refers to a communication that includes both audio and video of the two people participating in the video chat, meaning that the two people participating in the video chat can both hear one another and see one another.
  • Videotelephony technologies, which provide for the reception and transmission of audio and video signals, can be used to perform video chats.
  • Exemplary videotelephony products include FaceTime available from Apple Inc., Google Duo and Google Hangouts both available from Google LLC, Skype available from Microsoft Corp., and WeChat available from Tencent Corp., just to name a few. Indeed, surveys have found that ten percent of drivers report having used their smartphone to video chat while driving. That percentage is likely to increase in the future, especially as semi-autonomous and fully-autonomous vehicles become more common.
  • Road rage, which is aggressive or angry behavior exhibited by a driver of a vehicle, is very common. Indeed, surveys have found that a significant majority of drivers have expressed significant anger while driving in the past year. Road rage can lead to numerous types of direct adverse effects. For example, for the driver of a vehicle and their passenger(s), road rage can lead to altercations, assaults and collisions that result in serious physical injuries or even death. Road rage can also lead to certain non-direct adverse effects.
  • Where a first person driving a first vehicle is participating in a video chat with a second person driving a second vehicle, and the first person experiences road rage, the anger of the first person may be passed on to the second person and/or otherwise distract the second person, which may increase the chances of the second person being involved in a collision.
  • Where a first person driving a first vehicle is participating in a business related video chat with one or more other persons, and the first person experiences road rage, the business relations between the first person and the one or more other persons can be ruined or otherwise adversely affected.
  • a method includes obtaining a video signal and an audio signal of a first person participating in a video chat with a second person, determining one or more types of perceived emotions of the first person based on the video signal, and determining a semantic emotion of the first person based on the audio signal. The method also includes altering the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
  • the determining the one or more types of perceived emotions of the first person based on the video signal includes detecting at least one of a facial expression or a body pose of the first person based on the video signal, and determining at least one of a facial expression perceived emotion or a body pose perceived emotion of the first person based on the at least one of the facial expression or the body pose of the first person.
  • the determining the one or more types of perceived emotions of the first person is also based on the audio signal and includes performing audio signal processing of the audio signal to determine at least one of pitch, vibrato or inflection of speech of the first person, and determining a speech perceived emotion of the first person based on results of the audio signal processing of the audio signal.
  • Such a method can further comprise altering the audio signal to increase alignment between the speech perceived emotion of the first person and the semantic emotion of the first person.
  • the altering the video signal to produce the altered video signal includes modifying image data of the video signal corresponding to at least one of a facial expression or a body pose
  • the altering the audio signal to produce the altered audio signal includes modifying audio data of the audio signal corresponding to at least one of the pitch, vibrato, or inflection.
  • the method further comprises providing the altered video signal and the altered audio signal to a subsystem associated with the second person that is participating in the video chat, to thereby enable the second person to view and listen to images and audio of the first person having increased alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
  • the determining the semantic emotion of the first person based on the audio signal includes performing natural language processing of the audio signal, and determining the semantic emotion of the first person based on results of the natural language processing of the audio signal.
  • the determining one or more types of perceived emotions of the first person based on the video signal includes at least one of using a facial circumplex model to quantify a positiveness and an activeness of a facial expression of the first person based on the video signal, or using a pose circumplex model to quantify a positiveness and an activeness of a body pose of the first person based on the video signal.
  • the determining the semantic emotion of the first person based on the audio signal includes using a language circumplex model to quantify a positiveness and an activeness of language of the first person based on the audio signal.
  • the altering the video signal to produce the altered video signal includes at least one of altering image data of the video signal to reduce a distance between the positiveness and the activeness of the facial expression of the first person and the positiveness and the activeness of the language of the first person, or altering image data of the video signal to reduce a distance between the positiveness and the activeness of the body pose of the first person and the positiveness and the activeness of the language of the first person.
  • the determining the one or more types of perceived emotions of the first person is also based on the audio signal and includes using a speech circumplex model to quantify a positiveness and an activeness of speech of the first person based on the audio signal.
  • the method can further comprise altering audio data of the audio signal to produce an altered audio signal to reduce a distance between the positiveness and activeness of the speech of the first person and the positiveness and the activeness of the language of the first person.
  • the method further comprises providing the altered video signal and the altered audio signal to a subsystem associated with the second person that is participating in the video chat, to thereby enable the second person to view and listen to images and audio of the first person that have increased alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
  • a subsystem comprises one or more interfaces and one or more processors.
  • the one or more interfaces are configured to receive a video signal and an audio signal of a first person participating in a video chat with a second person.
  • the one or more processors are communicatively coupled to the one or more interfaces and are configured to determine one or more types of perceived emotions of the first person based on the video signal, and determine a semantic emotion of the first person based on the audio signal.
  • the one or more processors are also configured to alter the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
  • the subsystem can also include one or more cameras configured to obtain the video signal, and one or more microphones configured to obtain the audio signal.
  • the one or more processors implement one or more neural networks that are configured to determine the perceived emotion of the first person based on the video signal and determine the semantic emotion of the first person based on the audio signal.
  • the one or more processors implement one or more neural networks that are configured to alter the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
  • the one or more processors are configured to detect at least one of a facial expression or a body pose of the first person based on the video signal, and determine at least one of a facial expression perceived emotion or a body pose perceived emotion of the first person based on the at least one of the facial expression or the body pose of the first person.
  • the one or more processors are also configured to perform audio signal processing of the audio signal to determine at least one of pitch, vibrato or inflection of speech of the first person, and determine a speech perceived emotion of the first person based on results of the audio signal processing, and alter the audio signal to increase alignment between the speech perceived emotion of the first person and the semantic emotion of the first person.
  • the one or more processors are configured to modify image data of the video signal corresponding to at least one of a facial expression or a body pose, to thereby alter the video signal to produce the altered video signal; and modify audio data of the audio signal corresponding to at least one of the pitch, vibrato, or inflection, to thereby alter the audio signal to produce the altered audio signal.
  • the one or more processors are configured to perform natural language processing of the audio signal, and determine the semantic emotion of the first person based on results of the natural language processing of the audio signal.
  • the one or more processors are configured to use a facial circumplex model to quantify a positiveness and an activeness of a facial expression of the first person based on the video signal, use a pose circumplex model to quantify a positiveness and an activeness of a body pose of the first person based on the video signal, and use a language circumplex model to quantify a positiveness and an activeness of language of the first person based on the audio signal.
  • the one or more processors are configured to alter image data of the video signal to reduce a distance between the positiveness and the activeness of the facial expression of the first person and the positiveness and the activeness of the language of the first person, and reduce a distance between the positiveness and the activeness of the body pose of the first person and the positiveness and the activeness of the language of the first person.
  • the one or more processors are also configured to use a speech circumplex model to quantify a positiveness and an activeness of speech of the first person based on the audio signal, and alter audio data of the audio signal to reduce a distance between the positiveness and activeness of the speech of the first person and the positiveness and the activeness of the language of the first person.
  • the subsystem comprises a transmitter configured to transmit the altered video signal and the altered audio signal to a subsystem associated with the second person that is participating in the video chat, to thereby enable the second person to view and listen to video and audio of the first person having increased alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
  • a non-transitory computer-readable medium storing computer instructions that when executed by one or more processors cause the one or more processors to perform the steps of: obtaining a video signal and an audio signal of a first person participating in a video chat with a second person; determining one or more types of perceived emotions of the first person based on the video signal; determining a semantic emotion of the first person based on the audio signal; and altering the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
  • the non-transitory computer-readable medium can also store computer instructions that when executed by one or more processors cause the one or more processors to perform additional steps of the methods summarized above, and described in additional detail below.
  • FIG. 1 illustrates an exemplary system that enables first and second persons to participate in a video chat.
  • FIGS. 2, 3, and 4 illustrate systems according to various embodiments of the present technology that enable first and second persons to participate in a video chat, and which also modify the audio and video of at least the first person so that the audio and video of the first person that are heard and seen by the second person differ from actual audio and video of the first person.
  • FIG. 5 illustrates a modification subsystem, according to an embodiment of the present technology, which can be used to modify audio and video signals of a person participating in a video chat.
  • FIG. 6 illustrates additional details of an emotion detector and an emotion modifier of the modification subsystem introduced in FIG. 5.
  • FIG. 7A illustrates a general circumplex model
  • FIG. 7B illustrates a facial circumplex model
  • FIG. 7C illustrates a pose circumplex model
  • FIG. 7D illustrates a speech circumplex model.
  • FIG. 8 illustrates how different types of perceived emotions and a semantic emotion can be mapped to a circumplex model, how distances between perceived emotions and the semantic emotion can be determined, and how such distances can be reduced to increase alignment between the different types of perceived emotions and a semantic emotion.
  • FIG. 9 illustrates a high level flow diagram that explains how distances between perceived emotions and a semantic emotion can be used to determine whether to modify certain features of video and audio signals to increase the alignment between the perceived emotions and the semantic emotion.
  • FIG. 10 illustrates a high level flow diagram that is used to summarize methods according to certain embodiments of the present technology.
  • FIG. 11 illustrates exemplary components of an exemplary mobile computing device with which embodiments of the present technology can be used.
  • Certain embodiments of the present technology alter video and audio signals of a first person participating in a video chat with a second person, such that when the altered signals are played for the second person, what is seen and heard by the second person differs from the originally captured video and audio signals.
  • Where first and second persons are participating in a video chat while both of them are driving vehicles, such embodiments of the present technology can prevent the anger of the first person from being passed on to the second person, in the event that the first person experiences road rage while participating in the video chat.
  • Where a first person driving a vehicle is participating in a business related video chat with one or more other persons, such embodiments of the present technology can prevent the anger of the first person from being witnessed by the other person(s), thereby preventing the business relations between the first person and the one or more other persons from being ruined or otherwise adversely affected.
  • one or more types of perceived emotions of the first person can be determined based on a video signal (and potentially also an audio signal) of the first person, and a semantic emotion of the first person can be determined based on the audio signal of the first person.
  • Video and audio signals of the first person can then be modified (also referred to as altered) such that the resulting modified video and audio of the first person are more closely aligned with the semantic emotion of the first person than the originally captured video and audio signals were. More specifically, video and audio signals are modified to reduce differences between one or more types of perceived emotions of a person and the semantic emotion of the person.
  • a perceived emotion generally relates to a first person’s emotional state that a second person becomes aware of using the second person’s senses, e.g., using the sight and hearing of the second person.
  • semantic emotion generally relates to a first person’s emotional state that a second person becomes aware of using the second person’s understanding of the verbal language (also referred to as spoken language, or more succinctly as language) that is spoken by the first person.
  • the video of the first person is altered such that when the video is played for the second person, the body language of the first person has been changed from negative body language to positive body language, so that the first person’s body language is more aligned with the positive spoken language that they used.
  • the audio of the first person may also be altered, e.g., to change pitch, vibrato and/or inflection of the first person’s voice to be more aligned with the positive spoken language that they used.
  • FIG. 1 illustrates an exemplary system that enables first and second persons to participate in a video chat.
  • blocks 110A and 110B are representative of first and second persons that are participating in a video chat using respective client computing devices, which are also referred to herein more generally as audio-video (A-V) subsystems 120A and 120B.
  • the A-V subsystems 120A and 120B can be referred to collectively as the A-V subsystems 120, or individually as an A-V subsystem 120.
  • the first and second persons 110A and 110B can be referred to collectively as the persons 110, or individually as a person 110.
  • each of the A-V subsystems 120 can include at least one microphone that is used to obtain an audio signal, and at least one camera that is used to obtain a video signal.
  • At least one camera can be an RGB/NIR (Near Infrared) camera including an image sensor (e.g., CMOS image sensor) that can be used to capture multiple two dimensional RGB/NIR images per second (e.g., 30 images per second).
  • At least one further camera can be a depth camera that produces depth images, rather than RGB/NIR images, e.g., using structured light and/or time-of-flight (TOF) sensors to recreate a 3D structure on point clouds, or the like.
  • the A-V subsystem 120A can be capable of playing video and audio of a second person (e.g., 110B) for the first person 110A
  • the A-V subsystem 120B can be capable of playing video and audio of a first person (e.g., 110A) for the second person 110B
  • each of the A-V subsystems 120 can include at least one audio-speaker that is used to output audible sounds, and at least one display that is used to display video images.
  • One or both of the A-V subsystems 120A and 120B can be an in-cabin computer system or a mobile computing device, such as, but not limited to, a smartphone, tablet computer, notebook computer, laptop computer, or the like.
  • one or both of the audio-video subsystems 120A and 120B, or portions thereof, can include a microphone, camera, audio speaker, and/or display that is built into a vehicle, e.g., as part of a vehicle’s entertainment system.
  • At least one microphone of the A-V subsystem 120A obtains an audio signal of the first person 110A, and at least one camera of the A-V subsystem 120A obtains a video signal of the first person 110A.
  • at least one microphone of the A-V subsystem 120B obtains an audio signal of the second person 110B, and at least one camera of the A-V subsystem 120B obtains a video signal of the second person 110B.
  • the audio and video signals of the first person 110A, which are obtained by the A-V subsystem 120A, are sent via one or more communication networks 130 to the A-V subsystem 120B.
  • the audio and video signals of the second person 110B, which are obtained by the A-V subsystem 120B, are sent via one or more communication networks 130 to the A-V subsystem 120A.
  • the communication network(s) 130 can be any wired or wireless local area network (LAN) and/or wide area network (WAN), such as an intranet, an extranet, or the Internet, or a combination thereof, but are not limited thereto. It is sufficient that the communication network 130 provides communication capability between the A-V subsystems 120, and optional other devices and systems. In some implementations, the communication network(s) 130 use the Hypertext Transfer Protocol (HTTP) to transport information using the Transmission Control Protocol/Internet Protocol (TCP/IP). HTTP permits A-V subsystems 120 to access various resources available via the communication network(s) 130.
  • At least one audio-speaker of the A-V subsystem 120A uses the audio signal of the second person 110B to output audible sounds of (e.g., words spoken by) the second person 110B, which can be listened to by the first person 110A.
  • At least one display of the A-V subsystem 120A uses the video signal of the second person 110B to display video images of the second person 110B, which can be viewed by the first person 110A.
  • at least one audio-speaker of the A-V subsystem 120B uses the audio signal of the first person 110A to output audible sounds of (e.g., words spoken by) the first person 110A, which can be listened to by the second person 110B.
  • At least one display of the A-V subsystem 120B uses the video signal of the first person 110A to display video images of the first person 110A, which can be viewed by the second person 110B.
  • non-modified versions of the audio and video signals of the first person 110A (which were obtained by the A-V subsystem 120A) are used to output and display audio and video of the first person 110A to the second person 110B (using the A-V subsystem 120B, which is in proximity to the second person 110B).
  • the second person 110B will see the angry facial expression and angry body pose of the first person 110A and will hear the angry tone of the first person 110A.
  • Body pose, as used herein, also encompasses hand pose.
  • the audio and video signals of the first person 110A are modified prior to being provided to the A-V subsystem 120B, which results in the audio and video of the first person 110A that is listened to and seen by the second person 110B being different than what the first person 110A actually looked like and sounded like.
  • Such modifications to the audio and video signals of the first person 110A can be performed by the same A-V subsystem that obtains the audio and video signals. More specifically, as shown in FIG. 2, an A-V and modification subsystem 220A can obtain audio and video signals of the first person and modify such signals before providing such signals to the communication network(s) 130 that provide the modified audio and video signals to the A-V subsystem 120B in proximity to the second person 110B.
  • Modifications to the audio and video signals of the first person 110A can also be performed by a further subsystem that is separate from the A-V subsystem 120A that obtains the audio and video signals of the first person 110A. For example, as shown in FIG. 3, a modification subsystem 320A can receive audio and video signals of the first person 110A, and the modification subsystem 320A can modify such signals before providing such signals to the communication network(s) 130 that provide the modified audio and video signals to the A-V subsystem 120B in proximity to the second person 110B.
  • the modification subsystem 420 can provide the modified audio and video signals of the first person 110A via the communication network(s) 130 to the A-V subsystem 120B in proximity to the second person 110B.
  • Other variations are also possible and within the scope of the embodiments described herein. While not shown in the FIGS., video and audio signals of the second person 110B can also be provided to a similar modification subsystem to modify such signals so that perceived emotions of the second person appear more aligned with the semantic emotion of the second person 110B.
  • the audio and video signals of the first person 110A, which are captured or otherwise obtained by the A-V subsystem 120A (in FIGS. 1, 3 and 4) or by the A-V and modification subsystem 220A (in FIG. 2), can also be referred to as captured audio and video signals of the first person 110A.
  • FIG. 5 shows a modification subsystem 520 that receives captured audio and video signals from an A-V subsystem 120A, or is part of an A-V and modification subsystem 220A. As shown in FIG. 5, the modification subsystem 520 includes an emotion detection block 530 (which can also be referred to as an emotion detector 530) and an emotion modification block 540 (which can also be referred to as an emotion modifier 540).
  • the emotion detector 530 can, for example, detect negative, positive, and/or neutral emotions of the first person 110A. Exemplary negative emotions include, but are not limited to, angry, nervous, distracted, and frustrated.
  • the emotion modifier 540 can, for example, modify audio and video signals so that one or more types of perceived emotions of the first person 110A in the modified audio and video signals are neutral or positive emotions. Exemplary neutral or positive emotions include, but are not limited to, happy, calm, alert, and pleased. Additional details of the emotion detector 530 and emotion modifier 540, according to specific embodiments of the present technology, are discussed below with reference to FIG. 6.
  • the emotion detector 530 is shown as including a facial detection block 610 (also referred to as a facial detector 610) and a facial expression recognition block 612 (also referred to as a facial expression recognizer 612).
  • the emotion detector 530 is also shown as including a skeletal detection block 614 (also referred to as a skeletal detector 614) and a pose recognition block 616 (also referred to as a pose recognizer 616).
  • the facial detector 610 and the skeletal detector 614 are shown as receiving a video signal 602.
  • the video signal 602 can be, e.g., a video signal of the first person 110A captured by the A-V subsystem 120A, and more specifically, one or more cameras thereof.
  • the emotion detector 530 is also shown as including an audio signal processing block 624 (also referred to as an audio signal processor 624 or an audio signal analyzer 624) and a natural language processing block 626 (also referred to as a natural language processor 626 or a natural language analyzer 626).
  • the audio signal analyzer 624 and the natural language analyzer 626 receive an audio signal 622.
  • the audio signal 622 can be, e.g., an audio signal of the first person 110A captured by the A-V subsystem 120A, or more specifically, a microphone thereof.
  • the video signal 602 and the audio signal 622 are presumed to be digital signals, unless specifically stated otherwise.
  • Interfaces 603 and 623 can receive the video signal 602 and the audio signal 622, e.g., respectively from a camera and a microphone, or from one or more other subsystems.
  • the facial detector 610 can detect a person’s face within an image, and can also detect facial features within the image.
  • Already developed (or future developed) computer vision techniques can be used by the facial detector 610 to detect such facial features.
  • the HSV (Hue-Saturation-Value) color model or some other computer vision technique can be used to detect a face within an image.
  • a feature detection model or some other computer vision technique can be used to identify facial features, such as, but not limited to, eyes, nose, lips, chin, cheeks, eyebrows, forehead, and/or the like.
  • Feature detection can also be used to detect wrinkles in specific facial regions, such as on the forehead, on the sides of the mouth, and/or around the eyes.
  • a person’s face and their facial features can be identified using bounding boxes. Some features to be identified can be contained within other features, such as eyes on a user's face, in which case successive bounding boxes may be used to first identify the containing feature (e.g., a face) and then to identify the contained feature (e.g., each eye of a pair of eyes). In other embodiments, a single bounding box may be used to identify each distinct feature.
  • one or more algorithm libraries such as the OpenCV (http://opencv.willowgarage.com/wiki/) computer vision library and/or the Dlib algorithm library (http://dlib.net/), can be used to identify these facial features and to generate bounding boxes.
  • a bounding box need not be rectangular, but rather can be another shape, such as, but not limited to, an ellipse.
  • a machine learning technique such as boosting, may be used to increase a confidence level in the detection of facial features (e.g., eyes, nose, lips, etc.). More generally, data sets can be used to train a deep neural network (DNN) and/or other computer model to detect facial features from images, and the trained DNN and/or other computer model can be thereafter used for facial feature recognition.
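  • For illustration only, the following minimal Python sketch shows one way a facial detection stage of the kind described above could be assembled from off-the-shelf OpenCV Haar cascades, with successive bounding boxes (a face box, then eye boxes inside it). The cascade choices, parameters, and function names are assumptions made for the sketch, not the patent’s implementation.

```python
# Illustrative sketch only: one possible face/feature detector using OpenCV
# Haar cascades. Cascade choices and parameters are assumptions, not the
# patent's implementation.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_face_and_eyes(frame_bgr):
    """Return face bounding boxes and eye bounding boxes; eyes are searched
    only inside each detected face region (successive bounding boxes)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    results = []
    for (x, y, w, h) in faces:
        face_roi = gray[y:y + h, x:x + w]
        eyes = eye_cascade.detectMultiScale(face_roi)
        # Eye boxes are reported relative to the containing face box, so shift
        # them back into full-image coordinates.
        results.append({"face": (x, y, w, h),
                        "eyes": [(x + ex, y + ey, ew, eh) for (ex, ey, ew, eh) in eyes]})
    return results
```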
  • the facial expression recognizer 612 can determine a person’s facial expression.
  • a human face consists of different parts such as a chin, a mouth, eyes and a nose, as noted above. Shape, structure and size of those facial features can vary with different facial expressions. Additionally, with certain facial expressions, wrinkles in specific facial locations may change. For example, the shape of a person’s eyes and mouth can be employed to distinguish between different facial expressions, as can the wrinkles on a person’s forehead, and/or the like.
  • Based at least in part on the detected facial expressions of a person, one or more types of perceived emotions of the person can be determined by the perceived emotion detector 632 in FIG. 6. Exemplary perceived emotions that may be detected based at least in part on detected facial expressions include, but are not limited to, angry, nervous, distracted, and frustrated, as well as happy, calm, alert, and pleased. Certain techniques for quantifying perceived emotions are described below.
  • the skeletal detector 614 can use a skeletal detection model or some other computer vision technique to identify human body parts and joints, such as, but not limited to, arms, hands, elbows, wrists, and/or the like.
  • the pose recognizer 616 can detect specific poses, such as whether a person is holding a steering wheel of a vehicle with both of their hands while driving a vehicle, or whether the person has one of their arms raised with their hand in a fist while driving a vehicle.
  • Data sets can be used to train a deep neural network (DNN) and/or other computer model to detect human poses from images, and the trained DNN and/or other computer model can be thereafter used for pose recognition.
  • the pose recognizer 616 can determine a person’s pose.
  • a human body consists of different parts such as a head, neck, torso, upper arms, elbows, forearms, wrists, hands, etc.
  • the overall and relative locations and orientations of such body parts can vary with different body poses, and thus with a person’s emotional state. For example, while a person is driving a vehicle, that person will often have both of their hands on the steering wheel of the vehicle, but may raise one of their arms and make a fist if the person becomes angry, e.g., because a driver of another vehicle caused the person to stop short, swerve, and/or the like.
  • a detected pose can also be used to determine a perceived emotion of a person, as represented by the line from the pose recognizer 616 to the perceived emotion detector 632.
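  • As a purely illustrative sketch (the joint names, the image coordinate convention, and the thresholds below are assumptions, not taken from the patent), detected joint positions could be mapped to a coarse pose label with simple geometric rules, which a perceived emotion detector could then consume:

```python
# Illustrative sketch only: mapping detected joint positions to a coarse pose
# label. Joint names, the image coordinate convention (y grows downward), and
# the distance threshold are assumptions chosen for illustration.

def classify_driving_pose(joints, wheel_center, wheel_radius):
    """joints: dict of joint name -> (x, y) pixel coordinates."""
    def near_wheel(p):
        return ((p[0] - wheel_center[0]) ** 2 +
                (p[1] - wheel_center[1]) ** 2) ** 0.5 < wheel_radius

    hands_on_wheel = near_wheel(joints["left_wrist"]) and near_wheel(joints["right_wrist"])

    # A raised arm: wrist clearly above (smaller y than) the corresponding shoulder.
    raised_arm = (joints["left_wrist"][1] < joints["left_shoulder"][1] or
                  joints["right_wrist"][1] < joints["right_shoulder"][1])

    if hands_on_wheel:
        return "both_hands_on_wheel"   # typically a calm / neutral driving pose
    if raised_arm:
        return "arm_raised"            # may indicate agitation, e.g., a raised fist
    return "other"
```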
  • the audio signal analyzer 624 and the natural language analyzer 626 receive an audio signal 622.
  • the audio signal 622 can be, e.g., an audio signal of the first person 110A captured by the A-V subsystem 120A.
  • the audio signal analyzer 624 can analyze the audio signal 622 to detect various features of the audio signal 622 that may vary in dependence on the emotional state of a person. Examples of such audio features include pitch, vibrato, and inflection. Pitch relates to the frequency of a signal, and thus can be quantified as a frequency.
  • Changes in the pitch of a person’s voice are often correlated with an arousal state, or more generally an emotional state, of the person. For example, increases in pitch often correlate with highly aroused states such as anger, joy, or fear, while decreases in pitch often correlate with low arousal states such as sadness or calmness.
  • Vibrato is a periodic modulation of the pitch (e.g., fundamental frequency) of a person’s voice, occurring with a given rate and depth. Vibrato also relates to jitter, and it often correlates with changes in emotion. Increased fluctuations in pitch, and thus vibrato, can, e.g., be indicative of increases in happiness, distress, or fear.
  • results of the audio signal analysis, performed by the audio signal analyzer 624 can also be used to determine the perceived emotion of a person, as represented by the line from the audio signal analyzer 624 to the perceived emotion detector 632.
  • certain changes in a specific audio feature can be indicative of either an increase in a positive emotion (e.g., happiness) or a negative emotion (e.g., anger). For example, increases in happiness or fear can both cause an increase in pitch.
  • By analyzing multiple vocal features, alone or in combination with facial expression and/or body pose, it is possible to determine relatively accurate perceived emotions of a person.
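  • A rough sketch of how such vocal features might be extracted is shown below. It uses the librosa library’s YIN pitch tracker as one possible tool, and the vibrato and inflection measures are simplified proxies chosen for illustration rather than the patent’s own algorithms.

```python
# Illustrative sketch only: simple proxies for pitch, vibrato, and inflection.
# librosa-based pitch tracking is one possible choice; the vibrato and
# inflection measures here are simplified assumptions, not the patent's method.
import numpy as np
import librosa

def vocal_features(audio_path):
    y, sr = librosa.load(audio_path, sr=16000)
    # Fundamental frequency (pitch) track over short analysis hops.
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
    mean_pitch = float(np.mean(f0))                 # overall pitch level (Hz)
    # Vibrato proxy: depth of pitch modulation around the mean.
    vibrato_depth = float(np.std(f0))
    # Inflection proxy: pitch range over roughly the first 500 ms of the clip
    # (default hop length for librosa.yin is 512 samples).
    frames_500ms = max(2, int(0.5 * sr / 512))
    first = f0[:frames_500ms]
    inflection = float(first.max() - first.min())
    return {"pitch_hz": mean_pitch,
            "vibrato_hz": vibrato_depth,
            "inflection_hz": inflection}
```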
  • the natural language analyzer 626 performs natural language processing (NLP) of the audio signal 622, the results of which are used to determine the semantic emotion of a person, as represented by the line from the natural language analyzer 626 to the semantic emotion detector 634.
  • the NLP that is performed by the natural language analyzer 626 can include speech recognition that provides a textual representation of a person’s speech. In natural speech there are hardly any pauses between successive words, and thus speech segmentation can be a subtask of speech recognition, wherein speech segmentation involves separating a sound clip of a person into multiple words.
  • the natural language analyzer 626 can be configured to recognize a single language, or multiple different languages, such as English, Chinese, Spanish, French, and German, just to name a few. Where the natural language analyzer 626 is capable of performing NLP for multiple different languages, an output of the natural language analyzer 626 can include an indication of the specific language that a person is speaking.
  • the perceived emotion detector 632 can use one or more look up tables (LUTs) to determine one or more types of perceived emotions associated with a person based on outputs of the facial expression analyzer 612, the pose recognizer 616, and the audio signal analyzer 624.
  • the output of the facial expression analyzer 612 can specify one or more facial expression features of the person determined based on a video signal 602 of the person
  • the output of the pose recognizer 616 can specify one or more body poses of the person determined based on the video signal 602 of the person
  • the output of the audio signal analyzer 624 can specify one or more audio features determined based on the audio signal 622.
  • the perceived emotion detector 632 can be implemented by one or more DNN and/or one or more other computer models that is/are trained based on perceived emotion training data, which can include facial expression training data, body pose training data, speech training data, and/or other perceived emotion training data.
  • the semantic emotion detector 634 can use one or more look up tables (LUTs) to determine a semantic emotion associated with a person based on outputs of the natural language analyzer 626.
  • the output of the natural language analyzer 626 can specify words and sentences spoken by a person as determined based on the audio signal 622, and can also indicate the language being spoken.
  • the semantic emotion detector 634 can be implemented by one or more DNN and/or other computer models that is/are trained based on semantic emotion training data.
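  • For illustration, a very small look-up-table arrangement of the kind described above might look like the following; the labels and arousal/valence values are invented placeholders rather than values from the patent.

```python
# Illustrative sketch only: tiny look-up tables mapping recognized features to
# (arousal, valence) points on a circumplex model. All labels and numeric
# values are invented placeholders.

FACIAL_LUT = {               # output labels of the facial expression recognizer
    "smile":         (0.4,  0.7),
    "frown":         (0.5, -0.6),
    "furrowed_brow": (0.7, -0.7),
}
POSE_LUT = {                 # output labels of the pose recognizer
    "both_hands_on_wheel": (0.1,  0.2),
    "arm_raised":          (0.8, -0.6),
}
WORD_LUT = {                 # tiny sentiment lexicon for the semantic detector
    "great": (0.5, 0.8), "thanks": (0.2, 0.6), "terrible": (0.6, -0.8),
}

def perceived_emotion(facial_label, pose_label):
    return {"facial": FACIAL_LUT.get(facial_label, (0.0, 0.0)),
            "pose": POSE_LUT.get(pose_label, (0.0, 0.0))}

def semantic_emotion(words):
    hits = [WORD_LUT[w] for w in words if w in WORD_LUT]
    if not hits:
        return (0.0, 0.0)                 # neutral when no lexicon hits
    aro = sum(a for a, _ in hits) / len(hits)
    val = sum(v for _, v in hits) / len(hits)
    return (aro, val)
```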
  • outputs of the perceived emotion detector 632 and the semantic emotion detector 634 are also shown as being provided to an emotion modification block 540, which can also be referred to as an emotion modifier 540.
  • the emotion modifier 540 is also shown as receiving the captured video signal 602 and the captured audio signal 622.
  • the emotion modifier 540 is shown as including a facial expression modification block 642, a pose modification block 646, and an audio modification block 648, which can also be referred to respectively as a facial expression modifier 642, a pose modifier 646, and an audio modifier 648.
  • the perceived emotion detector 632 can determine one or more types of perceived emotions of a person based on a detected facial expression, a detected body pose determined based on the video signal 602, and detected audio features (e.g., pitch, vibrato, and inflection) determined based on the audio signal 622.
  • the semantic emotion detector 634 determines the semantic emotion of the person based on their spoken language using NLP.
  • the facial expression modifier 642 modifies facial expression image data of the video signal 602 to increase an alignment between a facial expression perceived emotion of the person (as determined by the perceived emotion detector 632) and the semantic emotion of the person (as determined by the semantic emotion detector 634).
  • the pose modifier 646 modifies image data of the video signal 602 to increase an alignment between a body pose perceived emotion of the person (as determined by the perceived emotion detector 632) and the semantic emotion of the person (as determined by the semantic emotion detector 634).
  • the audio modifier 648 modifies audio data of the audio signal 622 to increase an alignment between a speech perceived emotion of the person (as determined by the perceived emotion detector 632) and the semantic emotion of the person (as determined by the semantic emotion detector 634).
  • the emotion modifier 540 is shown as outputting a modified video signal 652 and a modified audio signal 662.
  • Certain embodiments of the present technology rely on a presumption that a person’s emotions that are responsive to and/or caused by environmental factors are recognizably different in a certain feature space from a person’s emotions responsive to and/or caused by conversational context, and that a difference between the emotions responsive to and/or caused by environmental factors and the emotions responsive to and/or caused by conversational context can be quantified.
  • a feature space that is used to quantify differences between perceived and semantic emotions is the feature space defined by the arousal/valence circumplex model, which was initially developed by James Russell and published in the article entitled "A circumplex model of affect" in the Journal of Personality and Social Psychology, Vol. 39(6), Dec. 1980, pages 1161-1178.
  • the arousal/valence circumplex model, which can also be referred to more succinctly as the circumplex model, suggests that emotions are distributed in a two-dimensional circular space, containing arousal and valence dimensions. Arousal corresponds to the vertical axis and valence corresponds to the horizontal axis, while the center of the circle corresponds to a neutral valence and a medium level of arousal. In this model, emotional states can be represented at any level of valence and arousal, or at a neutral level of one or both of these factors.
  • the perceived emotion detector 632 uses one or more arousal/valence circumplex models to determine three types of perceived emotions separately, based on facial expression, body pose, and speech. More specifically, in certain embodiments a facial circumplex model is used to determine an arousal and a valence associated with the person’s facial expression; a pose circumplex model is used to determine an arousal and a valence associated with the person’s body pose; and a speech circumplex model is used to define an arousal and a valence associated with the person’s speech.
  • the valence dimension is represented on a horizontal axis and ranges between positive and negative valences.
  • the positive and negative valences are also known, respectively, as pleasant and unpleasant emotions, or more generally as positiveness.
  • the arousal dimension is represented on a vertical axis, which intersects the horizontal “valence” axis, and ranges between activated and deactivated.
  • the activated and deactivated arousals are also known, respectively, as intense and non-intense arousals, or more generally as activeness.
  • a general circumplex model is illustrated in FIG. 7A
  • a facial circumplex model is illustrated in FIG. 7B
  • a pose circumplex model is illustrated in FIG. 7C
  • the speech circumplex model is illustrated in FIG. 7D.
  • feature vectors generated from facial expression detection, pose detection, and speech detection algorithms are input to a DNN.
  • the facial expression detection can be performed by the facial detector 610 and facial expression analyzer 612, discussed above with reference to FIG. 6.
  • Results of the facial expression detection can be one or more facial feature vectors.
  • the pose detection can be performed by the skeletal detector 614 and the pose recognizer 616.
  • Results of the body pose detection can be one or more pose feature vectors.
  • the speech detection can be performed by the audio signal analyzer 624.
  • Results of the speech detection can be one or more speech feature vectors.
  • the aforementioned feature vectors are concatenated together and fed into a DNN. Such a DNN can be used to implement the perceived emotion detector 632 in FIG. 6.
  • outputs of the DNN, which implements the perceived emotion detector 632, are six values denoted as {aro_f, val_f, aro_p, val_p, aro_s, val_s}, where “aro” refers to arousal and “val” refers to valence, and where the subscripts f, p, and s refer respectively to facial, pose, and speech.
  • In other words, the outputs include an arousal value and a valence value indicative of a person’s facial expression, an arousal value and a valence value indicative of the person’s body pose, and an arousal value and a valence value indicative of the person’s speech.
  • these values are used to modify the person’s facial expression, body pose, and/or speech, as will be explained in additional detail below.
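  • A minimal PyTorch-style sketch of such a network is shown below. The layer sizes, the plain fully connected architecture, and the feature dimensions are assumptions made for illustration; the patent does not specify the network architecture.

```python
# Illustrative sketch only: a small fully connected network that takes the
# concatenated facial, pose, and speech feature vectors and regresses the six
# values {aro_f, val_f, aro_p, val_p, aro_s, val_s}. Layer and feature sizes
# are assumptions.
import torch
import torch.nn as nn

class PerceivedEmotionNet(nn.Module):
    def __init__(self, face_dim=128, pose_dim=34, speech_dim=40):
        super().__init__()
        in_dim = face_dim + pose_dim + speech_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 6),   # aro_f, val_f, aro_p, val_p, aro_s, val_s
            nn.Tanh(),          # keep outputs in a bounded [-1, 1] circumplex range
        )

    def forward(self, face_feats, pose_feats, speech_feats):
        # Concatenate the three feature vectors before the shared MLP.
        x = torch.cat([face_feats, pose_feats, speech_feats], dim=-1)
        return self.mlp(x)

# Example usage with random feature vectors (batch of 1):
net = PerceivedEmotionNet()
out = net(torch.randn(1, 128), torch.randn(1, 34), torch.randn(1, 40))
aro_f, val_f, aro_p, val_p, aro_s, val_s = out[0].tolist()
```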
  • modify and alter are used interchangeably herein.
  • a primary idea is to determine the context-dependence of emotions based on recognized speech.
  • textual instances are often represented as vectors in a feature space.
  • the number of features can often be as large as hundreds of thousands, and traditionally, these features have known meanings. For example, whether the instance has a particular word observed previously in the training data, whether the word is listed as a positive/negative term in the sentiment lexicon, and so on.
  • Using the NLP algorithm, the person’s semantic emotion can be estimated.
  • outputs of the DNN, which implements the semantic emotion detector 634, are two values denoted as {aro_sem, val_sem}, which collectively can be represented as Emo_sem.
  • a person’s semantic emotion Emo_sem is used to modify image data and audio data corresponding to the person’s facial expression, body pose, and/or speech, as will be explained in additional detail below.
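  • As a hedged illustration of the text-side representation described above, the sketch below builds a sparse feature vector from word-presence features and a tiny sentiment lexicon; the vocabulary, lexicon, and feature layout are invented for the example, and in practice the resulting vector would be fed to a trained model that outputs {aro_sem, val_sem}.

```python
# Illustrative sketch only: turning recognized speech into a feature vector of
# the kind described above (word-presence features plus sentiment-lexicon
# counts). Vocabulary and lexicon are invented placeholders.
import numpy as np

VOCAB = ["thanks", "great", "sorry", "late", "terrible"]   # assumed training vocabulary
POSITIVE = {"thanks", "great"}
NEGATIVE = {"sorry", "terrible"}

def text_features(words):
    presence = np.array([1.0 if w in words else 0.0 for w in VOCAB])
    pos_count = float(sum(w in POSITIVE for w in words))
    neg_count = float(sum(w in NEGATIVE for w in words))
    return np.concatenate([presence, [pos_count, neg_count]])

# Example: features for a recognized utterance; a trained semantic emotion
# model would map this vector to (aro_sem, val_sem).
features = text_features("thanks for waiting it was not terrible".split())
```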
  • the semantic emotion Emo_sem also denotes a point on a circumplex model, which can be mapped to the same circumplex model, as also shown in FIG. 8. Referring to FIG. 8, the “X” labeled 802 corresponds to the person’s perceived facial emotion, the “X” labeled 804 corresponds to the person’s perceived pose emotion, and the “X” labeled 806 corresponds to the person’s perceived speech emotion.
  • the locations of the Xs 802, 804, and 806 are defined by the six values denoted as {aro_f, val_f, aro_p, val_p, aro_s, val_s}, which were discussed above. More specifically, the location of the “X” labeled 802 is defined by the values aro_f and val_f, the location of the “X” labeled 804 is defined by the values aro_p and val_p, and the location of the “X” labeled 806 is defined by the values aro_s and val_s. Still referring to FIG. 8, the dot labeled 808 corresponds to the person’s semantic emotion Emo_sem. The location of the dot labeled 808 is defined by the values aro_sem and val_sem.
  • activeness is a measure of arousal
  • positiveness is a measure of valence.
  • a distance dist_i between any one of the perceived emotions and the semantic emotion Emo_sem can be calculated as the Euclidean distance between the corresponding points on the circumplex model, using the equation below: dist_i = sqrt((aro_i - aro_sem)^2 + (val_i - val_sem)^2), where the subscript i is f, p, or s.
  • the distance between one of the perceived emotions and the semantic emotion Emo_sem is indicative of how closely the perceived and semantic emotions are aligned. For example, where a distance between a specific perceived emotion (e.g., body pose) and the semantic emotion is relatively small, that is indicative of the perceived emotion being substantially aligned with the semantic emotion. Conversely, where a distance between a specific perceived emotion (e.g., body pose) and the semantic emotion is relatively large, that is indicative of the perceived emotion being substantially unaligned with the semantic emotion. In accordance with certain embodiments, for each type of determined perceived emotion, a distance between the perceived emotion and the semantic emotion is determined. This will result in three distance values being determined, one for each of facial expression, body pose, and speech.
  • Where a determined distance exceeds a specified distance threshold, it will be determined that the perceived emotion is substantially unaligned with the semantic emotion, and in response to that determination, the respective feature (e.g., facial expression, body pose, or speech) is modified to increase the alignment between the perceived emotion and the semantic emotion.
  • For example, where the distance for the facial expression exceeds the distance threshold, facial image data of the video signal is modified to produce a modified video signal where the facial perceived emotion is more aligned with the semantic emotion. Where the distance for the facial expression is within the distance threshold, facial image data of the video signal is not modified.
  • This determining of a respective distance, and comparison of the determined distance to a distance threshold, is also performed for body pose as well as for speech. Results of such comparisons are used to determine whether or not to modify body pose data of the video signal, and/or speech data of the audio signal.
  • At step 902, a distance between one of the perceived emotions (facial, pose, speech) and the semantic emotion is determined, and more specifically calculated, e.g., using the equation discussed above.
  • At step 904, the calculated distance is compared to the distance threshold, and at step 906 there is a determination of whether the calculated distance is within (i.e., less than) the distance threshold. If the calculated distance is not within the distance threshold (i.e., if the answer to the determination at step 906 is No), then the relevant signal or portion thereof is modified at step 908, before flow proceeds to step 910.
  • If the calculated distance is within the distance threshold (i.e., if the answer to the determination at step 906 is Yes), then flow goes to step 910 without any modification to the relevant signal or portion thereof.
  • the above summarized steps can be performed for each of the different types of perceived emotions, including facial, pose, and speech.
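  • A compact sketch of this per-feature decision logic (corresponding to steps 902 through 910 of FIG. 9) is given below; the distance threshold value is an arbitrary placeholder.

```python
# Illustrative sketch only: the per-feature distance/threshold decision of
# FIG. 9. The threshold value is an arbitrary placeholder, not from the patent.
import math

DIST_THRESHOLD = 0.5   # placeholder value chosen per deployment

def circumplex_distance(perceived, semantic):
    """Both arguments are (arousal, valence) pairs on the circumplex model."""
    return math.hypot(perceived[0] - semantic[0], perceived[1] - semantic[1])

def modification_plan(perceived_by_type, semantic):
    """perceived_by_type: {'facial': (aro, val), 'pose': (...), 'speech': (...)}.
    Returns, for each of the three features, whether it should be modified."""
    plan = {}
    for kind, point in perceived_by_type.items():
        dist = circumplex_distance(point, semantic)      # step 902
        plan[kind] = dist > DIST_THRESHOLD               # steps 904/906: modify only if unaligned
    return plan

# Example: only the pose is far from the semantic emotion, so only pose image
# data would be modified (step 908); the other signals pass through unchanged.
print(modification_plan(
    {"facial": (0.2, 0.5), "pose": (0.8, -0.6), "speech": (0.1, 0.4)},
    semantic=(0.1, 0.6)))
```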
  • video and audio of a first person are modified by generating synthetic image/audio to replace the original version thereof. More specifically, the originally obtained video and audio signals of a first person are modified to produce modified video and audio signals that, when viewed and listened to by a second person (or multiple other persons), have a perceived emotion that is more aligned with the semantic emotion of the first person. In accordance with certain embodiments, the perceived emotion of the generated image/audio should approach the semantic emotion as closely as possible.
  • the emotion modifier 540, which can also be referred to more specifically as the perceived emotion modifier 540, is shown as including a facial expression modifier 642, a pose modifier 646, and an audio modifier 648.
  • Each of the modifiers 642, 646, and 648, which can also be referred to as modules, uses algorithms to generate synthetic images or synthetic audio by modifying specific data within the captured audio and video signals. For a simple example, assume it is determined that a first person’s semantic emotion is happy, but their facial perceived emotion is nervous, their pose perceived emotion is upset, and their speech perceived emotion is stressed.
  • facial image data is modified so that the person’s facial expression is happy (rather than nervous)
  • pose image data is modified so that the person’s body pose is happy (rather than upset)
  • audio data is modified so that the person’s speech is happy (rather than stressed).
  • one or more DNN and/or other computer models can be used to modify the captured video and audio signals to produce the modified video and audio signals.
  • a generative adversarial network (GAN) is used to perform such modifications.
  • a GAN is a deep neural network architecture that includes two neural networks, namely a generative neural network and a discriminative neural network, that are pitted one against the other in a contest (thus the use of the term “adversarial”).
  • the generative neural network and the discriminative neural network can thus be considered sub-networks of the GAN neural network.
  • the generative neural network generates candidates while the discriminative neural network evaluates them.
  • the contest operates in terms of data distributions.
  • the generative neural network can learn to map from a latent space to a data distribution of interest, while the discriminative neural network can distinguish candidates produced by the generative neural network from the true data distribution.
  • the generative neural network's training objective can be to increase the error rate of the discriminative neural network (i.e., to "fool" the discriminative neural network by producing novel candidates that the discriminator thinks are not synthesized, i.e., that are part of the true data distribution).
  • a known dataset serves as the initial training data for the discriminative neural network. Training the discriminative neural network can involve presenting it with samples from the training dataset, until it achieves acceptable accuracy.
  • the generative neural network can be trained based on whether it succeeds in fooling the discriminative neural network.
  • the generative neural network can be seeded with a randomized input that is sampled from a predefined latent space (e.g., a multivariate normal distribution). Thereafter, candidates synthesized by the generative neural network can be evaluated by the discriminative neural network. Backpropagation can be applied in both networks so that the generative neural network produces better images, while the discriminative neural network becomes more skilled at flagging synthetic images.
  • the generative neural network can be, e.g., a deconvolutional neural network, and the discriminative neural network can be, e.g., a convolutional neural network.
  • the GAN should be trained prior to it being used to modify signals during a video chat.
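  • for orientation only, the following PyTorch-style sketch illustrates the adversarial training pattern summarized above, in which the discriminative neural network is trained to separate real samples from synthesized ones while the generative neural network is trained to fool it; the latent dimension, the loss choice, and the assumption that the discriminator outputs a single logit are illustrative, not details from this disclosure:

    # Minimal GAN training step (PyTorch); `generator`, `discriminator`, the optimizers,
    # and the real-data batches are assumed to be defined elsewhere.
    import torch
    import torch.nn as nn

    LATENT_DIM = 100                      # assumed size of the latent space
    bce = nn.BCEWithLogitsLoss()          # discriminator is assumed to output one logit per sample

    def train_step(generator, discriminator, real_batch, g_opt, d_opt):
        batch_size = real_batch.size(0)
        z = torch.randn(batch_size, LATENT_DIM)   # randomized input sampled from the latent space

        # Discriminator update: real samples labeled 1, synthesized samples labeled 0.
        d_opt.zero_grad()
        fake_batch = generator(z).detach()        # do not backpropagate into the generator here
        d_loss = bce(discriminator(real_batch), torch.ones(batch_size, 1)) + \
                 bce(discriminator(fake_batch), torch.zeros(batch_size, 1))
        d_loss.backward()
        d_opt.step()

        # Generator update: try to make the discriminator label synthesized samples as real.
        g_opt.zero_grad()
        g_loss = bce(discriminator(generator(z)), torch.ones(batch_size, 1))
        g_loss.backward()
        g_opt.step()
        return d_loss.item(), g_loss.item()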
  • the facial expression modifier 642 can be implemented by a GAN. More specifically, a GAN can be used to modify image data in a video signal to produce a modified video signal that can be used to display realistic images of a person where the images have been modified to make the person’s facial and pose perceived emotions more aligned with the person’s semantic emotion. A GAN can also be used to modify an audio signal so that a person’s speech perceived emotion is more aligned with the person’s semantic emotion. In specific embodiments, a StarGAN can be used to perform image and/or audio modifications. An article titled “StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation,” by Y.
  • a GAN can also be used to implement the pose modifier 646.
  • a pretrained visual generator model can be used to implement the pose modifier 646.
  • the original video signal 602 is provided to the skeletal detector 614.
  • the original video signal 602 can also be referred to as the original image stream 602.
  • the skeletal detector 614 extracts skeletal information from the original image stream.
  • the skeletal information can be represented as a vector X, which stores all the joint positions in a frame.
  • the vector X is combined with a semantic emotion signal, which is represented with a vector e.
  • the pretrained visual generator model can be implemented, e.g., with convolution layers, maxpooling layers, deconvolution layers, and batch normalization layers, but is not limited thereto.
  • the output of the pretrained visual generator model can be used to generate the modified video signal 652 which includes the modified body pose, which is more aligned with the semantic emotion.
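  • a highly simplified sketch of such a conditioned generator is shown below: the joint-position vector X is concatenated with the semantic-emotion vector e and decoded into an image through deconvolution (transposed convolution) and batch normalization layers; the joint count, emotion-vector length, layer sizes, and output resolution are illustrative assumptions, not details from this disclosure:

    # Sketch of a pose generator conditioned on skeletal joints and a semantic-emotion vector.
    import torch
    import torch.nn as nn

    NUM_JOINTS = 18    # assumed number of joints from the skeletal detector
    EMOTION_DIM = 8    # assumed length of the semantic-emotion vector e

    class PoseGenerator(nn.Module):
        def __init__(self):
            super().__init__()
            in_dim = NUM_JOINTS * 2 + EMOTION_DIM          # (x, y) per joint, plus e
            self.fc = nn.Linear(in_dim, 256 * 4 * 4)
            self.decoder = nn.Sequential(                  # deconvolution + batch-normalization stack
                nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),
                nn.BatchNorm2d(128), nn.ReLU(),
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
                nn.BatchNorm2d(64), nn.ReLU(),
                nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
                nn.Tanh(),                                 # 32x32 RGB output in this toy setup
            )

        def forward(self, joints_xy, emotion_e):
            x = torch.cat([joints_xy.flatten(1), emotion_e], dim=1)
            x = self.fc(x).view(-1, 256, 4, 4)
            return self.decoder(x)

    # Example usage with random inputs:
    # gen = PoseGenerator()
    # frame = gen(torch.randn(1, NUM_JOINTS, 2), torch.randn(1, EMOTION_DIM))  # -> (1, 3, 32, 32)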
  • the original audio signal 622 is shown as being provided to the audio signal analyzer 624, as was already noted above.
  • features of the audio signal that can be modified include the pitch, vibrato, and inflection, but are not limited thereto. More specifically, pitch can be shifted, where pitch-shift denotes the multiplication of the pitch of the original voice signal by a factor a. Increased pitch (a > 1) often correlates with highly aroused states such as happiness, while decreased pitch (a < 1) correlates with low valence, such as sadness.
  • Vibrato is a periodic modulation of the pitch (fundamental frequency) of the voice, occurring with a given rate and depth. Vibrato, also related to jitter, is frequently reported as a correlate of high arousal and is an important marker of emotion even in single vowels. Vibrato can be modified to alter the perceived emotion corresponding to speech. Inflection is a rapid modification (e.g., ~500 ms) of the pitch at the start of each utterance, which overshoots its target by several semitones but quickly decays to the normal value. The use of inflection leads to increased variation in pitch, which is associated with high emotional intensity and positive valence. Inflection can also be modified to alter the perceived emotion corresponding to speech.
  • An audio signal can also be filtered to alter the perceived emotion corresponding to speech, where filtering denotes the process of emphasizing or attenuating the energy contributions of certain areas of the frequency spectrum. For instance, high arousal emotions tend to be associated with increased high frequency energy, making the voice sound sharper and brighter. Where a person’s semantic emotion corresponds to a less activated arousal than the person’s perceived emotion corresponding to speech, high frequency energy within the audio signal can be attenuated using filtering, to make the perceived emotion more aligned with the semantic emotion.
  • the emotional tone of the modified audio signal should be recognizable, and the voice should sound natural and not be perceived as synthetic. As noted above, the terms modify and alter are used interchangeably herein.
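  • as a loose illustration of the pitch-shift and filtering operations described above, the following sketch uses librosa and scipy (an implementation choice, not one taken from this disclosure) to shift pitch by a factor a and to attenuate high-frequency energy; the cutoff frequency, filter order, and example shift factor are illustrative assumptions:

    # Sketch of two of the audio modifications described above: pitch shifting by a
    # factor `a`, and low-pass filtering to attenuate high-frequency energy.
    import numpy as np
    import librosa
    from scipy.signal import butter, lfilter

    def shift_pitch(y, sr, a):
        """Multiply the pitch of voice signal `y` by factor `a` (a > 1 raises pitch)."""
        n_steps = 12.0 * np.log2(a)            # convert a pitch ratio to semitones
        return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

    def attenuate_high_frequencies(y, sr, cutoff_hz=3000.0, order=4):
        """Low-pass filter that de-emphasizes the bright, 'sharp' high-frequency
        energy associated with highly aroused speech."""
        b, a_coeffs = butter(order, cutoff_hz / (sr / 2.0), btype="low")
        return lfilter(b, a_coeffs, y)

    # Example: soften speech whose perceived arousal is higher than the semantic emotion.
    # y, sr = librosa.load("speech.wav", sr=None)
    # y_mod = attenuate_high_frequencies(shift_pitch(y, sr, a=0.95), sr)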
  • One or more processors can be used to implement the above-described neural networks. Where multiple processors are used, they can be collocated or widely distributed, or combinations thereof.
  • step 1002 involves obtaining a video signal and an audio signal of a first person participating in a video chat with a second person.
  • step 1002 can be performed by an A-V subsystem (e.g., 120A), or more specifically, one or more cameras and one or more microphones of the A-V subsystem, or some other subsystem or system.
  • step 1004 involves determining one or more types of perceived emotions of the first person based on the video signal.
  • Step 1006 involves determining a semantic emotion of the first person based on the audio signal.
  • the types of perceived emotions that can be determined at step 1004 include a facial expression perceived emotion, a body pose perceived emotion, and a speech perceived emotion, as was described above.
  • the various types of perceived emotions can be determined, e.g., by the emotion detector 530, or more specifically, by the perceived emotion detector 632 thereof. More specifically, a facial expression and a body pose of the first person can be determined based on the video signal obtained at step 1002, and a facial expression perceived emotion and a body pose perceived emotion of the first person can be determined based thereon. Additionally, audio signal processing of the audio signal obtained at step 1002 can be performed to determine at least one of pitch, vibrato or inflection of speech of the first person, and a speech perceived emotion of the first person can be determined based on results of the audio signal processing. Additional and/or alternative variations are also possible and within the scope of the embodiments described herein.
  • a facial circumplex model is used to quantify a positiveness and an activeness of a facial expression of the first person based on the video signal;
  • a pose circumplex model is used to quantify a positiveness and an activeness of a body pose of the first person based on the video signal;
  • a speech circumplex model is used to quantify a positiveness and an activeness of speech of the first person based on the audio signal.
  • the semantic emotion determined at step 1006 can be determined, e.g., by the emotion detector 530, or more specifically, by the semantic emotion detector 634 thereof. As explained in additional detail above, step 1006 can involve performing natural language processing of the audio signal, and determining the semantic emotion of the first person based on results of the natural language processing of the audio signal. In accordance with specific embodiments, at step 1006 the semantic emotion of the first person is determined based on the audio signal using a language circumplex model to quantify a positiveness and an activeness of language of the first person based on the audio signal.
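  • a minimal sketch of how these circumplex quantities could be compared is given below: each detector is assumed to return a (positiveness, activeness) point on the circumplex plane, and the Euclidean distance between a perceived-emotion point and the language (semantic) point decides whether that modality is a candidate for modification; the detector outputs and the threshold value are hypothetical:

    # Sketch of comparing perceived-emotion points against the semantic-emotion point
    # on the circumplex plane; the example points and threshold are assumptions.
    import math

    def circumplex_distance(perceived, semantic):
        """Euclidean distance between two (positiveness, activeness) points."""
        return math.hypot(perceived[0] - semantic[0], perceived[1] - semantic[1])

    def modalities_to_modify(facial_pt, pose_pt, speech_pt, language_pt, threshold=0.3):
        """Return the modalities whose perceived emotion is misaligned with the
        semantic emotion derived from the language circumplex model."""
        points = {"facial": facial_pt, "pose": pose_pt, "speech": speech_pt}
        return [name for name, pt in points.items()
                if circumplex_distance(pt, language_pt) > threshold]

    # Example: a nervous face and an upset pose versus happy language.
    # modalities_to_modify((-0.2, 0.6), (-0.5, 0.4), (0.6, 0.45), (0.7, 0.4))
    # -> ['facial', 'pose']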
  • step 1008 involves altering the video signal and the audio signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
  • One or more computer implemented neural networks can be used to perform step 1008, as was described in detail above.
  • Other types of computer implemented models can alternatively or additionally be used to perform step 1008.
  • Step 1008 can involve modifying a facial expression and a body pose of image data included in the video signal, as well as modifying at least one of a pitch, vibrato, or inflection of audio data included in the audio signal.
  • step 1008 involves altering image data included in a video signal to reduce a distance between the positiveness and the activeness of the facial expression of the first person and the positiveness and the activeness of the language of the first person.
  • Step 1008 can also involve altering image data included in the video signal to reduce a distance between the positiveness and the activeness of the body pose of the first person and the positiveness and the activeness of the language of the first person.
  • step 1008 can also involve altering audio data included in an audio signal to reduce a distance between the positiveness and activeness of the speech of the first person and the positiveness and the activeness of the language of the first person.
  • step 1010 involves providing (e.g., transmitting) the altered video signal and the altered audio signal to a subsystem (e.g., device) associated with (e.g., in proximity to) the second person that is participating in the video chat, to thereby enable the second person to view and listen to modified images and audio of the first person that have increased alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
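  • purely to tie steps 1002 through 1010 together, the following pseudocode-style sketch strings hypothetical detector and modifier callables into one per-segment pipeline; none of the function names are from this disclosure, and the threshold is an assumption:

    # End-to-end sketch of steps 1002-1010 for one segment of a video chat.
    # `detectors` maps a modality name to a callable returning a (positiveness, activeness)
    # point; `modifiers` maps a modality name to a callable returning a modified signal.
    import math

    def process_chat_segment(video_signal, audio_signal, detectors, modifiers, transmit,
                             threshold=0.3):
        perceived = {                                              # step 1004
            "facial": detectors["facial"](video_signal),
            "pose": detectors["pose"](video_signal),
            "speech": detectors["speech"](audio_signal),
        }
        semantic = detectors["language"](audio_signal)             # step 1006

        for modality, point in perceived.items():                  # step 1008
            distance = math.hypot(point[0] - semantic[0], point[1] - semantic[1])
            if distance > threshold:
                if modality == "speech":
                    audio_signal = modifiers[modality](audio_signal, semantic)
                else:
                    video_signal = modifiers[modality](video_signal, semantic)

        transmit(video_signal, audio_signal)                       # step 1010: provide the
        return video_signal, audio_signal                          # altered signals downstream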
  • FIG. 11 illustrates exemplary components of an exemplary mobile computing device with which embodiments of the present technology can be used.
  • a mobile computing device can be used, e.g., to implement an A-V subsystem (e.g., 120A or 220A in FIGS. 1-4), but is not limited thereto.
  • FIG. 11 illustrates an exemplary mobile computing device 1102 with which embodiments of the present technology described herein can be used.
  • the mobile computing device 1102 can be a smartphone, such as, but not limited to, an iPhoneTM, a BlackberryTM, an AndroidTM-based or a WindowsTM-based smartphone.
  • the mobile computing device 1102 can alternatively be a tablet computing device, such as, but not limited to, an iPadTM, an AndroidTM-based or a WindowsTM-based tablet.
  • the mobile computing device 1102 can be an iPod TouchTM, or the like.
  • the mobile computing device 1102 is shown as including a camera 1104, an accelerometer 1106, a magnetometer 1108, a gyroscope 1110, a microphone 1112, a display 1114 (which may or may not be a touch screen display), a processor 1116, memory 1118, a transceiver 1120, a speaker 1122 and a drive unit 1124.
  • the camera 1104 can be used to obtain a video signal that includes images of a person using the mobile computing device 1102.
  • the microphone 1112 can be used to produce an audio signal indicative of what is said by a person using the mobile computing device 1102.
  • the accelerometer 1106 can be used to measure linear acceleration relative to a frame of reference, and thus, can be used to detect motion of the mobile computing device 1102 as well as to detect an angle of the mobile device 1102 relative to the horizon or ground.
  • the magnetometer 1108 can be used as a compass to determine a direction of magnetic north and bearings relative to magnetic north.
  • the gyroscope 1110 can be used to detect both vertical and horizontal orientation of the mobile computing device 1102, and together with the accelerometer 1106 and magnetometer 1108 can be used to obtain very accurate information about the orientation of the mobile computing device 1102. It is also possible that the mobile computing device 1102 includes additional sensor elements, such as, but not limited to, an ambient light sensor and/or a proximity sensor.
  • the display 1114, which may or may not be a touch screen type of display, can be used as a user interface to visually display items (e.g., images, options, instructions, etc.) to a user and accept inputs from a user.
  • the display 1114 can also be used to enable the user of the mobile computing device 1102 to participate in a video chat. Further, the mobile computing device 1102 can include additional elements, such as keys, buttons, a track-pad, a trackball, or the like, that accept inputs from a user.
  • the memory 1118 can be used to store software and/or firmware that controls the mobile computing device 1102, as well as to store images captured using the camera 1104, but is not limited thereto.
  • the drive unit 1124 (e.g., a hard drive, but not limited thereto) can also be used to store software that controls the mobile computing device 1102, as well as to store images captured using the camera 1104, but is not limited thereto.
  • the memory 1118 and the drive unit 1124 can include a machine readable medium on which is stored one or more sets of executable instructions (e.g., apps) embodying one or more of the methodologies and/or functions described herein.
  • the mobile computing device can include a solid-state storage device, such as those comprising flash memory or any form of non-volatile memory.
  • the term “machine-readable medium” as used herein should be taken to include all forms of storage media, either as a single medium or multiple media, in all forms; e.g., a centralized or distributed database and/or associated caches and servers; one or more storage devices, such as storage drives (including e.g., magnetic and optical drives and storage mechanisms); and one or more instances of memory devices or modules (whether main memory, cache storage either internal or external to a processor, or buffers).
  • the terms “machine-readable medium” or “computer-readable medium” shall be taken to include any tangible non-transitory medium which is capable of storing or encoding a sequence of instructions for execution by the machine and that cause the machine to perform any one of the methodologies.
  • non-transitory medium expressly includes all forms of storage drives (optical, magnetic, etc.) and all forms of memory devices (e.g., DRAM, Flash (of all storage designs), SRAM, MRAM, phase change, etc.), as well as all other structures designed to store information of any type for later retrieval.
  • the transceiver 1120, which is connected to an antenna 1126, can be used to transmit and receive data wirelessly using, e.g., Wi-Fi, cellular communications or mobile satellite communications.
  • the mobile computing device 1102 may also be able to perform wireless communications using Bluetooth and/or other wireless technologies. It is also possible the mobile computing device 1102 includes multiple types of transceivers and/or multiple types of antennas.
  • the transceiver 1120 can include a transmitter and a receiver.
  • the speaker 1122 can be used to provide auditory instructions, feedback and/or indicators to a user, playback recordings (e.g., musical recordings), as well as to enable the mobile computing device 1102 to operate as a mobile phone.
  • the speaker 1122 can also be used to enable the user of the mobile computing device 1102 to participate in a video chat.
  • the processor 1116 can be used to control the various other elements of the mobile computing device 1102, e.g., under control of software and/or firmware stored in the memory 1118 and/or drive unit 1124. It is also possible that there are multiple processors 1116, e.g., a central processing unit (CPU) and a graphics processing unit (GPU). The processor(s) 1116 can execute computer instructions (stored in a non-transitory computer-readable medium) to cause the processor(s) to perform steps used to implement the embodiments of the present technology described herein.
  • processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media.
  • computer readable media may comprise computer readable storage media and communication media.
  • Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
  • a computer readable medium or media does not include propagated, modulated, or transitory signals.
  • Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
  • modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
  • some or all of the software can be replaced by dedicated hardware logic components.
  • illustrative types of hardware logic components include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc.
  • the one or more processors can be in communication with one or more computer readable media/ storage devices, peripherals and/or communication interfaces.
  • a connection may be a direct connection or an indirect connection (e.g., via one or more other parts).
  • when an element is referred to as being connected or coupled to another element, the element may be directly connected to the other element or indirectly connected to the other element via intervening elements.
  • when an element is referred to as being directly connected to another element, then there are no intervening elements between the element and the other element.
  • Two devices are “in communication” if they are directly or indirectly connected so that they can communicate electronic signals between them.
  • the term “based on” may be read as “based at least in part on.”

Abstract

Described herein are methods and subsystems that alter video and audio signals of a person participating in a video chat to produce altered video and audio signals in which there is an increase in the alignment between one or more of the person's perceived emotions and the person's semantic emotion. Such a method can include obtaining a video signal and an audio signal of a first person participating in a video chat with a second person, determining one or more types of perceived emotions of the first person based on the video signal, and determining a semantic emotion of the first person based on the audio signal. The method also includes altering the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.

Description

METHODS AND SYSTEMS THAT PROVIDE
EMOTION MODIFICATIONS DURING VIDEO CHATS
FIELD
[0001] The disclosure generally relates to methods and systems for use during video chats, and in specific embodiments, relates to methods and systems that alter video and audio signals of a person participating in a video chat to produce altered video and audio signals in which there is an increase in alignment between one or more of the person’s perceived emotions and the person’s semantic emotion.
BACKGROUND
[0002] Drivers of automobiles and other types of vehicles often use their smartphones or other mobile computing devices to chat with other people while driving. Such a chat can be voice chat, or a video chat. For the purpose of this discussion, a voice chat refers to a communication that is solely audio, meaning that the two people participating in the voice chat can hear one another, but cannot see each other. By contrast, a video chat refers to a communication that includes both audio and video of the two people participating in the video chat, meaning that the two people participating in the video chat can both hear one another and see one another. Videotelephony technologies, which provide for the reception and transmission of audio and video signals, can be used to perform video chats. Exemplary videotelephony products include FaceTime available from Apple Inc., Google Duo and Google Hangouts both available from Google LLC, Skype available from Microsoft Corp., and WeChat available from Tencent Corp., just to name a few. Indeed, there have been surveys that found that ten percent of the drivers indicated that they have used their smartphone to video chat while driving. That percentage is likely to increase in the future, especially as semi-autonomous and fully-autonomous vehicles become more common.
[0003] Road rage, which is aggressive or angry behavior exhibited by a driver of a vehicle, is very common. Indeed, surveys have found that a significant majority of drivers have expressed significant anger while driving in the past year. Road rage can lead to numerous types of direct adverse effects. For example, for the driver of a vehicle and their passenger(s), road rage can lead to altercations, assaults and collisions that result in serious physical injuries or even death. Road rage can also lead to certain non-direct adverse effects. For example, if a first person driving a first vehicle is participating in a video chat with a second person driving a second vehicle, during which the first driver experiences road rage, the anger of the first person may be passed onto the second person and/or otherwise distract the second person, which may increase the chances of the second person being involved in a collision. For another example, if a first person driving a first vehicle is participating in a business related video chat with one or more other persons, during which the first person experiences road rage, the business relations between the first person and the one or more other persons can be ruined or otherwise adversely affected.
BRIEF SUMMARY
[0004] According to one aspect of the present disclosure, a method includes obtaining a video signal and an audio signal of a first person participating in a video chat with a second person, determining one or more types of perceived emotions of the first person based on the video signal, and determining a semantic emotion of the first person based on the audio signal. The method also includes altering the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
[0005] Optionally, in any of the preceding aspects, the determining the one or more types of perceived emotions of the first person based on the video signal includes detecting at least one of a facial expression or a body pose of the first person based on the video signal, and determining at least one of a facial expression perceived emotion or a body pose perceive emotion of the first person based on the at least one of the facial expression or the body pose of the first person. [0006] Optionally, in any of the preceding aspects, the determining the one or more types of perceived emotions of the first person is also based on the audio signal and includes performing audio signal processing of the audio signal to determine at least one of pitch, vibrato or inflection of speech of the first person, and determining a speech perceived emotion of the first person based on results of the audio signal processing of the audio signal. Such a method can further comprise altering the audio signal to increase alignment between the speech perceived emotion of the first person and the semantic emotion of the first person.
[0007] Optionally, in any of the preceding aspects, the altering the video signal to produce the altered video signal includes modifying image data of the video signal corresponding at least one of a facial expression or a body pose, and the altering the audio signal to produce the altered audio signal includes modifying audio data of the video signal corresponding at least one of the pitch, vibrato, or inflection.
[0008] Optionally, in any of the preceding aspects, the method further comprises providing the altered video signal and the altered audio signal to a subsystem associated with the second person that is participating in the video chat, to thereby enable the second person to view and listen to images and audio of the first person having increased alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
[0009] Optionally, in any of the preceding aspects, the determining the semantic emotion of the first person based on the audio signal includes performing natural language processing of the audio signal, and determining the semantic emotion of the first person based on results of the natural language processing of the audio signal.
[0010] Optionally, in any of the preceding aspects, the determining one or more types of perceived emotions of the first person based on the video signal includes at least one of using a facial circumplex model to quantify a positiveness and an activeness of a facial expression of the first person based on the video signal, or using a pose circumplex model to quantify a positiveness and an activeness of a body pose of the first person based on the video signal. Further, the determining the semantic emotion of the first person based on the audio signal includes using a language circumplex model to quantify a positiveness and an activeness of language of the first person based on the audio signal. Additionally, the altering the video signal to produce the altered video signal includes at least one of altering image data of the video signal to reduce a distance between the positiveness and the activeness of the facial expression of the first person and the positiveness and the activeness of the language of the first person, or altering image data of the video signal to reduce a distance between the positiveness and the activeness of the body pose of the first person and the positiveness and the activeness of the language of the first person.
[0011] Optionally, in any of the preceding aspects, the determining the one or more types of perceived emotions of the first person is also based on the audio signal and includes using a speech circumplex model to quantify a positiveness and an activeness of speech of the first person based on the audio signal. The method can further comprise altering audio data of the audio signal to produce an altered audio signal to reduce a distance between the positiveness and activeness of the speech of the first person and the positiveness and the activeness of the language of the first person.
[0012] Optionally, in any of the preceding aspects, the method further comprises providing the altered video signal and the altered audio signal to a subsystem associated with the second person that is participating in the video chat, to thereby enable the second person to view and listen to images and audio of the first person that have increased alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
[0013] According to one other aspect of the present disclosure, a subsystem comprises one or more interfaces and one or more processors. The one or more interfaces are configured to receive a video signal and an audio signal of a first person participating in a video chat with a second person. The one or more processors are communicatively coupled to the one or more interfaces and are configured to determine one or more types of perceived emotions of the first person based on the video signal, and determine a semantic emotion of the first person based on the audio signal. The one or more processors are also configured to alter the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person. The subsystem can also include one or more cameras configured to obtain the video signal, and one or more microphones configured to obtain the audio signal. [0014] Optionally, in any of the preceding aspects, the one or more processors implement one or more neural networks that are configured to determine the perceived emotion of the first person based on the video signal and determine the semantic emotion of the first person based on the audio signal.
[0015] Optionally, in any of the preceding aspects, the one or more processors implement one or more neural networks that are configured to alter the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
[0016] Optionally, in any of the preceding aspects, in order to determine the one or more types of perceived emotions of the first person based on the video signal, the one or more processors are configured to detect at least one of a facial expression or a body pose of the first person based on the video signal, and determine at least one of a facial expression perceived emotion or a body pose perceived emotion of the first person based on the at least one of the facial expression or the body pose of the first person.
[0017] Optionally, in any of the preceding aspects, the one or more processors are also configured to perform audio signal processing of the audio signal to determine at least one of pitch, vibrato or inflection of speech of the first person, and determine a speech perceived emotion of the first person based on results of the audio signal processing, and alter the audio signal to increase alignment between the speech perceived emotion of the first person and the semantic emotion of the first person.
[0018] Optionally, in any of the preceding aspects, the one or more processors are configured to modify image data of the video signal corresponding at least one of a facial expression or a body pose, to thereby alter the video signal to produce the altered video signal; and modify audio data of the audio signal corresponding at least one of the pitch, vibrato, or inflection, to thereby alter the audio signal to produce the altered audio signal.
[0019] Optionally, in any of the preceding aspects, the one or more processors are configured to perform natural language processing of the audio signal, and determine the semantic emotion of the first person based on results of the natural language processing of the audio signal. [0020] Optionally, in any of the preceding aspects, the one or more processors are configured to use a facial circumplex model to quantify a positiveness and an activeness of a facial expression of the first person based on the video signal, use a pose circumplex model to quantify a positiveness and an activeness of a body pose of the first person based on the video signal, and use a language circumplex model to quantify a positiveness and an activeness of language of the first person based on the audio signal. Additionally, the one or more processors are configured to alter image data of the video signal to reduce a distance between the positiveness and the activeness of the facial expression of the first person and the positiveness and the activeness of the language of the first person, and reduce a distance between the positiveness and the activeness of the body pose of the first person and the positiveness and the activeness of the language of the first person.
[0021] Optionally, in any of the preceding aspects, the one or more processors are also configured to use a speech circumplex model to quantify a positiveness and an activeness of speech of the first person based on the audio signal, and alter audio data of the audio signal to reduce a distance between the positiveness and activeness of the speech of the first person and the positiveness and the activeness of the language of the first person.
[0022] Optionally, in any of the preceding aspects, the subsystem comprises a transmitter configured to transmit the altered video signal and the altered audio signal to a subsystem associated with the second person that is participating in the video chat, to thereby enable the second person to view and listen to video and audio of the first person having increased alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
[0023] According to one other aspect of the present disclosure, a non-transitory computer-readable medium storing computer instructions that when executed by one or more processors cause the one or more processors to perform the steps of: obtaining a video signal and an audio signal of a first person participating in a video chat with a second person; determining one or more types of perceived emotions of the first person based on the video signal; determining a semantic emotion of the first person based on the audio signal; and altering the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person. The non-transitory computer-readable medium can also store computer instructions that when executed by one or more processors cause the one or more processors to perform additional steps of the methods summarized above, and described in additional detail below.
[0024] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the Background.
BRIEF DESCRIPTION OF THE DRAWINGS
[0025] Aspects of the present disclosure are illustrated by way of example and are not limited by the accompanying figures for which like references indicate like elements.
[0026] FIG. 1 illustrates an exemplary system that enables first and second persons to participate in a video chat.
[0027] FIGS. 2, 3, and 4 illustrate systems according to various embodiments of the present technology that enable first and second persons to participate in a video chat, and which also modify the audio and video of at least the first person so that the audio and video of the first person that are heard and seen by the second person differ from actual audio and video of the first person.
[0028] FIG. 5 illustrates a modification subsystem, according to an embodiment of the present technology, which can be used to modify audio and video signals of a person participating in a video chat.
[0029] FIG. 6 illustrates additional details of an emotion detector and an emotion modifier of the modification subsystem introduced in FIG. 5.
[0030] FIG. 7A illustrates a general circumplex model.
[0031] FIG. 7B illustrates a facial circumplex model.
[0032] FIG. 7C illustrates a pose circumplex model.
[0033] FIG. 7D illustrates a speech circumplex model. [0034] FIG. 8 illustrates how different types of perceived emotions and a semantic emotion can be mapped to a circumplex model, how distances between perceived emotions and the semantic emotion can be determined, and how such distances can be reduced to increase alignment between the different types of perceived emotions and a semantic emotion.
[0035] FIG. 9 illustrates a high level flow diagram that explains how distances between perceived emotions and a semantic emotion can be used to determine whether to modify certain features of video and audio signals to increase the alignment between the perceived emotions and the semantic emotion.
[0036] FIG. 10 illustrates a high level flow diagram that is used to summarize methods according to certain embodiments of the present technology.
[0037] FIG. 11 illustrates exemplary components of an exemplary mobile computing device with which embodiments of the present technology can be used.
DETAILED DESCRIPTION
[0038] Certain embodiments of the present technology alter video and audio signals of a first person participating in a video chat with a second person, such that when the altered signals are played for the second person, what is seen and heard by the second person differs from the originally captured video and audio signals. Where the first and second persons are participating in a video chat while both of them are driving vehicles, such embodiments of the present technology can prevent the anger of the first person from being passed onto the second person, in the event that the first person experiences road rage while participating in the video chat. Where a first person driving a vehicle is participating in a business related video chat with one or more other persons, such embodiments of the present technology can prevent the anger of the first person from being witnessed by the other person(s), thereby avoiding the business relations between the first person and the one or more other persons from being ruined or otherwise adversely affected. In accordance with certain embodiments, described in more detail below, one or more types of perceived emotions of the first person can be determined based on a video signal (and potentially also an audio signal) of the first person, and a semantic emotion of the first person can be determined based on the audio signal of the first person. Video and audio signals of the first person can then be modified (also referred to as altered) such that the resulting modified video and audio of the first person is more aligned with the semantic emotion of the first person than the perceived emotion(s) of the first person. More specifically, video and audio signals are modified to reduce differences between one or more types of perceived emotions of a person and the semantic emotion of the person.
[0039] A perceived emotion, as the term is used herein, generally relates to a first person’s emotional state that a second person becomes aware of using the second person’s senses, e.g., using the sight and hearing of the second person. By contrast, semantic emotion, as the term is used herein, generally relates to a first person’s emotional state that a second person becomes aware of using the second person’s understanding of the verbal language (also referred to as spoken language, or more succinctly as language) that is spoken by first person. There are many times where perceived emotions can be substantially aligned with semantic emotion, e.g., if a first person is smiling and has positive body language while saying that they are having a wonderful day during a conversation with a second person. However, there are other times that perceived emotions and semantic emotion are significantly unaligned, e.g., if a first person is frowning and has negative body language (such as, looking down and crossing their arms) while saying that they are having a wonderful day during a conversation with a second person. In accordance with certain embodiments of the present technology, if a first person is frowning and has negative body language (such as, looking down and crossing their arms) while saying that they are having a wonderful day during a video chat with a second person, the video of the first person is altered such that when the video is played for the second person, the body language of the first person has been changed from negative body language to positive body language, so that the first person’s body language is more aligned with the positive spoken language that they used. Additionally, the audio of the first person may also be altered, e.g., to change pitch, vibrato and/or inflection of the first person’s voice to be more aligned with the positive spoken language that they used.
[0040] FIG. 1 illustrates an exemplary system that enables first and second persons to participate in a video chat. In FIG. 1 , blocks 1 10A and 1 10B are representative of first and second persons that are participating in a video chat using respective client computing devices, which are also referred to herein more generally as audio-video (A-V) subsystems 120A and 120B. The A-V subsystems 120A and 120B can be referred to collectively as the A-V subsystems 120, or individually as an A-V subsystem 120. The first and second persons 1 10A and 1 10B can be referred to collectively as the persons 1 10, or individually as a person 1 10. The A-V subsystem 120A is capable of obtaining a video signal and an audio signal of the first person 1 10A, and the A-V subsystem 120B is capable of obtaining a video signal and an audio signal of the second person 1 10B. Accordingly, each of the A-V subsystems 120 can include at least one microphone that is used to obtain an audio signal, and at least one camera that is used to obtain a video signal. At least one camera can be an RGB/NIR (Near Infrared) camera including an image sensor (e.g., CMOS image sensor) that can be used to capture multiple two dimensional RGB/NIR images per second (e.g., 30 images per second). At least one further camera can be a depth camera that produces depth images, rather than RGB/NIR images, e.g., using structured light and/or time-of-flight (TOF) sensors to recreate a 3D structure on point clouds, or the like.
[0041] Further, the A-V subsystem 120A can be capable playing video and audio of a second person (e.g., 120B) for the first person 1 10A, and the A-V subsystem 120B can be capable playing video and audio of a first person (e.g., 120A) for the second person 1 10B. Accordingly, each of the A-V subsystems 120 can include at least one audio-speaker that is used to output audible sounds, and at least one display that used to display video images. One or both of the A-V subsystems 120A and 120B can be an in-cabin computer system or mobile computing devices, such as, but not limited to, smartphones, tablet computers, notebook computers, laptop computers, or the like. It is also possible that one or both of the audio-video subsystems 120A and 120B, or portions thereof, can include a microphone, camera, audio speaker, and/or display, that is build into a vehicle, e.g., as part of a vehicles entertainment system.
[0042] When the first and second persons 110A and 110B are participating in a video chat using their respective A-V subsystems 120A and 120B, at least one microphone of the A-V subsystem 120A obtains an audio signal of the first person 110A, and at least one camera of the A-V subsystem 120A obtains a video signal of the first person 110A. Similarly, at least one microphone of the A-V subsystem 120B obtains an audio signal of the second person 110B, and at least one camera of the A-V subsystem 120B obtains a video signal of the second person 110B. The audio and video signals of the first person 110A, which are obtained by the A-V subsystem 120A, are sent via one or more communication networks 130 to the A-V subsystem 120B. Similarly, the audio and video signals of the second person 110B, which are obtained by the A-V subsystem 120B, are sent via one or more communication networks 130 to the A-V subsystem 120A.
[0043] The communication network(s) 130 can be any wired or wireless local area network (LAN) and/or wide area network (WAN), such as an intranet, an extranet, or the Internet, or a combination thereof, but are not limited thereto. It is sufficient that the communication network 130 provides communication capability between the A-V subsystems 120, and optional other devices and systems. In some implementations, the communication network(s) 130 use the HyperText Transport Protocol (HTTP) to transport information using the Transmission Control Protocol/Internet Protocol (TCP/IP). HTTP permits A-V subsystems 120 to access various resources available via the communication network(s) 130. The various implementations described herein, however, are not limited to the use of any particular protocol.
[0044] At least one audio-speaker of the A-V subsystem 120A uses the audio signal of the second person 1 10B to output audible sounds of (e.g., words spoken by) the second person 1 10B, which can be listened to by the first person 1 10A. At least one display of the A-V subsystem 120A uses the video signal of the second person 110B to display video images of the second person 1 10B, which can be viewed by the first person 1 10A. Similarly, at least one audio-speaker of the A-V subsystem 120B uses the audio signal of the first person 1 10A to output audible sounds of (e.g. , words spoken by) the first person 1 10A, which can be listened to by the second person 1 10B. At least one display of the A-V subsystem 120B uses the video signal of the first person 110A to display video images of the first person 1 10A, which can be viewed by the second person 1 10B.
[0045] Conventionally, non-modified versions of the audio and video signals of the first person 1 10A (which were obtained by the A-V subsystem 120A) are used to output and display audio and video of the first person 1 10A to the second person 1 10B (using the A-V subsystem 120B which is in proximity to the second person 1 10B). Thus, if the first person 1 10A has an angry facial expression (e.g., a furrowed brow), an angry body pose (e.g., a clenched upright fist), and an angry (e.g., high) tone of voice when participating in a video chat with the second person 1 10B, the second person 1 10B will see the angry facial expression and angry body pose of the first person 1 10A and will hear the angry tone of the first person 1 10A. It is noted that the term body pose, as used herein, also encompasses hand pose.
[0046] In accordance with certain embodiments of the present technology, the audio and video signals of the first person 1 10A are modified prior to being provided to the A-V subsystem 120B, which results in audio and video of the first person 1 10A that is listened to and seen by the second person 1 10B, being different than what the first person 1 10A actually looked like and sounded like. Such modifications to the audio and video signals of the first person 1 10A can be performed by the same A-V subsystem that obtains that audio and video signals. More specifically, as shown in FIG. 2, an A-V and modification subsystem 220A can obtain audio and video signals of the first person and modify such signals before providing such signals to the communication network(s) 130 that provide the modified audio and video signals to the A-V subsystem 120B in proximity to the second person 1 10B. Alternatively, such modifications to the audio and video signals of the first person 1 10A can be performed by a further subsystem that is separate from the A-V subsystem 120A that obtains that audio and video signals of the first person 1 10A. For example, as shown in FIG. 3, a modification subsystem 320A can receive audio and video signals of the first person 110A, and the modification subsystem 320A can modify such signals before providing such signals to the communication network(s) 130 that provide the modified audio and video signals to the A-V subsystem 120B in proximity to the second person 1 10B. Another option is for the A-V subsystem 120A (that obtains that audio and video signals of the first person 1 10A) to provide audio and video signals of the first person 110A via one or more communication networks 130 to a modification subsystem 420A, and then after the modification subsystem 420 modifies such signals, the modification subsystem 420 can provide the modified audio and video signals of the first person 1 10A via the communication network(s) 130 to the A-V subsystem 120B in proximity with the second person 1 10B. Other variations are also possible and within the scope of the embodiments described herein. While not shown in the FIGS. 1-4, video and audio signals of the second person 1 10B can also be provided to a similar modification subsystem to modify such signals so that perceived emotions of the second person appear more aligned with the semantic emotion of the second person 1 10B. [0047] The audio and video signals of the first person 1 10A, which are captured or otherwise obtained by the AV subsystem 120A (in FIGS. 1 , 3 and 4) or by the A-V and modification subsystem 220A (in FIG. 2), can also be referred to as captured audio and video signals of the first person 1 10A. FIG. 5 shows a modification subsystem 520 that receives captured audio and video signals from an AV subsystem 120A, or is part of an AV and modification subsystem 220A. As shown in FIG. 5, the modification subsystem 520 includes an emotion detection block 530 (which can also be referred to as an emotion detector 530) and an emotion modification block 540 (which can also be referred to as an emotion modifier 540). The emotion detector 530 can, for example, detect negative, positive, and/or neutral emotions of the first person 1 10A. Exemplary negative emotions include, but are not limited to, angry, nervous, distracted, and frustrated. 
The emotion modifier 540 can, for example, modify audio and video signals so that one or more types of perceived emotions of the first person 110A in the modified audio and video signals are neutral or positive emotions. Exemplary neutral or positive emotions include, but are not limited to, happy, calm, alert, and pleased. Additional details of the emotion detector 530 and emotion modifier 540, according to specific embodiments of the present technology, are discussed below with reference to FIG. 6.
[0048] Referring to FIG. 6, the emotion detector 530 is shown as including a facial detection block 610 (also referred to as a facial detector 610) and a facial expression recognition block 612 (also referred to as a facial expression recognizer 612). The emotion detector 530 is also shown as including a skeletal detection block 614 (also referred to as a skeletal detector 614) and a pose recognition block 616 (also referred to as a pose recognizer 616). As shown in FIG. 6, the facial detector 610 and the skeletal detector 614 are shown as receiving a video signal 602. The video signal 602 can be, e.g., a video signal of the first person 1 10A captured by the AV subsystem 120A, and more specifically, one or more cameras thereof. Still referring to FIG. 6, the emotion detector 530 is also shown as including an audio signal processing block 624 (also referred to as an audio signal processor 624 or an audio signal analyzer 624) and a natural language processing block 626 (also referred to as a natural language processor 626 or a natural language analyzer 626). As shown in FIG. 6, the audio signal analyzer 624 and the natural language analyzer 626 receive an audio signal 622. The audio signal 622 can be, e.g., an audio signal of the first person 1 10A capture by the AV subsystem 120A, or more specifically, a microphone thereof. The video signal 602 and the audio signal 622 are presumed to be digital signals, unless specifically stated otherwise. Interfaces 603 and 623 can receive the video signal 602 and the audio signal 622, e.g., respectively from a camera and a microphone, or from one or more other subsystems.
[0049] In accordance with certain embodiments, the facial detector 610 can detect a person’s face within an image, and can also detect facial features within the image. Already developed (or future developed) computer vision techniques can be used by the facial detector 610 to detect such facial features. For an example, the HSV (Hue- Saturation-Value) color model or some other computer vision technique can be used to detect a face within an image. A feature detection model or some other computer vision technique can be used to identify facial features, such as, but not limited to, eyes, nose, lips, chin, cheeks, eyebrows, forehead, and/or the like. Feature detection can also be used to detect wrinkles in specific facial regions, such as on the forehead, on the sides of the mouth, and/or around the eyes. In certain embodiments, a person’s face and their facial features can be identified using bounding boxes. Some features to be identified can be contained within other features, such as eyes on a user's face, in which case successive bounding boxes may be used to first identify the containing feature (e.g., a face) and then to identify the contained feature (e.g., each eye of a pair of eyes). In other embodiments, a single bounding box may be used to identify each distinct feature. In certain embodiments, one or more algorithm libraries, such as the OpenCV (http://opencv.willowgarage.com/wiki/) computer vision library and/or the Dlib algorithm library (http://dlib.net/), can be used to identify these facial features and to generate bounding boxes. In certain embodiments, a bounding box need not be rectangular, but rather, can be another shape, such as but not limited an elliptical. In certain embodiments, a machine learning technique, such as boosting, may be used to increase a confidence level in the detection of facial features (e.g., eyes, nose, lips, etc.). More generally, data sets can be used to train a deep neural network (DNN) and/or other computer model to detect facial features from images, and the trained DNN and/or other computer model can be thereafter used for facial feature recognition.
[0050] Once facial features have been identified (also referred to as detected) by the facial detector 610, the facial expression recognizer 612 can determine a person’s facial expression. Generally, a human face consists of different parts such as a chin, a mouth, eyes and a nose, as noted above. Shape, structure and size of those facial features can vary with different facial expressions. Additionally, with certain facial expressions, wrinkles in specific facial locations may change. For example, the shape of person’s eyes and mouth can be employed to distinguish between different facial expressions, as can the wrinkles on a person’s forehead, and/or the like. Based at least in part on the detected facial expressions of a person, one or more types of perceived emotions of the person can be determined by the perceived emotion detector 632 in FIG. 6. Exemplary perceived emotions that may be detected based at least in part on detected facial expressions include, but are not limited to, angry, nervous, distracted, and frustrated, as well as happy, calm, alert, and pleased. Certain techniques for quantifying perceived emotions are described below.
[0051] The skeletal detector 614 can use a skeletal detection model or some other computer vision technique to identify human body parts and joints, such as, but not limited to, arms, hands, elbows, wrists, and/or the like. The pose recognizer 616 can detect specific poses, such as whether a person is holding a steering wheel of a vehicle with both of their hands while driving a vehicle, or whether the person has one of their arms raised with their hand in a fist while driving a vehicle. Data sets can be used to train a deep neural network (DNN) and/or other computer model to detect human poses from images, and the trained DNN and/or other computer model can be thereafter used for pose recognition.
[0052] Once human body parts are detected within an image by the skeletal detector 614, the pose recognizer 616 can determine a person's pose. Generally, a human body consists of different parts such as a head, neck, torso, upper arms, elbows, forearms, wrists, hands, etc. With certain poses, the overall and relative locations and orientations of such body parts may change. For example, while a person is driving a vehicle, that person will often have both of their hands on the steering wheel of the vehicle, but may raise one of their arms and make a fist if the person becomes angry, e.g., because a driver of another vehicle caused the person to stop short, swerve, and/or the like. As can be appreciated from FIG. 6, a detected pose can also be used to determine a perceived emotion of a person, as represented by the line from the pose recognizer 616 to the perceived emotion detector 632.
[0053] As noted above, in FIG. 6 the audio signal analyzer 624 and the natural language analyzer 626 receive an audio signal 622. The audio signal 622 can be, e.g., an audio signal of the first person 110A captured by the A-V subsystem 120A. The audio signal analyzer 624 can analyze the audio signal 622 to detect various features of the audio signal 622 that may vary in dependence on the emotional state of a person. Examples of such audio features include pitch, vibrato, and inflection. Pitch relates to the frequency of a signal, and thus, can be quantified as a frequency. Changes in the pitch of a person's voice are often correlated with an arousal state, or more generally an emotional state, of the person. For example, increases in pitch often correlate with highly aroused states such as anger, joy, or fear, while decreases in pitch often correlate with low arousal states such as sadness or calmness. Vibrato is a periodic modulation of the pitch (e.g., fundamental frequency) of a person's voice, occurring with a given rate and depth. Vibrato also relates to jitter, and it often correlates with changes in emotion. Increased fluctuations in pitch, and thus vibrato, can, e.g., be indicative of increases in happiness, distress, or fear. Inflection is a rapid modification of the pitch at the start of each utterance, which overshoots its target by several semitones but quickly decays to the normal value. The use of inflection leads to increased variation in pitch, which is associated with high emotional intensity and positive valence. As can be appreciated from FIG. 6, results of the audio signal analysis, performed by the audio signal analyzer 624, can also be used to determine the perceived emotion of a person, as represented by the line from the audio signal analyzer 624 to the perceived emotion detector 632. As can be appreciated from the above discussion, certain changes in a specific audio feature can be indicative of either an increase in a positive emotion (e.g., happiness) or a negative emotion (e.g., anger). For example, increases in happiness or fear can both cause an increase in pitch. However, by analyzing multiple vocal features, alone or in combination with facial expression and/or body pose, it is possible to determine relatively accurate perceived emotions of a person.
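By way of illustration only, the pitch-related audio features discussed above could be estimated from the audio signal 622 along the following lines, assuming the librosa library; using the standard deviation of the pitch track as a rough proxy for vibrato/jitter depth is an assumption made purely for illustration.

import numpy as np
import librosa

def extract_vocal_features(y, sr):
    # Frame-level fundamental frequency (pitch) track; unvoiced frames are returned as NaN.
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    voiced_f0 = f0[~np.isnan(f0)]
    if voiced_f0.size == 0:
        return {"mean_pitch_hz": 0.0, "pitch_variability_hz": 0.0}
    return {
        "mean_pitch_hz": float(np.mean(voiced_f0)),
        # Standard deviation of the pitch track, used here as a rough proxy for
        # vibrato/jitter depth (increased fluctuation suggests higher arousal).
        "pitch_variability_hz": float(np.std(voiced_f0)),
    }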
[0054] The natural language analyzer 626 performs natural language processing (NLP) of the audio signal 622, the results of which are used to determine the semantic emotion of a person, as represented by the line from the natural language analyzer 626 to the semantic emotion detector 634. The NLP that is performed by the natural language analyzer 626 can include speech recognition that provides a textual representation of a person's speech. In natural speech there are hardly any pauses between successive words, and thus speech segmentation, which involves separating a sound clip of a person's speech into multiple words, can be a subtask of speech recognition. The natural language analyzer 626 can be configured to recognize a single language, or multiple different languages, such as English, Chinese, Spanish, French, and German, just to name a few. Where the natural language analyzer 626 is capable of performing NLP for multiple different languages, an output of the natural language analyzer 626 can include an indication of the specific language that a person is speaking.
[0055] The perceived emotion detector 632 can use one or more look up tables (LUTs) to determine one or more types of perceived emotions associated with a person based on outputs of the facial expression analyzer 612, the pose recognizer 616, and the audio signal analyzer 624. The output of the facial expression analyzer 612 can specify one or more facial expression features of the person determined based on a video signal 602 of the person, the output of the pose recognizer 616 can specify one or more body poses of the person determined based on the video signal 602 of the person, and the output of the audio signal analyzer 624 can specify one or more audio features determined based on the audio signal 622. Instead of, or in addition to, using LUTs, the perceived emotion detector 632 can be implemented by one or more DNNs and/or one or more other computer models that is/are trained based on perceived emotion training data, which can include facial expression training data, body pose training data, speech training data, and/or other perceived emotion training data.
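By way of illustration only, a LUT-based implementation of the perceived emotion detector 632 might resemble the following Python sketch; the table entries and label names are hypothetical placeholders, since real tables would be derived from training or calibration data rather than hard-coded.

# Hypothetical lookup tables; entries in a real system would be derived from
# training or calibration data rather than hard-coded.
FACIAL_EXPRESSION_LUT = {"smile": "happy", "frown": "angry", "raised_brows": "alert"}
POSE_LUT = {"both_hands_on_wheel": "calm", "raised_fist": "angry"}
AUDIO_LUT = {"high_pitch_high_variability": "nervous", "low_pitch_low_variability": "calm"}

def perceived_emotions(facial_expression, body_pose, audio_descriptor):
    # Returns one perceived emotion per modality, defaulting to "neutral".
    return {
        "face": FACIAL_EXPRESSION_LUT.get(facial_expression, "neutral"),
        "pose": POSE_LUT.get(body_pose, "neutral"),
        "speech": AUDIO_LUT.get(audio_descriptor, "neutral"),
    }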
[0056] The semantic emotion detector 634 can use one or more look up tables (LUTs) to determine a semantic emotion associated with a person based on outputs of the natural language analyzer 626. The output of the natural language analyzer 626 can specify words and sentences spoken by a person as determined based on the audio signal 622, and can also indicate the language being spoken. Instead of, or in addition to, using LUTs, the semantic emotion detector 634 can be implemented by one or more DNNs and/or other computer models that is/are trained based on semantic emotion training data.
[0057] Still referring to FIG. 6, outputs of the perceived emotion detector 632 and the semantic emotion detector 634 are also shown as being provided to an emotion modification block 540, which can also be referred to as an emotion modifier 540. The emotion modifier 540 is also shown as receiving the captured video signal 602 and the captured audio signal 622. The emotion modifier 540 is shown as including a facial expression modification block 642, a pose modification block 646, and an audio modification block 648, which can also be referred to respectively as a facial expression modifier 642, a pose modifier 646, and an audio modifier 648. As noted above, the perceived emotion detector 632 can determine one or more types of perceived emotions of a person based on a detected facial expression, a detected body pose determined based on the video signal 602, and detected audio features (e.g., pitch, vibrato, and inflection) determined based on the audio signal 622. As also noted above, the semantic emotion detector 634 determines the semantic emotion of the person based on their spoken language using NLP.
[0058] In accordance with certain embodiments of the present technology, the facial expression modifier 642 modifies facial expression image data of the video signal 602 to increase an alignment between a facial expression perceived emotion of the person (as determined by the perceived emotion detector 632) and the semantic emotion of the person (as determined by the semantic emotion detector 634). In accordance with certain embodiments of the present technology, the pose modifier 646 modifies image data of the video signal 602 to increase an alignment between a body pose perceived emotion of the person (as determined by the perceived emotion detector 632) and the semantic emotion of the person (as determined by the semantic emotion detector 634). In accordance with certain embodiments of the present technology, the audio modifier 648 modifies audio data of the audio signal 622 to increase an alignment between a speech perceived emotion of the person (as determined by the perceived emotion detector 632) and the semantic emotion of the person (as determined by the semantic emotion detector 634). The emotion modifier 540 is shown as outputting a modified video signal 652 and a modified audio signal 662.
[0059] Certain embodiments of the present technology rely on a presumption that a person's emotions that are responsive to and/or caused by environmental factors are recognizably different, in a certain feature space, from a person's emotions responsive to and/or caused by conversational context, and that a difference between the emotions responsive to and/or caused by environmental factors and the emotions responsive to and/or caused by conversational context can be quantified. In accordance with certain embodiments, a feature space that is used to quantify differences between perceived and semantic emotions is the feature space defined by the arousal/valence circumplex model, which was initially developed by James Russell and published in the article entitled "A circumplex model of affect" in the Journal of Personality and Social Psychology, Vol. 39(6), Dec. 1980, pages 1161-1178. The arousal/valence circumplex model, which can also be referred to more succinctly as the circumplex model, suggests that emotions are distributed in a two-dimensional circular space, containing arousal and valence dimensions. Arousal corresponds to the vertical axis and valence corresponds to the horizontal axis, while the center of the circle corresponds to a neutral valence and a medium level of arousal. In this model, emotional states can be represented at any level of valence and arousal, or at a neutral level of one or both of these factors. James Russell and Lisa Feldman Barrett later developed a modified arousal/valence circumplex model, which they published in the article entitled "Core affect, prototypical emotional episodes, and other things called emotion: dissecting the elephant" in the Journal of Personality and Social Psychology, Vol. 76(5), May 1999, pages 805-819.
[0060] In accordance with certain embodiments of the present invention, the perceived emotion detector 632 uses one or more arousal/valence circumplex models to determine three types of perceived emotions separately, based on facial expression, body pose, and speech. More specifically, in certain embodiments a facial circumplex model is used to determine an arousal and a valence associated with the person's facial expression; a pose circumplex model is used to determine an arousal and a valence associated with the person's body pose; and a speech circumplex model is used to determine an arousal and a valence associated with the person's speech. The valence dimension is represented on a horizontal axis and ranges between positive and negative valences. The positive and negative valences (along the horizontal axis) are also known, respectively, as pleasant and unpleasant emotions, or more generally as positiveness. The arousal dimension is represented on a vertical axis, which intersects the horizontal "valence" axis, and ranges between activated and deactivated. The activated and deactivated arousals (along the vertical axis) are also known, respectively, as intense and non-intense arousals, or more generally as activeness. A general circumplex model is illustrated in FIG. 7A, a facial circumplex model is illustrated in FIG. 7B, a pose circumplex model is illustrated in FIG. 7C, and a speech circumplex model is illustrated in FIG. 7D.
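By way of illustration only, a point on one of the circumplex models described above might be represented in code as follows; the [-1, 1] range for the valence (positiveness) and arousal (activeness) axes is an assumption made for illustration.

from dataclasses import dataclass

@dataclass
class CircumplexPoint:
    valence: float   # positiveness, horizontal axis, assumed to lie in [-1, 1]
    arousal: float   # activeness, vertical axis, assumed to lie in [-1, 1]

# The center of the circle corresponds to neutral valence and a medium level of arousal.
NEUTRAL = CircumplexPoint(valence=0.0, arousal=0.0)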
[0061] In accordance with certain embodiments of the present technology, feature vectors generated from facial expression detection, pose detection, and speech detection algorithms are input to a DNN. The facial expression detection can be performed by the facial detector 610 and facial expression analyzer 612, discussed above with reference to FIG. 6. Results of the facial expression detection can be one or more facial feature vectors. The pose detection can be performed by the skeletal detector 614 and the pose recognizer 616. Results of the body pose detection can be one or more pose feature vectors. The speech detection can be performed by the audio signal analyzer 624. Results of the speech detection can be one or more speech feature vectors. In accordance with certain embodiments, the aforementioned feature vectors are concatenated together and fed into a DNN. Such a DNN can be used to implement the perceived emotion detector 632 in FIG. 6.
[0062] In accordance with certain embodiments of the present technology, outputs of the DNN, which implements the perceived emotion detector 632, are six values denoted as {aro_f, val_f, aro_p, val_p, aro_s, val_s}, where "aro" refers to arousal and "val" refers to valence, and where the subscripts f, p, and s refer respectively to facial, pose, and speech. Accordingly, there is an arousal value and a valence value indicative of a person's facial expression, an arousal value and a valence value indicative of the person's body pose, and an arousal value and a valence value indicative of the person's speech. In accordance with certain embodiments, these values are used to modify the person's facial expression, body pose, and/or speech, as will be explained in additional detail below. The terms modify and alter are used interchangeably herein.
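By way of illustration only, a DNN that accepts the concatenated facial, pose, and speech feature vectors and outputs the six values {aro_f, val_f, aro_p, val_p, aro_s, val_s} might be sketched in PyTorch as follows; the layer sizes, activation choices, and the Tanh output range are assumptions for illustration, not taken from the disclosure.

import torch
import torch.nn as nn

class PerceivedEmotionNet(nn.Module):
    # Maps the concatenated facial, pose, and speech feature vectors to six outputs:
    # (aro_f, val_f, aro_p, val_p, aro_s, val_s).
    def __init__(self, face_dim, pose_dim, speech_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(face_dim + pose_dim + speech_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 6),
            nn.Tanh(),  # keeps each arousal/valence output in [-1, 1]
        )

    def forward(self, face_vec, pose_vec, speech_vec):
        x = torch.cat([face_vec, pose_vec, speech_vec], dim=-1)
        return self.net(x)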
[0063] In accordance with certain embodiments of the present technology, in order to quantify a person's semantic emotion, a deep learning based Natural Language Processing (NLP) algorithm is applied. A primary idea is to determine the context-dependence of emotions with recognized speech. In natural language processing, textual instances are often represented as vectors in a feature space. The number of features can often be as large as hundreds of thousands, and traditionally, these features have known meanings. For example, whether the instance has a particular word observed previously in the training data, whether the word is listed as a positive/negative term in the sentiment lexicon, and so on. By using the NLP algorithm, the person's semantic emotion can be estimated. In accordance with certain embodiments, outputs of the DNN, which implements the semantic emotion detector 634, are two values denoted as {aro_sem, val_sem}, which collectively can be represented as Emo_sem. In accordance with certain embodiments, a person's semantic emotion Emo_sem is used to modify image data and audio data corresponding to a person's facial expression, body pose, and/or speech, as will be explained in additional detail below.
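By way of illustration only, a semantic emotion head that maps a text feature vector derived from the recognized speech to the pair {aro_sem, val_sem} might be sketched as follows; the feature representation, layer sizes, and output range are assumptions for illustration.

import torch.nn as nn

class SemanticEmotionNet(nn.Module):
    # Maps a text feature vector (e.g., a bag-of-words or sentence embedding of the
    # recognized speech) to the pair (aro_sem, val_sem), collectively Emo_sem.
    def __init__(self, text_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),
            nn.Tanh(),  # same [-1, 1] range assumed for the perceived emotion outputs
        )

    def forward(self, text_vec):
        return self.net(text_vec)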
[0064] Each perceived emotion Emo_perc^i denotes a point on a circumplex model, where i = {face, pose, speech}. This enables a mapping of multiple types of perceived emotions onto a circumplex model, as shown in FIG. 8. The semantic emotion Emo_sem also denotes a point on a circumplex model, which can be mapped to the same circumplex model, as also shown in FIG. 8. Referring to FIG. 8, the "X" labeled 802 corresponds to the person's perceived facial emotion, the "X" labeled 804 corresponds to the person's perceived pose emotion, and the "X" labeled 806 corresponds to the person's perceived speech emotion. The locations of the Xs 802, 804, and 806 are defined by the six values denoted as {aro_f, val_f, aro_p, val_p, aro_s, val_s}, which were discussed above. More specifically, the location of the "X" labeled 802 is defined by the values aro_f and val_f, the location of the "X" labeled 804 is defined by the values aro_p and val_p, and the location of the "X" labeled 806 is defined by the values aro_s and val_s. Still referring to FIG. 8, the dot labeled 808 corresponds to the person's semantic emotion Emo_sem. The location of the dot labeled 808 is defined by the values aro_sem and val_sem.
[0065] As noted above, activeness is a measure of arousal, and positiveness is a measure of valence. A distance dist_i between any of the perceived emotions Emo_perc^i and the semantic emotion Emo_sem can be calculated using the equation below:

dist_i = sqrt((aro_i - aro_sem)^2 + (val_i - val_sem)^2), where i = {face, pose, speech}
[0066] The distance between one of the perceived emotions Emo_perc^i and the semantic emotion Emo_sem is indicative of how closely the perceived and semantic emotions are aligned. For example, where a distance between a specific perceived emotion (e.g., body pose) and the semantic emotion is relatively small, that is indicative of the perceived emotion being substantially aligned with the semantic emotion. Conversely, where a distance between a specific perceived emotion (e.g., body pose) and the semantic emotion is relatively large, that is indicative of the perceived emotion being substantially unaligned with the semantic emotion. In accordance with certain embodiments, for each type of determined perceived emotion, a distance between the perceived emotion and the semantic emotion is determined. This will result in three distance values being determined, one for each of facial expression, body pose, and speech. Where a determined distance exceeds a specified distance threshold, it will be determined that the perceived emotion is substantially unaligned with the semantic emotion, and in response to that determination, the respective feature (e.g., facial expression, body pose, or speech) is modified to increase the alignment between the perceived emotion and the semantic emotion. For a more specific example, where a determined distance between the facial perceived emotion (represented by the "X" labeled 802 in FIG. 8) and the semantic emotion (represented by the dot labeled 808 in FIG. 8) is greater than the specified distance threshold, then facial image data of the video signal is modified to produce a modified video signal in which the facial perceived emotion is more aligned with the semantic emotion. Conversely, if the determined distance between the facial perceived emotion and the semantic emotion is less than (also referred to as within) the specified distance threshold, then facial image data of the video signal is not modified. This determining of a respective distance and comparing of the determined distance to a distance threshold is also performed for body pose as well as for speech. Results of such comparisons are used to determine whether or not to modify body pose data of the video signal, and/or speech data of the audio signal.
[0067] The aforementioned distance determinations and comparisons are summarized in the flow diagram of FIG. 9. Referring to FIG. 9, at step 902 a distance between one of the perceived emotions (facial, pose, speech) and the semantic emotion is determined, and more specifically calculated, e.g., using the equation discussed above. At step 904 the calculated distance is compared to the distance threshold, and at step 906 there is a determination of whether the calculated distance is within (i.e., less than) the distance threshold. If the calculated distance is not within the distance threshold (i.e., if the answer to the determination at step 906 is No), then flow goes to step 908, and the relevant signal or portion thereof is modified at step 908, before flow proceeds to step 910. If the calculated distance is within the distance threshold (i.e., if the answer to the determination at step 906 is Yes), then flow goes to step 910 without any modification to the relevant signal or portion thereof. The above summarized steps can be performed for each of the different types of perceived emotions, including facial, pose, and speech.
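By way of illustration only, the distance calculation and threshold comparisons of FIG. 9 might be implemented along the following lines; the 0.5 default threshold is an assumed value, not taken from the disclosure.

import math

def circumplex_distance(perceived, semantic):
    # Euclidean distance in the arousal/valence plane, per the equation above.
    return math.sqrt((perceived["aro"] - semantic["aro"]) ** 2 +
                     (perceived["val"] - semantic["val"]) ** 2)

def signals_to_modify(perceived_by_type, semantic, threshold=0.5):
    # perceived_by_type: {"face": {"aro": ..., "val": ...}, "pose": {...}, "speech": {...}}
    # Returns the set of features whose perceived emotion is substantially unaligned
    # with the semantic emotion (FIG. 9, steps 902-908).
    return {kind for kind, emo in perceived_by_type.items()
            if circumplex_distance(emo, semantic) > threshold}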
[0068] In accordance with certain embodiments of the present technology, video and audio of a first person is modified by generating synthetic image/audio to replace the original version thereof. More specifically, the originally obtained video and audio signals of a first person are modified to produce modified video and audio signals that, when viewed and listened to by a second person (or multiple other persons), have a perceived emotion that is more aligned with the semantic emotion of the first person. In accordance with certain embodiments, the perceived emotion of the generated image/audio should approach the semantic emotion as closely as possible.
[0069] Referring back to FIG. 6, the emotion modifier 540, which can also be referred to more specifically as the perceived emotion modifier 540, is shown as including a facial expression modifier 642, a pose modifier 646, and an audio modifier 648. Each of the modifiers 642, 646, and 648, which can also be referred to as modules, uses algorithms to generate synthetic images or synthetic audio by modifying specific data within the captured audio and video signals. For a simple example, assume it is determined that a first person's semantic emotion is happy, but their facial perceived emotion is nervous, their pose perceived emotion is upset, and their speech perceived emotion is stressed. Using embodiments of the present technology, facial image data is modified so that the person's facial expression is happy (rather than nervous), pose image data is modified so that the person's body pose is happy (rather than upset), and audio data is modified so that the person's speech is happy (rather than stressed). Such modifications should be done in real or near real time so that there is no discernible lag in the video chat.
[0070] As noted above, one or more DNNs and/or other computer models can be used to modify the captured video and audio signals to produce the modified video and audio signals. In accordance with specific embodiments, a generative adversarial network (GAN) is used to perform such modifications. A GAN is a deep neural network architecture that includes two neural networks, namely a generative neural network and a discriminative neural network, that are pitted one against the other in a contest (thus the use of the term "adversarial"). The generative neural network and the discriminative neural network can thus be considered sub-networks of the GAN neural network. The generative neural network generates candidates while the discriminative neural network evaluates them. The contest operates in terms of data distributions. The generative neural network can learn to map from a latent space to a data distribution of interest, while the discriminative neural network can distinguish candidates produced by the generative neural network from the true data distribution. The generative neural network's training objective can be to increase the error rate of the discriminative neural network (i.e., to "fool" the discriminative neural network by producing novel candidates that the discriminative neural network thinks are not synthesized, i.e., are part of the true data distribution). A known dataset serves as the initial training data for the discriminative neural network. Training the discriminative neural network can involve presenting it with samples from the training dataset, until it achieves acceptable accuracy. The generative neural network can be trained based on whether it succeeds in fooling the discriminative neural network. The generative neural network can be seeded with a randomized input that is sampled from a predefined latent space (e.g., a multivariate normal distribution). Thereafter, candidates synthesized by the generative neural network can be evaluated by the discriminative neural network. Backpropagation can be applied in both networks so that the generative neural network produces better images, while the discriminative neural network becomes more skilled at flagging synthetic images. The generative neural network can be, e.g., a deconvolutional neural network, and the discriminative neural network can be, e.g., a convolutional neural network. The GAN should be trained prior to it being used to modify signals during a video chat.
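By way of illustration only, a single adversarial training step of the kind described above might be sketched in PyTorch as follows; the generator and discriminator architectures, the binary cross-entropy losses, and the assumption that the discriminator outputs one logit per sample are illustrative choices, not taken from the disclosure.

import torch
import torch.nn as nn

def gan_training_step(generator, discriminator, real_batch, latent_dim, g_opt, d_opt):
    # One adversarial update; the networks themselves (e.g., a deconvolutional
    # generator and a convolutional discriminator) are assumed to be defined elsewhere.
    bce = nn.BCEWithLogitsLoss()
    batch_size = real_batch.size(0)
    ones = torch.ones(batch_size, 1)
    zeros = torch.zeros(batch_size, 1)

    noise = torch.randn(batch_size, latent_dim)   # sample from the predefined latent space
    fake_batch = generator(noise)

    # Discriminator step: real samples labeled 1, synthesized samples labeled 0.
    d_opt.zero_grad()
    d_loss = (bce(discriminator(real_batch), ones) +
              bce(discriminator(fake_batch.detach()), zeros))
    d_loss.backward()
    d_opt.step()

    # Generator step: try to make the discriminator label synthesized samples as real.
    g_opt.zero_grad()
    g_loss = bce(discriminator(fake_batch), ones)
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()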
[0071] Referring back to FIG. 6, the facial expression modifier 642 can be implemented by a GAN. More specifically, a GAN can be used to modify image data in a video signal to produce a modified video signal that can be used to display realistic images of a person where the images have been modified to make the person's facial and pose perceived emotions more aligned with the person's semantic emotion. A GAN can also be used to modify an audio signal so that a person's speech perceived emotion is more aligned with the person's semantic emotion. In specific embodiments, a StarGAN can be used to perform image and/or audio modifications. An article titled "StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation," by Y. Choi et al., CVPR, 2018, discusses how StarGANs have been used to modify the facial expressions of people in a realistic manner. The use of additional and/or alternative types of neural networks and/or other types of computer models is also within the scope of the embodiments described herein.
[0072] Still referring to FIG. 6, a GAN can also be used to implement the pose modifier 646. Alternatively, a pretrained visual generator model can be used to implement the pose modifier 646. As shown in FIG. 6, the original video signal 602 is provided to the skeletal detector 614. The original video signal 602 can also be referred to as the original image stream 602. The skeletal detector 614 extracts skeletal information from the original image stream. The skeletal information can be represented as a vector X, which stores all of the joint positions in a frame. In accordance with an embodiment, the vector X is combined with a semantic emotion signal, which is represented with a vector e. These two vectors can be concatenated into a combined vector and used as input for a pretrained visual generator model. The pretrained visual generator model can be implemented, e.g., with convolution layers, maxpooling layers, deconvolution layers, and batch normalization layers, but is not limited thereto. The output of the pretrained visual generator model can be used to generate the modified video signal 652 which includes the modified body pose, which is more aligned with the semantic emotion.
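By way of illustration only, the concatenation of the skeletal vector X with the semantic emotion vector e and its use as conditioning input to a pretrained visual generator might be sketched as follows; the visual_generator interface and the tensor shapes are assumptions for illustration.

import torch

def modify_pose_frame(joint_positions, semantic_emotion, visual_generator):
    # joint_positions: the vector X storing all joint positions in a frame.
    # semantic_emotion: the vector e, e.g., (aro_sem, val_sem).
    # visual_generator: a pretrained generator model, assumed to have been trained offline.
    x = torch.as_tensor(joint_positions, dtype=torch.float32).flatten()
    e = torch.as_tensor(semantic_emotion, dtype=torch.float32).flatten()
    conditioning = torch.cat([x, e], dim=0)          # concatenated input vector
    with torch.no_grad():
        # The generator synthesizes an image frame with the modified body pose,
        # which contributes to the modified video signal 652.
        return visual_generator(conditioning.unsqueeze(0))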
[0073] Still referring to FIG. 6, the original audio signal 622 is shown as being provided to the audio signal analyzer 624, as was already noted above. To make the perceived emotion corresponding to a person's speech more aligned with their semantic emotion, features of the audio signal that can be modified include the pitch, vibrato, and inflection, but are not limited thereto. More specifically, pitch can be shifted, where pitch-shift denotes the multiplication of the pitch of the original voice signal by a factor a. Increased pitch (a > 1) often correlates with highly aroused states such as happiness, while decreased pitch (a < 1) correlates with low valence, such as sadness. Vibrato is a periodic modulation of the pitch (fundamental frequency) of the voice, occurring with a given rate and depth. Vibrato, also related to jitter, is frequently reported as a correlate of high arousal and is an important marker of emotion even in single vowels. Vibrato can be modified to alter the perceived emotion corresponding to speech. Inflection is a rapid modification (e.g., approximately 500 ms) of the pitch at the start of each utterance, which overshoots its target by several semitones but quickly decays to the normal value. The use of inflection leads to increased variation in pitch, which is associated with high emotional intensity and positive valence. Inflection can also be modified to alter the perceived emotion corresponding to speech. An audio signal can also be filtered to alter the perceived emotion corresponding to speech, where filtering denotes the process of emphasizing or attenuating the energy contributions of certain areas of the frequency spectrum. For instance, high arousal emotions tend to be associated with increased high frequency energy, making the voice sound sharper and brighter. Where a person's semantic emotion corresponds to a less activated arousal than the person's perceived emotion corresponding to speech, high frequency energy within the audio signal can be attenuated using filtering, to make the perceived emotion more aligned with the semantic emotion. The emotional tone of the modified audio signal should be recognizable, and the voice should sound natural and not be perceived as synthetic. As noted above, the terms modify and alter are used interchangeably herein.
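By way of illustration only, pitch shifting and high-frequency attenuation of the audio signal 622 might be performed along the following lines, assuming the librosa and SciPy libraries; the -2 semitone shift and the 3 kHz cutoff are illustrative values, not taken from the disclosure.

import librosa
from scipy.signal import butter, filtfilt

def shift_pitch(y, sr, n_semitones=-2.0):
    # Lowering the pitch (factor a < 1) nudges the perceived emotion toward lower arousal.
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_semitones)

def attenuate_high_frequencies(y, sr, cutoff_hz=3000.0):
    # Low-pass filtering attenuates high-frequency energy so the voice sounds less sharp.
    b, a = butter(4, cutoff_hz / (sr / 2.0), btype="low")
    return filtfilt(b, a, y)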
[0074] One or more processors can be used to implement the above described neural networks. Where multiple processors are used, they can be collocated or widely distributed, or combinations thereof.
[0075] The high level flow diagram of FIG. 10 will now be used to summarize methods according to certain embodiments of the present technology. Referring to FIG. 10, step 1002 involves obtaining a video signal and an audio signal of a first person participating in a video chat with a second person. Referring back to FIGS. 1-4, step 1002 can be performed by an A-V subsystem (e.g., 120A), or more specifically, one or more cameras and one or more microphones of the A-V subsystem, or some other subsystem or system.
[0076] Referring again to FIG. 10, step 1004 involves determining one or more types of perceived emotions of the first person based on the video signal. Step 1006 involves determining a semantic emotion of the first person based on the audio signal. The types of perceived emotions that can be determined at step 1004 include a facial expression perceived emotion, a body pose perceived emotion, and a speech perceived emotion, as was described above.
[0077] Referring briefly back to FIGS. 5 and 6, the various types of perceived emotions can be determined, e.g., by the emotion detector 530, or more specifically, by the perceived emotion detector 632 thereof. More specifically, a facial expression and a body pose of the first person can be determined based on the video signal obtained at step 1002, and a facial expression perceived emotion and a body pose perceived emotion of the first person can be determined based thereon. Additionally, audio signal processing of the audio signal obtained at step 1002 can be performed to determine at least one of pitch, vibrato or inflection of speech of the first person, and a speech perceived emotion of the first person can be determined based on results of the audio signal processing. Additional and/or alternative variations are also possible and within the scope of the embodiments described herein.
[0078] In accordance with specific embodiments, at step 1004 a facial circumplex model is used to quantify a positiveness and an activeness of a facial expression of the first person based on the video signal; a pose circumplex model is used to quantify a positiveness and an activeness of a body pose of the first person based on the video signal; and a speech circumplex model is used to quantify a positiveness and an activeness of speech of the first person based on the audio signal.
[0079] The semantic emotion determined at step 1006 can be determined, e.g., by the emotion detector 530, or more specifically, by the semantic emotion detector 634 thereof. As explained in additional detail above, step 1006 can involve performing natural language processing of the audio signal, and determining the semantic emotion of the first person based on results of the natural language processing of the audio signal. In accordance with specific embodiments, at step 1006 the semantic emotion of the first person is determined based on the audio signal using a language circumplex model to quantify a positiveness and an activeness of language of the first person based on the audio signal.
[0080] Referring again to FIG. 10, step 1008 involves altering the video signal and the audio signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person. One or more computer implemented neural networks can be used to perform step 1008, as was described in detail above. Other types of computer implemented models can alternatively or additionally be used to perform step 1008. Step 1008 can involve modifying a facial expression and a body pose of image data included in the video signal, as well as modifying at least one of a pitch, vibrato, or inflection of audio data included in the audio signal.
[0081] In accordance with specific embodiments, step 1008 involves altering image data included in a video signal to reduce a distance between the positiveness and the activeness of the facial expression of the first person and the positiveness and the activeness of the language of the first person. Step 1008 can also involve altering image data included in the video signal to reduce a distance between the positiveness and the activeness of the body pose of the first person and the positiveness and the activeness of the language of the first person. Further, step 1008 can also involve altering audio data included in an audio signal to reduce a distance between the positiveness and activeness of the speech of the first person and the positiveness and the activeness of the language of the first person.
[0082] Still referring to FIG. 10, step 1010 involves providing (e.g., transmitting) the altered video signal and the altered audio signal to a subsystem (e.g., device) associated with (e.g., in proximity to) the second person that is participating in the video chat, to thereby enable the second person to view and listen to modified images and audio of the first person that have increased alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
[0083] The methods described above, e.g., with reference to FIGS. 9 and 10, can be performed at least in part by an in-cabin computer system or mobile computing device, such as, but not limited to, smartphones, tablet computers, notebook computers, laptop computers, or the like. The steps of such methods can be performed solely by a mobile computing device, or by a mobile computing device that communicates with one or more servers via one or more communication networks. FIG. 11 illustrates exemplary components of an exemplary mobile computing device with which embodiments of the present technology can be used. Such a mobile computing device can be used, e.g., to implement an A-V subsystem (e.g., 120A or 220A in FIGS. 1-4), but is not limited thereto.
[0084] FIG. 11 illustrates an exemplary mobile computing device 1102 with which embodiments of the present technology described herein can be used. The mobile computing device 1102 can be a smartphone, such as, but not limited to, an iPhone™, a Blackberry™, an Android™-based or a Windows™-based smartphone. The mobile computing device 1102 can alternatively be a tablet computing device, such as, but not limited to, an iPad™, an Android™-based or a Windows™-based tablet. For another example, the mobile computing device 1102 can be an iPod Touch™, or the like.
[0085] Referring to the block diagram of FIG. 11, the mobile computing device 1102 is shown as including a camera 1104, an accelerometer 1106, a magnetometer 1108, a gyroscope 1110, a microphone 1112, a display 1114 (which may or may not be a touch screen display), a processor 1116, memory 1118, a transceiver 1120, a speaker 1122 and a drive unit 1124. Each of these elements is shown as being connected to a bus 1128, which enables the various components to communicate with one another and transfer data from one element to another. It is also possible that some of the elements can communicate with one another without using the bus 1128.
[0086] The camera 1104 can be used to obtain a video signal that includes images of a person using the mobile computing device 1102. The microphone 1112 can be used to produce an audio signal indicative of what is said by a person using the mobile computing device 1102.
[0087] The accelerometer 1106 can be used to measure linear acceleration relative to a frame of reference, and thus, can be used to detect motion of the mobile computing device 1102 as well as to detect an angle of the mobile computing device 1102 relative to the horizon or ground. The magnetometer 1108 can be used as a compass to determine a direction of magnetic north and bearings relative to magnetic north. The gyroscope 1110 can be used to detect both vertical and horizontal orientation of the mobile computing device 1102, and together with the accelerometer 1106 and magnetometer 1108 can be used to obtain very accurate information about the orientation of the mobile computing device 1102. It is also possible that the mobile computing device 1102 includes additional sensor elements, such as, but not limited to, an ambient light sensor and/or a proximity sensor.
[0088] The display 1114, which may or may not be a touch screen type of display, can be used as a user interface to visually display items (e.g., images, options, instructions, etc.) to a user and accept inputs from a user. The display 1114 can also be used to enable the user of the mobile computing device 1102 to participate in a video chat. Further, the mobile computing device 1102 can include additional elements, such as keys, buttons, a track-pad, a trackball, or the like, that accept inputs from a user.
[0089] The memory 1118 can be used to store software and/or firmware that controls the mobile computing device 1102, as well as to store images captured using the camera 1104, but is not limited thereto. Various different types of memory, including non-volatile and volatile memory, can be included in the mobile computing device 1102. The drive unit 1124, e.g., a hard drive, but not limited thereto, can also be used to store software that controls the mobile computing device 1102, as well as to store images captured using the camera 1104, but is not limited thereto. The memory 1118 and the drive unit 1124 can include a machine readable medium on which is stored one or more sets of executable instructions (e.g., apps) embodying one or more of the methodologies and/or functions described herein. In place of the drive unit 1124, or in addition to the drive unit, the mobile computing device can include a solid-state storage device, such as those comprising flash memory or any form of non-volatile memory. The term "machine-readable medium" as used herein should be taken to include all forms of storage media, either as a single medium or multiple media, in all forms; e.g., a centralized or distributed database and/or associated caches and servers; one or more storage devices, such as storage drives (including, e.g., magnetic and optical drives and storage mechanisms); and one or more instances of memory devices or modules (whether main memory, cache storage either internal or external to a processor, or buffers). The term "machine-readable medium" or "computer-readable medium" shall be taken to include any tangible non-transitory medium which is capable of storing or encoding a sequence of instructions for execution by the machine and that cause the machine to perform any one of the methodologies. The term "non-transitory medium" expressly includes all forms of storage drives (optical, magnetic, etc.) and all forms of memory devices (e.g., DRAM, Flash (of all storage designs), SRAM, MRAM, phase change, etc.), as well as all other structures designed to store information of any type for later retrieval.
[0090] The transceiver 1120, which is connected to an antenna 1126, can be used to transmit and receive data wirelessly using, e.g., Wi-Fi, cellular communications or mobile satellite communications. The mobile computing device 1102 may also be able to perform wireless communications using Bluetooth and/or other wireless technologies. It is also possible that the mobile computing device 1102 includes multiple types of transceivers and/or multiple types of antennas. The transceiver 1120 can include a transmitter and a receiver.
[0091] The speaker 1122 can be used to provide auditory instructions, feedback and/or indicators to a user, playback recordings (e.g., musical recordings), as well as to enable the mobile computing device 1102 to operate as a mobile phone. The speaker 1122 can also be used to enable the user of the mobile computing device 1102 to participate in a video chat.
[0092] The processor 1116 can be used to control the various other elements of the mobile computing device 1102, e.g., under control of software and/or firmware stored in the memory 1118 and/or drive unit 1124. It is also possible that there are multiple processors 1116, e.g., a central processing unit (CPU) and a graphics processing unit (GPU). The processor(s) 1116 can execute computer instructions (stored in a non-transitory computer-readable medium) to cause the processor(s) to perform steps used to implement the embodiments of the present technology described herein.
[0093] Certain embodiments of the present technology described herein can be implemented using hardware, software, or a combination of both hardware and software. The software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein. The processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer readable storage media and communication media. Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. A computer readable medium or media does not include propagated, modulated, or transitory signals.
[0094] Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
[0095] In alternative embodiments, some or all of the software can be replaced by dedicated hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application- specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc. In one embodiment, software (stored on a storage device) implementing one or more embodiments is used to program one or more processors. The one or more processors can be in communication with one or more computer readable media/ storage devices, peripherals and/or communication interfaces.
[0096] It is understood that the present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this subject matter will be thorough and complete and will fully convey the disclosure to those skilled in the art. Indeed, the subject matter is intended to cover alternatives, modifications and equivalents of these embodiments, which are included within the scope and spirit of the subject matter as defined by the appended claims. Furthermore, in the following detailed description of the present subject matter, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. However, it will be clear to those of ordinary skill in the art that the present subject matter may be practiced without such specific details.
[0097] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
[0098] The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.
[0099] The disclosure has been described in conjunction with various embodiments. However, other variations and modifications to the disclosed embodiments can be understood and effected from a study of the drawings, the disclosure, and the appended claims, and such variations and modifications are to be interpreted as being encompassed by the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality.
[00100] For purposes of this document, it should be noted that the dimensions of the various features depicted in the figures may not necessarily be drawn to scale.
[00101] For purposes of this document, reference in the specification to "an embodiment," "one embodiment," "some embodiments," or "another embodiment" may be used to describe different embodiments or the same embodiment.
[00102] For purposes of this document, a connection may be a direct connection or an indirect connection (e.g., via one or more other parts). In some cases, when an element is referred to as being connected or coupled to another element, the element may be directly connected to the other element or indirectly connected to the other element via intervening elements. When an element is referred to as being directly connected to another element, then there are no intervening elements between the element and the other element. Two devices are "in communication" if they are directly or indirectly connected so that they can communicate electronic signals between them.
[00103] For purposes of this document, the term "based on" may be read as "based at least in part on."
[00104] For purposes of this document, without additional context, use of numerical terms such as a "first" object, a "second" object, and a "third" object may not imply an ordering of objects, but may instead be used for identification purposes to identify different objects.
[00105] The foregoing detailed description has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the subject matter claimed herein to the precise form(s) disclosed. Many modifications and variations are possible in light of the above teachings. The described embodiments were chosen in order to best explain the principles of the disclosed technology and its practical application to thereby enable others skilled in the art to best utilize the technology in various embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope be defined by the claims appended hereto.
[00106] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

What is claimed is:
1. A method comprising:
obtaining a video signal and an audio signal of a first person participating in a video chat with a second person;
determining one or more types of perceived emotions of the first person based on the video signal;
determining a semantic emotion of the first person based on the audio signal; and altering the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
2. The method of claim 1, wherein the determining the one or more types of perceived emotions of the first person based on the video signal includes:
detecting at least one of a facial expression or a body pose of the first person based on the video signal; and
determining at least one of a facial expression perceived emotion or a body pose perceived emotion of the first person based on the at least one of the facial expression or the body pose of the first person.
3. The method of claim 2, wherein:
the determining the one or more types of perceived emotions of the first person is also based on the audio signal and includes performing audio signal processing of the audio signal to determine at least one of pitch, vibrato, or inflection of speech of the first person, and determining a speech perceived emotion of the first person based on results of the audio signal processing of the audio signal; and the method further comprises altering the audio signal to increase alignment between the speech perceived emotion of the first person and the semantic emotion of the first person.
4. The method of claim 3, wherein:
the altering the video signal to produce the altered video signal includes modifying image data of the video signal corresponding to at least one of a facial expression or a body pose; and
the altering the audio signal to produce the altered audio signal includes modifying audio data of the audio signal corresponding to at least one of the pitch, vibrato, or inflection.
5. The method of claim 3 or 4, further comprising:
providing the altered video signal and the altered audio signal to a subsystem associated with the second person that is participating in the video chat, to thereby enable the second person to view and listen to images and audio of the first person having increased alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
6. The method of any one of claims 1 through 5, wherein the determining the semantic emotion of the first person based on the audio signal includes:
performing natural language processing of the audio signal; and
determining the semantic emotion of the first person based on results of the natural language processing of the audio signal.
7. The method of any one of claims 1 through 6, wherein:
the determining one or more types of perceived emotions of the first person based on the video signal includes at least one of using a facial circumplex model to quantify a positiveness and an activeness of a facial expression of the first person based on the video signal, or using a pose circumplex model to quantify a positiveness and an activeness of a body pose of the first person based on the video signal; the determining the semantic emotion of the first person based on the audio signal includes using a language circumplex model to quantify a positiveness and an activeness of language of the first person based on the audio signal; and the altering the video signal to produce the altered video signal includes at least one of
altering image data of the video signal to reduce a distance between the positiveness and the activeness of the facial expression of the first person and the positiveness and the activeness of the language of the first person, or
altering image data of the video signal to reduce a distance between the positiveness and the activeness of the body pose of the first person and the positiveness and the activeness of the language of the first person.
8. The method of claim 7, wherein:
the determining the one or more types of perceived emotions of the first person is also based on the audio signal and includes using a speech circumplex model to quantify a positiveness and an activeness of speech of the first person based on the audio signal; and
the method further comprises altering audio data of the audio signal to produce an altered audio signal to reduce a distance between the positiveness and activeness of the speech of the first person and the positiveness and the activeness of the language of the first person.
9. The method of claim 8, further comprising:
providing the altered video signal and the altered audio signal to a subsystem associated with the second person that is participating in the video chat, to thereby enable the second person to view and listen to images and audio of the first person that have increased alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
10. A subsystem comprising:
one or more interfaces configured to receive a video signal and an audio signal of a first person participating in a video chat with a second person;
one or more processors communicatively coupled to the one or more interfaces and configured to
determine one or more types of perceived emotions of the first person based on the video signal;
determine a semantic emotion of the first person based on the audio signal; and
alter the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
11. The subsystem of claim 10, further comprising:
one or more cameras configured to obtain the video signal; and
one or more microphones configured to obtain the audio signal.
12. The subsystem of claim 10 or 11, wherein the one or more processors implement one or more neural networks that are configured to determine the one or more types of perceived emotions of the first person based on the video signal and determine the semantic emotion of the first person based on the audio signal.
13. The subsystem of any one of claims 10 through 12, wherein the one or more processors implement one or more neural networks that are configured to alter the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
14. The subsystem of any one of claims 10 through 13, wherein in order to determine the one or more types of perceived emotions of the first person based on the video signal, the one or more processors are configured to:
detect at least one of a facial expression or a body pose of the first person based on the video signal; and
determine at least one of a facial expression perceived emotion or a body pose perceived emotion of the first person based on the at least one of the facial expression or the body pose of the first person.
15. The subsystem of claim 14, wherein the one or more processors are also configured to:
perform audio signal processing of the audio signal to determine at least one of pitch, vibrato or inflection of speech of the first person, and determine a speech perceived emotion of the first person based on results of the audio signal processing; and
alter the audio signal to increase alignment between the speech perceived emotion of the first person and the semantic emotion of the first person.
16. The subsystem of claim 15, wherein the one or more processors are configured to:
modify image data of the video signal corresponding to at least one of a facial expression or a body pose, to thereby alter the video signal to produce the altered video signal; and
modify audio data of the audio signal corresponding to at least one of the pitch, vibrato, or inflection, to thereby alter the audio signal to produce the altered audio signal.
17. The subsystem of any one of claims 10 through 16, wherein the one or more processors are configured to:
perform natural language processing of the audio signal; and
determine the semantic emotion of the first person based on results of the natural language processing of the audio signal.
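Claim 17 determines the semantic emotion through natural language processing of the audio signal, but the claim does not say how. The toy sketch below assumes the speech has already been transcribed (the speech-recognition step is omitted) and uses a tiny, invented valence/arousal word lexicon purely for illustration.

VALENCE_AROUSAL_LEXICON = {
    "great": (0.8, 0.5), "love": (0.9, 0.6), "excited": (0.7, 0.9),
    "fine": (0.2, 0.1), "tired": (-0.3, -0.6), "terrible": (-0.8, 0.4),
}

def semantic_emotion(transcript):
    # Average the (positiveness, activeness) values of known words in the transcript.
    words = [w.strip(",.!?") for w in transcript.lower().split()]
    hits = [VALENCE_AROUSAL_LEXICON[w] for w in words if w in VALENCE_AROUSAL_LEXICON]
    if not hits:
        return (0.0, 0.0)  # treat an unmatched transcript as neutral
    valence = sum(v for v, _ in hits) / len(hits)
    arousal = sum(a for _, a in hits) / len(hits)
    return (valence, arousal)

print(semantic_emotion("I am so excited, this is great"))  # -> (0.75, 0.7) with this toy lexicon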
18. The subsystem of any one of claims 10 through 17, wherein the one or more processors are configured to:
use a facial circumplex model to quantify a positiveness and an activeness of a facial expression of the first person based on the video signal;
use a pose circumplex model to quantify a positiveness and an activeness of a body pose of the first person based on the video signal;
use a language circumplex model to quantify a positiveness and an activeness of language of the first person based on the audio signal; and
alter image data of the video signal to reduce a distance between the positiveness and the activeness of the facial expression of the first person and the positiveness and the activeness of the language of the first person, and reduce a distance between the positiveness and the activeness of the body pose of the first person and the positiveness and the activeness of the language of the first person.
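Claim 18 reduces circumplex-space distances by altering image data, without prescribing how the image data is modified. The sketch below uses facial landmarks as a stand-in for that image data: the blend factor, the normalization constant, and the landmark coordinates are assumptions for illustration only.

import numpy as np

def circumplex_distance(a, b):
    # Euclidean distance between two (positiveness, activeness) points.
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def blend_landmarks(current_landmarks, target_landmarks, face_va, language_va, max_blend=0.6):
    # Morph detected landmarks toward a target expression; blend more strongly when the
    # facial emotion lies farther from the language emotion on the circumplex.
    distance = circumplex_distance(face_va, language_va)
    alpha = max_blend * min(distance / 2.0, 1.0)  # crude normalization; coordinates assumed in [-1, 1]
    return (1.0 - alpha) * current_landmarks + alpha * target_landmarks

# Example: a neutral mouth contour nudged toward a smiling template.
neutral = np.array([[0.40, 0.70], [0.50, 0.72], [0.60, 0.70]])  # toy 2-D landmark coordinates
smile = np.array([[0.38, 0.66], [0.50, 0.74], [0.62, 0.66]])
print(blend_landmarks(neutral, smile, face_va=(-0.1, 0.0), language_va=(0.8, 0.4)))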
19. The subsystem of claim 18, wherein the one or more processors are also configured to:
use a speech circumplex model to quantify a positiveness and an activeness of speech of the first person based on the audio signal; and
alter audio data of the audio signal to reduce a distance between the positiveness and activeness of the speech of the first person and the positiveness and the activeness of the language of the first person.
20. The subsystem of claim 19, further comprising:
a transmitter configured to transmit the altered video signal and the altered audio signal to a subsystem associated with the second person that is participating in the video chat, to thereby enable the second person to view and listen to video and audio of the first person having increased alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
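Claims 10 through 20 together describe a subsystem built from interfaces, cameras, microphones, processors, and a transmitter. The Python skeleton below shows only one way those pieces might be wired together; every class, parameter, and callable name here is a placeholder invented for this sketch.

from dataclasses import dataclass

@dataclass
class EmotionEstimates:
    perceived_va: tuple  # (positiveness, activeness) inferred from the video and/or speech
    semantic_va: tuple   # (positiveness, activeness) inferred from the spoken words

class EmotionAlignmentSubsystem:
    def __init__(self, perceived_model, semantic_model, video_editor, audio_editor, transmitter):
        self.perceived_model = perceived_model  # stand-in for the facial/pose analysis of claims 12-14
        self.semantic_model = semantic_model    # stand-in for the NLP stage of claim 17
        self.video_editor = video_editor        # alters image data (claim 18)
        self.audio_editor = audio_editor        # alters audio data (claim 19)
        self.transmitter = transmitter          # sends the altered signals to the second person (claim 20)

    def process(self, video_frame, audio_frame):
        estimates = EmotionEstimates(
            perceived_va=self.perceived_model(video_frame, audio_frame),
            semantic_va=self.semantic_model(audio_frame),
        )
        altered_video = self.video_editor(video_frame, estimates)
        altered_audio = self.audio_editor(audio_frame, estimates)
        self.transmitter(altered_video, altered_audio)
        return altered_video, altered_audio

# Usage with trivial stand-ins for every component:
subsystem = EmotionAlignmentSubsystem(
    perceived_model=lambda video, audio: (0.0, 0.0),
    semantic_model=lambda audio: (0.5, 0.3),
    video_editor=lambda video, est: video,
    audio_editor=lambda audio, est: audio,
    transmitter=lambda video, audio: None,
)
print(subsystem.process(video_frame="frame", audio_frame="audio"))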
21. A non-transitory computer-readable medium storing computer instructions that when executed by one or more processors cause the one or more processors to perform the steps of:
obtaining a video signal and an audio signal of a first person participating in a video chat with a second person;
determining one or more types of perceived emotions of the first person based on the video signal;
determining a semantic emotion of the first person based on the audio signal; and
altering the video signal to increase alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
22. The non-transitory computer-readable medium of claim 21, wherein the determining the one or more types of perceived emotions of the first person based on the video signal includes:
detecting at least one of a facial expression or a body pose of the first person based on the video signal; and
determining at least one of a facial expression perceived emotion or a body pose perceived emotion of the first person based on the at least one of the facial expression or the body pose of the first person.
23. The non-transitory computer-readable medium of claim 22, wherein:
the determining the one or more types of perceived emotions of the first person is also based on the audio signal and includes performing audio signal processing of the audio signal to determine at least one of pitch, vibrato, or inflection of speech of the first person, and determining a speech perceived emotion of the first person based on results of the audio signal processing of the audio signal; and
the computer instructions when executed by one or more processors also cause the one or more processors to perform the step of altering the audio signal to increase alignment between the speech perceived emotion of the first person and the semantic emotion of the first person.
24. The non-transitory computer-readable medium of claim 23, wherein:
the altering the video signal to produce the altered video signal includes modifying image data of the video signal corresponding to at least one of a facial expression or a body pose; and
the altering the audio signal to produce the altered audio signal includes modifying audio data of the audio signal corresponding to at least one of the pitch, vibrato, or inflection.
25. The non-transitory computer-readable medium of claim 23 or 24, wherein the computer instructions when executed by one or more processors also cause the one or more processors to perform the step of:
providing the altered video signal and the altered audio signal to a subsystem associated with the second person that is participating in the video chat, to thereby enable the second person to view and listen to images and audio of the first person having increased alignment between at least one of the one or more types of perceived emotions of the first person and the semantic emotion of the first person.
26. The non-transitory computer-readable medium of any one of claims 21 through 25, wherein the determining the semantic emotion of the first person based on the audio signal includes:
performing natural language processing of the audio signal; and
determining the semantic emotion of the first person based on results of the natural language processing of the audio signal.
EP19718999.6A 2019-04-05 2019-04-05 Methods and systems that provide emotion modifications during video chats Pending EP3942552A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2019/026122 WO2020204948A1 (en) 2019-04-05 2019-04-05 Methods and systems that provide emotion modifications during video chats

Publications (1)

Publication Number Publication Date
EP3942552A1 true EP3942552A1 (en) 2022-01-26

Family

ID=66248731

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19718999.6A Pending EP3942552A1 (en) 2019-04-05 2019-04-05 Methods and systems that provide emotion modifications during video chats

Country Status (5)

Country Link
EP (1) EP3942552A1 (en)
JP (1) JP7185072B2 (en)
KR (1) KR102573465B1 (en)
CN (1) CN113646838B (en)
WO (1) WO2020204948A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4170609A1 (en) * 2021-10-21 2023-04-26 Koninklijke Philips N.V. Automated filter selection for altering a stream
KR20230081013A (en) * 2021-11-30 2023-06-07 주식회사 마블러스 Method for human recognition based on deep-learning, and method for magnaing untact education
US20230177755A1 (en) * 2021-12-07 2023-06-08 Electronic Arts Inc. Predicting facial expressions using character motion states

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007193824A (en) 2000-04-13 2007-08-02 Fujifilm Corp Image processing method
US6778252B2 (en) 2000-12-22 2004-08-17 Film Language Film language
JP4772315B2 (en) 2004-11-10 2011-09-14 ソニー株式会社 Information conversion apparatus, information conversion method, communication apparatus, and communication method
US8243116B2 (en) 2007-09-24 2012-08-14 Fuji Xerox Co., Ltd. Method and system for modifying non-verbal behavior for social appropriateness in video conferencing and other computer mediated communications
JP5338350B2 (en) 2009-02-06 2013-11-13 富士ゼロックス株式会社 Information processing apparatus and voice correction program
CN101917585A (en) * 2010-08-13 2010-12-15 宇龙计算机通信科技(深圳)有限公司 Method, device and terminal for regulating video information sent from visual telephone to opposite terminal
US9558425B2 (en) * 2012-08-16 2017-01-31 The Penn State Research Foundation Automatically computing emotions aroused from images through shape modeling
JP6073649B2 (en) * 2012-11-07 2017-02-01 株式会社日立システムズ Automatic voice recognition / conversion system
CN103903627B (en) * 2012-12-27 2018-06-19 中兴通讯股份有限公司 The transmission method and device of a kind of voice data
US9251405B2 (en) * 2013-06-20 2016-02-02 Elwha Llc Systems and methods for enhancement of facial expressions
JP6122792B2 (en) 2014-02-06 2017-04-26 日本電信電話株式会社 Robot control apparatus, robot control method, and robot control program
US9204098B1 (en) 2014-06-30 2015-12-01 International Business Machines Corporation Dynamic character substitution for web conferencing based on sentiment
US9576190B2 (en) * 2015-03-18 2017-02-21 Snap Inc. Emotion recognition in video conferencing
US20180077095A1 (en) * 2015-09-14 2018-03-15 X Development Llc Augmentation of Communications with Emotional Data
CN105847734A (en) * 2016-03-30 2016-08-10 宁波三博电子科技有限公司 Face recognition-based video communication method and system
US10698951B2 (en) * 2016-07-29 2020-06-30 Booktrack Holdings Limited Systems and methods for automatic-creation of soundtracks for speech audio
JP6524049B2 (en) 2016-10-28 2019-06-05 株式会社東芝 Emotion estimation device, emotion estimation method, emotion estimation program, and emotion counting system
CN107705808B (en) * 2017-11-20 2020-12-25 合光正锦(盘锦)机器人技术有限公司 Emotion recognition method based on facial features and voice features
KR101925440B1 (en) * 2018-04-23 2018-12-05 이정도 Method for providing vr based live video chat service using conversational ai

Also Published As

Publication number Publication date
KR20210146372A (en) 2021-12-03
JP2022528691A (en) 2022-06-15
JP7185072B2 (en) 2022-12-06
WO2020204948A1 (en) 2020-10-08
CN113646838A (en) 2021-11-12
CN113646838B (en) 2022-10-11
KR102573465B1 (en) 2023-08-31

Similar Documents

Publication Publication Date Title
US10433052B2 (en) System and method for identifying speech prosody
US10621991B2 (en) Joint neural network for speaker recognition
JP7022062B2 (en) VPA with integrated object recognition and facial expression recognition
US9548048B1 (en) On-the-fly speech learning and computer model generation using audio-visual synchronization
US11837249B2 (en) Visually presenting auditory information
KR102449875B1 (en) Method for translating speech signal and electronic device thereof
US20180182375A1 (en) Method, system, and apparatus for voice and video digital travel companion
KR102573465B1 (en) Method and system for providing emotion correction during video chat
US9870521B1 (en) Systems and methods for identifying objects
JP7118697B2 (en) Point-of-regard estimation processing device, point-of-regard estimation model generation device, point-of-regard estimation processing system, point-of-regard estimation processing method, program, and point-of-regard estimation model
Ringeval et al. Emotion recognition in the wild: Incorporating voice and lip activity in multimodal decision-level fusion
US20230068798A1 (en) Active speaker detection using image data
CN115631267A (en) Method and device for generating animation
Atila et al. Turkish lip-reading using Bi-LSTM and deep learning models
US20200379262A1 (en) Depth map re-projection based on image and pose changes
WO2020087534A1 (en) Generating response in conversation
US20240078731A1 (en) Avatar representation and audio generation
US20240078732A1 (en) Avatar facial expressions based on semantical context
US20240055014A1 (en) Visualizing Auditory Content for Accessibility
Abreu Visual speech recognition for European Portuguese
Chickerur et al. LSTM Based Lip Reading Approach for Devanagiri Script
EP4141867A1 (en) Voice signal processing method and related device therefor
Abbas Improving Arabic Sign Language to support communication between vehicle drivers and passengers from deaf people
Ghafoor et al. Improving social interaction of the visually impaired individuals through conversational assistive technology
CN114627898A (en) Voice conversion method, apparatus, computer device, storage medium and program product

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20211020

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)