US20220394413A1 - Spatial Audio In Video Conference Calls Based On Content Type Or Participant Role - Google Patents

Spatial Audio In Video Conference Calls Based On Content Type Or Participant Role

Info

Publication number
US20220394413A1
Authority
US
United States
Prior art keywords
audio
audiovisual
computing system
conferencing
audiovisual stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US17/339,226
Other versions
US11540078B1 (en)
Inventor
Karsten Seipp
Jae Pum Park
Anton Volkov
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC filed Critical Google LLC
Priority to US17/339,226
Assigned to GOOGLE LLC. Assignors: PARK, JAE PUM; SEIPP, Karsten; VOLKOV, ANTON
Priority to EP22743959.3A
Priority to PCT/US2022/032040
Priority to CN202280018870.4A
Publication of US20220394413A1
Application granted
Publication of US11540078B1
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • G06K 9/00711
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/44 Receiver circuitry for the reception of television signals according to analogue transmission standards
    • H04N 5/60 Receiver circuitry for the reception of television signals according to analogue transmission standards for the sound signals
    • H04N 5/607 Receiver circuitry for the reception of television signals according to analogue transmission standards for more than one sound signal, e.g. stereo, multilanguages
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N 7/147 Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/15 Conference systems
    • G06K 2209/01
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Definitions

  • the present disclosure relates generally to videoconferencing technology. More particularly, the present disclosure relates to spatial audio in video conference calls based on content type or participant role.
  • Multi-attendee video conferencing systems can provide audiovisual streams to a client device for multiple attendees of a video conference. Often, many participants take part in a video conference and may be visualized on a display screen (e.g., visual data from other participants, presented content, shared content, etc.).
  • in existing video conference technologies, the audio portion of each audiovisual stream is consistently placed in the front and center of an audio soundstage associated with the video conference, regardless of the content type, where the participant may appear on screen, or the role of the participant in the conference. This is an unnatural user experience, as humans expect spatial differentiation of sound.
  • as a result, participants may struggle to disambiguate the source of an audio stream from the multiple possible sources (e.g., the multiple other participants).
  • This struggle to disambiguate the source of audio within a videoconference can lead to misunderstanding, fatigue, interruption, inability to separate multiple speakers/audio sources, etc.
  • Each of these drawbacks can lead to longer video conferences which can lead to an increased use of computational resources such as processor usage, memory usage, network bandwidth, etc.
  • One example aspect of the present disclosure is directed to a computer-implemented method for providing spatial audio within a videoconferencing application.
  • the method includes receiving, by a computing system comprising one or more computing devices, a plurality of audiovisual streams respectively associated with a plurality of participants in a video conference, wherein each audiovisual stream comprises audio data and visual data.
  • the method includes, for at least a first audiovisual stream of the plurality of audiovisual streams: determining, by the computing system, a conferencing attribute for the first audiovisual stream, wherein the conferencing attribute is descriptive of one or both of: a content type associated with the first audiovisual stream or a participant role associated with the first audiovisual stream; determining, by the computing system, a first virtual audio location for the first audiovisual stream within an audio soundstage based at least in part on the conferencing attribute; and modifying, by the computing system, the audio data associated with the first audiovisual stream to cause playback of the audio data to have the first virtual audio location within the audio soundstage.
  • the method includes providing, by the computing system, the plurality of audiovisual streams including the first audiovisual stream having the modified audio data for audiovisual playback in the video conference.
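  • As a rough, non-authoritative sketch of the method just described, the following Python snippet receives a set of streams, determines a conferencing attribute for each, maps the attribute to a virtual audio location, and modifies the audio before providing the streams for playback. The class and function names and the attribute-to-location mapping are illustrative assumptions, not the patent's implementation; the soundstage positions mirror examples given later (presentation center-left, presenter top-right, audience bottom-center, translator bottom-right).
```python
# Illustrative sketch only; names and mapping are assumptions, not the patent's code.
from dataclasses import dataclass

@dataclass
class AudiovisualStream:
    participant_id: str
    audio: list                         # mono PCM samples
    visual: object = None               # video frames (unused in this sketch)
    content_type: str = "camera"        # e.g. "camera" or "presentation"
    participant_role: str = "audience"  # e.g. "presenter", "audience", "translator"

# Assumed attribute -> position map for a 2-D soundstage:
# x in [-1 (left), +1 (right)], y in [-1 (bottom), +1 (top)].
SOUNDSTAGE = {
    "presentation": (-0.5, 0.0),   # center-left
    "presenter":    (0.8, 0.8),    # top-right
    "audience":     (0.0, -0.8),   # bottom-center
    "translator":   (0.8, -0.8),   # bottom-right
}

def conferencing_attribute(stream):
    """Determine the conferencing attribute; shared content wins over role here."""
    if stream.content_type == "presentation":
        return "presentation"
    return stream.participant_role

def spatialize(audio, position):
    """Naive linear stereo pan as a placeholder for real spatialization
    (a fuller panning sketch appears later in this document)."""
    x, _y = position
    left_gain, right_gain = (1 - x) / 2, (1 + x) / 2
    return [(sample * left_gain, sample * right_gain) for sample in audio]

def process_conference(streams):
    """Receive streams, locate each in the soundstage, and provide them for playback."""
    out = []
    for stream in streams:
        attr = conferencing_attribute(stream)
        position = SOUNDSTAGE.get(attr, (0.0, 0.0))
        out.append((stream.participant_id, spatialize(stream.audio, position)))
    return out
```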
  • Another example aspect of the present disclosure is directed to a computing system that includes one or more processors and one or more non-transitory, computer-readable media that store instructions that when executed by the one or more processors cause the computing system to perform operations.
  • the operations include receiving, by the computing system, a plurality of audiovisual streams respectively associated with a plurality of participants in a video conference, wherein each audiovisual stream comprises audio data and visual data.
  • the operations include for at least a first audiovisual stream of the plurality of audiovisual streams: determining, by the computing system, a conferencing attribute for the first audiovisual stream, wherein the conferencing attribute is descriptive of one or both of: a content type associated with the first audiovisual stream or a participant role associated with the first audiovisual stream; determining, by the computing system, a first virtual audio location for the first audiovisual stream within an audio soundstage based at least in part on the conferencing attribute; and modifying, by the computing system, the audio data associated with the first audiovisual stream to cause playback of the audio data to have the first virtual audio location within the audio soundstage.
  • the operations include providing, by the computing system, the plurality of audiovisual streams including the first audiovisual stream having the modified audio data for audiovisual playback in the video conference.
  • Another example aspect of the present disclosure is directed to one or more non-transitory computer-readable media that store instructions that, when executed by one or more processors of a computing system, cause the computing system to perform operations to calibrate audio for a participant of a video conference.
  • the operations include causing playback of audio data with an audio playback device while modifying a virtual audio location of the audio data within an audio soundstage.
  • the operations include receiving input data provided by the participant of a video conference during modification of virtual audio location of the audio data within an audio soundstage.
  • the operations include determining, based on the input data provided by the participant of the video conference, a physical location of the participant of the video conference relative to an audio playback device.
  • the operations include using the physical location of the participant of the video conference relative to an audio playback device to modify one or more other audio signals from other participants of the video conference to cause playback of the one or more other audio signals by the audio playback device to have a desired virtual location in an audio soundstage generated for the participant of the video conference during the video conference.
  • FIG. 1 depicts a block diagram of an example computing system according to example embodiments of the present disclosure.
  • FIGS. 2 A and 2 B depict spatial audio modulation based on grouping of audiovisual streams according to example embodiments of the present disclosure.
  • FIG. 3 depicts an example of spatial audio modulation based on content type of audiovisual streams according to example embodiments of the present disclosure.
  • FIG. 4 depicts an example of spatial audio modulation based on participant role of audiovisual streams according to example embodiments of the present disclosure.
  • FIG. 5 depicts an example of spatial audio modulation based on participant role of audiovisual streams according to example embodiments of the present disclosure.
  • FIG. 6 depicts an example of spatial audio modulation based on content type of audiovisual streams according to example embodiments of the present disclosure.
  • Example aspects of the present disclosure are directed to systems and methods which perform spatial audio modulation techniques in video conference calls based on content type or participant role.
  • users can identify—simply by listening—the source of the audio (e.g., who the current speaker is and/or whether the sound came from a specific type of content).
  • each of a number of conference roles and/or content types can be allocated a particular virtual location within the audio soundstage.
  • audio data from some or all of the sources can be modified so that playback of the audio data has the virtual location within the audio soundstage that corresponds to its conference role or content type.
  • participants of the video conference can easily identify and attribute the source of each audio signal included in the video conference.
  • a video conferencing system can receive a plurality of audiovisual streams respectively associated with a plurality of participants in a video conference.
  • Each audiovisual stream can include audio data and visual data.
  • some or all of the participants may be human participants.
  • the visual data can correspond to video that depicts the human participant while the audio data can correspond to audio captured in the environment in which the human participant is located.
  • some of the participants may correspond to content that is being shared among some or all of the other participants.
  • an audiovisual stream can correspond to a shared display or other shared content (e.g., shared by a specific human participant from their device or shared from a third-party source or integration).
  • one audiovisual stream may correspond to multiple human participants (e.g., multiple humans located in a same room using one set of audiovisual equipment).
  • an audiovisual stream (e.g., a display stream shared by a participant) may include dynamic visual data while the audio data for the stream is null or blank.
  • an audiovisual stream may include dynamic audio data while the visual data for the stream is null or blank (e.g., as in the case of a human participant who has their video “turned off”).
  • the term audiovisual stream generally refers to defined streams of content which can include audio and/or video. Multiple streams of content may originate from the same device (e.g., as in the case of a user having a first audiovisual stream for their video/audio presence and a second audiovisual stream which shares content from their device to the other participants).
  • the video conferencing system can determine a conference attribute for each audiovisual stream.
  • the conference attribute can describe how the audiovisual stream relates to the other audiovisual streams in the video conference and/or characteristics of how the audiovisual stream should be perceived by various conference participants.
  • the conference attribute determined for each audiovisual stream can describe or correspond to one or both of: a content type associated with the audiovisual stream or a participant role associated with the audiovisual stream.
  • the video conferencing system can determine a virtual audio location for each audiovisual stream within an audio soundstage based at least in part on the conferencing attribute determined for the audiovisual stream.
  • the video conferencing system can modify the audio data associated with each audiovisual stream to cause playback of the audio data to have the virtual audio location within the audio soundstage that was determined for the audiovisual stream.
  • the video conferencing system can then provide the plurality of audiovisual streams having the modified audio data for audiovisual playback in the video conference.
  • Example techniques can be used to modify the audio data associated with each audiovisual stream to cause playback of the audio data to have the virtual audio location within the audio soundstage that was determined for the audiovisual stream.
  • Example techniques include the use of head-related transfer functions (HRTFs); an HRTF is a response that characterizes how an ear receives a sound from a point in space.
  • Other example techniques include wave field synthesis, surround sound, reverberation, and/or other three-dimensional positional audio techniques.
  • the audio soundstage can be two-dimensional (e.g., with two dimensions that correspond with the axes of an associated display screen) or the audio soundstage can be three-dimensional (e.g., with an added depth dimension).
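  • A full HRTF or wave-field-synthesis renderer is beyond the scope of this summary, but the sketch below illustrates the general idea of imparting a virtual location to audio data using a constant-power stereo pan plus a coarse interaural time difference (Woodworth approximation). This is a simplified stand-in assuming stereo playback; a production system would more likely convolve each stream with measured HRTFs.
```python
# Simplified stand-in for the spatialization step: constant-power panning plus a
# small interaural time delay. Not the patent's implementation; illustrative only.
import numpy as np

SAMPLE_RATE = 48_000
SPEED_OF_SOUND = 343.0   # m/s
HEAD_RADIUS = 0.0875     # m, approximate

def place_in_soundstage(mono: np.ndarray, azimuth_deg: float) -> np.ndarray:
    """Return stereo audio whose perceived source sits at azimuth_deg
    (-90 = far left, 0 = center, +90 = far right)."""
    az = np.deg2rad(np.clip(azimuth_deg, -90.0, 90.0))

    # Constant-power pan law: roughly equal loudness as the source moves.
    pan = (az + np.pi / 2) / np.pi            # 0 .. 1 across the stage
    left_gain = np.cos(pan * np.pi / 2)
    right_gain = np.sin(pan * np.pi / 2)

    # Coarse interaural time difference (Woodworth approximation).
    itd_seconds = (HEAD_RADIUS / SPEED_OF_SOUND) * (az + np.sin(az))
    delay = int(round(abs(itd_seconds) * SAMPLE_RATE))

    if itd_seconds > 0:   # source on the right: the left ear hears it later
        left = np.concatenate([np.zeros(delay), mono])
        right = np.concatenate([mono, np.zeros(delay)])
    else:                 # source on the left (or centered)
        left = np.concatenate([mono, np.zeros(delay)])
        right = np.concatenate([np.zeros(delay), mono])
    return np.stack([left * left_gain, right * right_gain], axis=1)

# Example: place a presenter's audio toward the right of the soundstage.
tone = np.sin(2 * np.pi * 440 * np.arange(SAMPLE_RATE) / SAMPLE_RATE)
stereo = place_in_soundstage(tone, azimuth_deg=60.0)
```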
  • each of a number of conference roles and/or content types can be allocated a particular virtual location within the audio soundstage. Then, some or all of the audiovisual streams included in the video conference can be assigned to the different conference roles and/or content types. Thereafter, audio data from some or all of the audiovisual streams included in the video conference can be modified so that playback of the audio data from each audiovisual stream has the virtual location within the audio soundstage that corresponds to its conference role or content type. In such fashion, participants of the video conference can easily identify and attribute the source of each audio signal included in the video conference.
  • the framework described above can be used to effectuate a number of different use-cases or example applications or user experiences.
  • the conferencing attribute determined for each audiovisual stream can correspond to or be constrained to be one of a plurality of predefined attribute values.
  • the plurality of predefined attribute values can include at least a presentation content type and a presenter participant role.
  • audio associated with an audiovisual stream that has been determined to be a presentation content type can be modified so as to come from a certain virtual audio location associated with presentation content (e.g., center-left) while audio associated with an audiovisual stream that has been determined to be a presenter participant role can be modified so as to come from a different virtual audio location associated with the presenter (e.g., top-right).
  • the plurality of predefined attribute values can include at least a presenter participant role and an audience participant role.
  • audio associated with an audiovisual stream that has been determined to be a presenter role can be modified so as to come from a certain virtual audio location associated with presenters (e.g., top-center).
  • audio associated with an audiovisual stream that has been determined to be an audience participant role can be modified so as to come from a different virtual audio location associated with the audience (e.g., bottom-center).
  • multiple audiovisual streams can be designated as presenter participant roles (e.g., when a “panel” of presenters is present).
  • the plurality of predefined attribute values can include at least a primary speaker participant role and a translator participant role.
  • a primary speaker participant role can include a primary group speaker participant role (e.g., for a panel or fireside chat).
  • audio associated with an audiovisual stream that has been determined to be a primary speaker role can be modified so as to come from a certain virtual audio location associated with primary speakers (e.g., center-center).
  • audio associated with an audiovisual stream that has been determined to be a translator participant role can be modified so as to come from a different virtual audio location associated with the translator (e.g., bottom-right).
  • multiple audiovisual streams can be designated as primary speaker roles (e.g., when multiple persons are speaking in a common language or in different languages).
  • multiple audiovisual streams can be designated as translator roles (e.g., when multiple persons are translating into different languages). Multiple translators may be located at different virtual audio locations.
  • the plurality of predefined attribute values can include at least a captioned content type and a non-captioned content type.
  • audio associated with an audiovisual stream that has been determined to be non-captioned audio can be left unmodified so as to come from the location at which such audio would otherwise have been located.
  • audio associated with an audiovisual stream that has been determined to be captioned audio can be modified so as to come from a particular virtual audio location associated with captioned audio (e.g., bottom-center).
  • audio can be determined to correspond to captioned audio based on internal conference settings or parameters and/or based on a comparison of text generated from the audio to the captioned text.
  • the conferencing attribute can be descriptive of an assignment of the audiovisual stream to one of a plurality of different groupings of the plurality of audiovisual streams.
  • each audiovisual stream can be assigned (e.g., automatically and/or by a participant or moderator) to one of a number of different groups.
  • Each group may be assigned a different virtual audio location in the audio soundstage. Then, the audio from any audiovisual stream can be modified so as to come from the virtual audio location assigned to the group in which the audiovisual stream is currently assigned/included.
  • breakout rooms or multiple sub-meetings can occur within the same video conference, while different virtual audio locations are used to enable participants to distinguish among the audio (e.g., conversations) occurring in each sub-meeting.
  • This example use may facilitate interactive events such as networking events or casual get-togethers occurring in the same video conference, with users being able to move among different sub-meetings to join different discussions or conversations.
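  • The grouping behavior described above could be sketched as follows; the even spread of group azimuths and the function names are assumptions for illustration only.
```python
# Hypothetical grouping sketch: spread N breakout groups evenly across the
# horizontal axis of the soundstage and pan every member's audio to the
# position of the group they are currently assigned to.
def group_positions(group_ids):
    """Assign each group an azimuth between -60 and +60 degrees."""
    n = len(group_ids)
    if n == 1:
        return {group_ids[0]: 0.0}
    return {gid: -60.0 + 120.0 * i / (n - 1) for i, gid in enumerate(group_ids)}

def azimuth_for_stream(stream_id, assignments, positions):
    """assignments maps stream -> group; moving a stream to a new group
    (e.g., a participant joining another conversation) changes its azimuth."""
    return positions[assignments[stream_id]]

assignments = {"alice": "group-1", "bob": "group-1", "carol": "group-2", "dave": "group-3"}
positions = group_positions(["group-1", "group-2", "group-3"])
print(azimuth_for_stream("carol", assignments, positions))  # 0.0 (center group)
```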
  • the conferencing attributes for the audiovisual streams may be preassigned and static (e.g., do not change over the course of the video conference).
  • the conferencing attributes may be dynamic (e.g., change over the course of the video conference). For example, roles can be changed by a moderator or can change automatically based on an automatic determination or analysis.
  • the respective conferencing attributes for the audiovisual streams may be manually controllable.
  • a moderator can control/assign the conferencing attributes for the audiovisual streams so that they are the same for all participants of the video conference (e.g., each video conference participant is receiving the same audio experience).
  • each conference participant may be able to assign the conferencing attributes for the audiovisual streams as played back at their own device (e.g., each video conference participant can have their own different and individually-controllable audio experience).
  • the respective conferencing attributes for the audiovisual streams may be automatically determined.
  • various algorithms or heuristics can be used to automatically determine one or more conferencing attributes for each audiovisual stream.
  • the video conferencing system can recognize text in visual data included in one of the audiovisual streams; perform speech-to-text to generate text from audio data included in another of the audiovisual streams; and identify that other audiovisual stream as having a presenter participant role when the text generated from the audio data matches the text in the visual data.
  • the video conference system can use various tools such as speech-to-text tools, optical character recognition (OCR) tools, etc. to detect when a certain audiovisual stream is giving a presentation of content presented in a different audiovisual stream.
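  • One hedged illustration of such automatic detection: compare a speech-to-text transcript of each stream against OCR text recognized in a shared-content stream and flag the best-matching stream as the presenter. The bag-of-words overlap score and the 0.3 threshold below are arbitrary assumptions, not values from the disclosure.
```python
# Illustrative heuristic (not the patent's algorithm): mark a stream as the
# presenter when text recognized in shared slides (via OCR) substantially
# overlaps with text transcribed from that stream's audio.
def word_overlap(transcript, slide_text):
    spoken = set(transcript.lower().split())
    written = set(slide_text.lower().split())
    if not spoken or not written:
        return 0.0
    return len(spoken & written) / len(spoken)

def identify_presenter(transcripts, slide_text, threshold=0.3):
    """transcripts: dict mapping stream id -> speech-to-text output for a recent window."""
    best_id, best_score = None, 0.0
    for stream_id, transcript in transcripts.items():
        score = word_overlap(transcript, slide_text)
        if score > best_score:
            best_id, best_score = stream_id, score
    return best_id if best_score >= threshold else None

slide_text = "Quarterly revenue grew twelve percent driven by cloud subscriptions"
transcripts = {
    "stream-a": "as you can see revenue grew twelve percent mostly from cloud subscriptions",
    "stream-b": "sorry I think my audio was muted",
}
print(identify_presenter(transcripts, slide_text))  # stream-a
```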
  • various machine learning techniques such as artificial neural networks can be used to automatically determine respective conferencing attributes for the audiovisual streams.
  • machine learning models can be trained using supervised techniques applied to training data collected from manual assignment of respective conferencing attributes to the audiovisual streams.
  • the models can train themselves using unsupervised techniques, such as observing and self-evaluating meeting dynamics.
  • the conference attribute for the first audiovisual stream can be assigned or defined within a calendar invitation associated with the video conference.
  • a creator of the calendar invitation may be able to assign the conference attributes to invited attendees within the calendar invitation.
  • Other attendees may or may not have the ability to modify or request modifications to the conference attributes (e.g., depending upon selected settings).
  • the conferencing attributes which are available for use can be associated with and a function of a predefined template layout or theme selected for the video conference.
  • a number of template layouts can be predefined.
  • Each template layout may have a number of predefined conferencing attribute values which are associated with the template layout.
  • Audiovisual streams included in the video conference can be assigned to fill the different predefined attribute values included in the layout.
  • a layout may correspond to a panel of five presenter roles and a group audience role. Five of the audiovisual streams may be assigned to the five panel positions, with all other audiovisual streams assigned to the group audience role.
  • Example templates may have corresponding visual locations, visual appearance modifications (e.g., a virtual “photo-stand-in” cutout, virtual picture frames, virtual backgrounds etc.), timing characteristics, group shuffling characteristics, or other characteristics.
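  • A template layout of this kind might be represented as a simple configuration object. The sketch below (the field names and the five-presenter panel example are assumptions) shows how predefined role slots, soundstage positions, and other characteristics could be bundled together and then filled with audiovisual streams.
```python
# Hypothetical template definition: a "panel" layout with five presenter slots
# and a shared audience role, each tied to a soundstage position and an
# optional visual treatment. All field names are illustrative only.
PANEL_TEMPLATE = {
    "name": "expert-panel",
    "slots": [
        {"role": "presenter", "count": 5,
         "audio_positions": [(-0.8, 0.8), (-0.4, 0.8), (0.0, 0.8), (0.4, 0.8), (0.8, 0.8)],
         "visual": {"frame": "virtual-picture-frame"}},
        {"role": "audience", "count": None,        # unlimited
         "audio_positions": [(0.0, -0.8)],         # shared bottom-center location
         "visual": {"frame": None}},
    ],
    "shuffle_audience_every_seconds": 300,         # example timing characteristic
}

def assign_to_template(stream_ids, template):
    """Fill the presenter slots first, then pool the rest into the audience role."""
    presenter_count = template["slots"][0]["count"]
    return {
        "presenter": stream_ids[:presenter_count],
        "audience": stream_ids[presenter_count:],
    }
```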
  • audio data associated with an audiovisual stream may be assigned to multiple different virtual audio locations.
  • an example conferencing role may correspond to demonstration or teaching of a musical instrument.
  • audio data from the audiovisual stream that corresponds to speech can be modified to come from a first virtual audio location (e.g., top-right) while audio data from the same audiovisual stream that corresponds to music can be modified to come from a second virtual audio location (e.g., center-center).
  • a video conferencing system can perform source separation on the audio data associated with an audiovisual stream to separate the audio data into first source audio data from a first audio source and second source audio data from a second audio source.
  • the first source audio data and the second source audio can be modified to come from different virtual audio locations.
  • source separation can be performed based on frequency-domain analysis.
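  • As a crude illustration of frequency-domain separation and routing, the sketch below keeps energy in the telephone speech band as the "speech" source and treats the remainder as the "music" source, then sends each to its own virtual location. Real systems would separate sources far more carefully (e.g., with a trained source-separation model); the band edges and azimuths here are assumptions.
```python
# Very crude frequency-domain "source separation" stand-in, for illustration only.
import numpy as np

SAMPLE_RATE = 48_000

def split_speech_music(mono: np.ndarray):
    spectrum = np.fft.rfft(mono)
    freqs = np.fft.rfftfreq(len(mono), d=1.0 / SAMPLE_RATE)
    speech_mask = (freqs >= 300.0) & (freqs <= 3400.0)   # classic telephone band
    speech = np.fft.irfft(np.where(speech_mask, spectrum, 0), n=len(mono))
    music = np.fft.irfft(np.where(speech_mask, 0, spectrum), n=len(mono))
    return speech, music

def route_sources(mono: np.ndarray, pan):
    """pan(audio, azimuth_deg) could be the panning sketch shown earlier."""
    speech, music = split_speech_music(mono)
    # Speech toward the top-right, instrument audio toward the center; the two
    # stereo signals would then be mixed (after length alignment) for playback.
    return {"speech": pan(speech, 60.0), "music": pan(music, 0.0)}
```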
  • the virtual audio location associated with the audio data of an audiovisual stream may be correlated with the video location of the visual data of the audiovisual stream.
  • the video of a presenter may appear in a top-right of a participant's display screen while the corresponding audio is located in the top-right of the soundstage.
  • the virtual audio location associated with the audio data of an audiovisual stream may be decorrelated with the video location of the visual data of the audiovisual stream.
  • the audio of a certain audiovisual stream may be moved around the soundstage, regardless of where the corresponding video is located on display screen(s).
  • the techniques described in the present disclosure can be performed at various different devices.
  • for example, the techniques described herein (e.g., determination of conferencing attribute(s) and resulting modification of audio data) can be performed at a server computing system that facilitates the video conference; this scenario may be advantageous when the audio modifications are consistent/uniform for all participants.
  • alternatively or additionally, the techniques described herein (e.g., determination of conferencing attribute(s) and resulting modification of audio data) can be performed at a client computing device (e.g., a device associated with one of the participants).
  • modification of audio data can be performed on a client computing device via a plug-in or other computer-readable code executed by a browser application executing a video conferencing web application.
  • client-side operations can be performed in a dedicated video conferencing application.
  • the video conferencing system can cause playback of audio data with an audio playback device while modifying a virtual audio location of the audio data within an audio soundstage.
  • the system can receive input data provided by the participant of a video conference during modification of virtual audio location of the audio data within an audio soundstage and determine, based on the input data provided by the participant of the video conference, a physical location of the participant of the video conference relative to an audio playback device.
  • the system can use the physical location of the participant of the video conference relative to an audio playback device to modify one or more other audio signals from other participants of the video conference to cause playback of the one or more other audio signals by the audio playback device to have a desired virtual location in an audio soundstage generated for the participant of the video conference during the video conference.
  • the present disclosure provides a number of technical effects and benefits.
  • the systems and methods of the present disclosure enable improved audio understanding by participants in a multi-attendee video conference. More particularly, the present disclosure modifies audio data from some or all of the sources (e.g., each audiovisual stream) included in the video conference so that playback of the audio data has a virtual location within the audio soundstage that corresponds to its conference role or content type.
  • participants of the video conference can easily identify and attribute the source of each audio signal included in the video conference.
  • Improved and intuitive understanding of audio source attribution can reduce fatigue and provide a user experience that is more understandable and consistent, which may be particularly advantageous for users with visual impairments.
  • the systems and methods of the present disclosure also result in the conservation of computing resources.
  • the systems and methods of the present disclosure enable participants of the video conference to identify and attribute each audio signal included in the video conference to a particular source more easily. This can reduce confusion in a video conference, which can reduce the length of the video conference as fewer misunderstandings may need to be clarified. Shorter video conferences can conserve computational resources such as processor usage, memory usage, network bandwidth, etc. Additionally, users may consume a video conference just as ‘listeners’, where this spatial distribution of audio based on content type, user role, and accessibility-settings may suffice to understand and follow the conference, saving bandwidth through the omission of visual data.
  • the techniques proposed herein may be of particular assistance to visually impaired users, who may not be able to visually determine the identity of the current speaker or audio source.
  • visually impaired users can tell whether a primary presenter or the audience was speaking and/or whether a sound belonged to shared content or an integrated application.
  • FIG. 1 depicts an example client-server environment 100 according to example embodiments of the present disclosure.
  • the client-server environment 100 includes a client computing device 102 and a server computing system 130 that are connected by and communicate through a network 180 .
  • although a single client computing device 102 is depicted, any number of client computing devices 102 can be included in the client-server environment 100 and connect to the server computing system 130 over a network 180 .
  • the client computing device 102 can be any suitable device, including, but not limited to, a smartphone, a tablet, a laptop, a desktop computer, a gaming console, or any other computer device that is configured such that it can allow a user to participate in a video conference.
  • the client computing device 102 can include one or more processor(s) 112 , memory 114 , an associated display device 120 , a video conferencing application 122 , a camera 124 , a microphone 126 , and an audio playback device 128 (e.g., speaker(s) such as stereo speakers).
  • the one or more processor(s) 112 can be any suitable processing device, such as a microprocessor, microcontroller, integrated circuit, or other suitable processing device.
  • the memory 114 can include any suitable computing system or media, including, but not limited to, non-transitory computer-readable media, RAM, ROM, hard drives, flash drives, or other memory devices.
  • the memory 114 can store information accessible by the one or more processor(s) 112 , including instructions that can be executed by the one or more processor(s) 112 .
  • the instructions can be any set of instructions that when executed by the one or more processor(s) 112 , cause the one or more processor(s) 112 to provide the desired functionality.
  • memory 114 can store instructions for video conferencing between the client computing device 102 and the server computing system 130 (e.g., one or more video conferencing applications 122, etc.).
  • the client computing device 102 can implement the instructions to execute aspects of the present disclosure, including directing communications with server computing system 130 , providing a video conferencing application 122 and/or video stream to a user, scaling a received video stream to a different resolution display region, and/or generating and sending instructions to the server computing system requesting a new video stream for a display region.
  • the term “system” can refer to specialized hardware, computer logic that executes on a more general processor, or some combination thereof.
  • a system can be implemented in hardware, application specific circuits, firmware, and/or software controlling a general-purpose processor.
  • the systems can be implemented as program code files stored on a storage device, loaded into memory and executed by a processor or can be provided from computer program products, for example computer executable instructions, that are stored in a tangible computer-readable storage medium such as RAM, hard disk, or optical or magnetic media.
  • Memory 114 can also include data 116 , such as video conferencing data (e.g., captured at the client computing device 102 or received from the server computing system 130 ), that can be retrieved, manipulated, created, or stored by the one or more processor(s) 112 . In some example embodiments, such data can be accessed and displayed to one or more users of the client computing device 102 during a video conference or transmitted to the server computing system 130 .
  • the client computing device 102 can execute a video conferencing application 122 .
  • the video conferencing application 122 is a dedicated, purpose-built video conferencing application.
  • the video conferencing application 122 is a browser application that executes computer-readable code locally (e.g., by processor(s) 112 ) to provide a video conference as a web application.
  • the video conferencing application 122 can capture visual data from a camera 124 and/or audio data from a microphone 126 and transmit that data to the server computing system 130 .
  • the client computing device 102 can receive, from the server computing system 130 , audiovisual data (e.g., audio data and/or visual data) from one or more other participants of the video conference (e.g., other client computing devices 102 ).
  • the client computing device 102 can then display the received visual data to users of the client computing device 102 on associated display device 120 and/or cause playback of the received audio data to users of the client computing device 102 with the audio playback device 128 .
  • the camera 124 collects visual data from one or more users.
  • the camera 124 can be any device capable of capturing visual data.
  • the microphone 126 can be any device capable of capturing audio data.
  • a webcam can serve as both a camera and a microphone.
  • the server computing system 130 can include one or more processor(s) 132 , memory 134 , and a video conferencing system 140 .
  • the memory 134 can store information accessible by the one or more processor(s) 132 , including instructions 138 that can be executed by processor(s) and data 136 .
  • the server computing system 130 can be in communication with one or more client computing device(s) 102 using a network communication device that is not pictured.
  • the network 180 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof.
  • communication between the client computing device 102 and the server computing system 130 can be carried via network interface using any type of wired and/or wireless connection, using a variety of communication protocols (e.g., TCP/IP, HTTP, RTP, RTCP, etc.), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).
  • the server computing system 130 can include a video conferencing system 140 .
  • the video conferencing system 140 can be configured to facilitate operation of the video conferencing application 122 executed by one or more client computing devices 102 .
  • the video conferencing system 140 can receive audiovisual streams from a plurality of client computing devices 102 (e.g., via network 180 ) respectively associated with a plurality of video conference attendees.
  • the video conferencing system 140 can provide the audiovisual streams to each of the client computing devices 102 .
  • the video conferencing application 122 and/or the video conferencing system 140 can operate independently or collaboratively to perform any of the techniques described herein.
  • FIGS. 2 A and 2 B depict spatial audio modulation based on grouping of audiovisual streams according to example embodiments of the present disclosure.
  • FIG. 2 A shows a base user interface 200 for a video conference application.
  • the user interface 200 displays visual data from a plurality of audiovisual streams respectively associated with a plurality of participants in a video conference.
  • Each audiovisual stream can include audio data and visual data.
  • some or all of the participants may be human participants.
  • the visual data can correspond to video that depicts the human participant while the audio data can correspond to audio captured in the environment in which the human participant is located.
  • regions 202 and 204 of the user interface correspond to video that depicts two different human participants of the video conference.
  • some of the participants may correspond to content that is being shared among some or all of the other participants.
  • an audiovisual stream can correspond to a shared display or other shared content (e.g., shared by a specific human participant from their device or shared from a third-party source or integration).
  • one audiovisual stream may correspond to multiple human participants (e.g., multiple humans located in a same room using one set of audiovisual equipment).
  • an audiovisual stream (e.g., a display stream shared by a participant) may include dynamic visual data while the audio data for the stream is null or blank.
  • an audiovisual stream may include dynamic audio data while the visual data for the stream is null or blank (e.g., as in the case of a human participant who has their video “turned off”).
  • the term audiovisual stream generally refers to defined streams of content which can include audio and/or video. Multiple streams of content may originate from the same device (e.g., as in the case of a user having a first audiovisual stream for their video/audio presence and a second audiovisual stream which shares content from their device to the other participants).
  • playback of the audio data associated with the audiovisual streams can consistently come from a same virtual location on a soundstage (e.g., center-center).
  • playback of audio data from each audiovisual stream can come from a respective virtual location that is correlated with the location of the corresponding visual data within the user interface 200 (e.g., playback of audio data associated with the visual data contained in region 204 may have a virtual location in the top-right of the soundstage).
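  • When such correlation with the on-screen layout is desired, a tile's position in the user interface can be mapped to a soundstage position, e.g. as in the following sketch (the coordinate conventions are assumptions).
```python
# Illustrative mapping from a tile's on-screen rectangle to a soundstage position,
# so that a participant shown in the top-right of the grid is also heard top-right.
def tile_to_soundstage(tile_rect, screen_size):
    """tile_rect = (x, y, width, height) in pixels; returns (x, y) in [-1, 1]."""
    x, y, w, h = tile_rect
    screen_w, screen_h = screen_size
    cx = (x + w / 2) / screen_w          # 0 .. 1 across the screen
    cy = (y + h / 2) / screen_h          # 0 at the top .. 1 at the bottom
    return 2 * cx - 1, 1 - 2 * cy        # right-positive, top-positive

# A region occupying the top-right quarter of a 1920x1080 window:
print(tile_to_soundstage((960, 0, 960, 540), (1920, 1080)))  # (0.5, 0.5)
```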
  • the video conferencing system can determine a virtual audio location for each audiovisual stream within an audio soundstage based at least in part on the conferencing attribute determined for the audiovisual stream.
  • the video conferencing system can modify the audio data associated with each audiovisual stream to cause playback of the audio data to have the virtual audio location within the audio soundstage that was determined for the audiovisual stream.
  • the video conferencing system can then provide the plurality of audiovisual streams having the modified audio data for audiovisual playback in the video conference.
  • This framework can be used to effectuate a number of different use-cases or example applications or user experiences.
  • the conferencing attribute determined for each audiovisual stream can be descriptive of assignment of each audiovisual stream to one of a plurality of different groupings of the plurality of audiovisual streams.
  • each audiovisual stream can be assigned (e.g., automatically and/or by a participant or moderator) to one of a number of different groups.
  • Each group may be assigned a different virtual audio location in the audio soundstage. Then, the audio from any audiovisual stream can be modified so as to come from the virtual audio location assigned to the group in which the audiovisual stream is currently assigned/included.
  • breakout rooms or multiple sub-meetings can occur within the same video conference, while different virtual audio locations are used to enable participants to distinguish among the audio (e.g., conversations) occurring in each sub-meeting.
  • This example use may facilitate interactive events such as networking events or casual get-togethers occurring in the same video conference, with users being able to move among different sub-meetings to join different discussions or conversations.
  • FIG. 2 B illustrates an example user interface 250 in which the audiovisual streams have been assigned to groups. Specifically, simply as an example, three groups have been generated, with four audiovisual streams assigned to each group. Each group may be assigned a different virtual audio location in the audio soundstage. For example, group 252 may be assigned a virtual audio location in the top-left of the audio soundstage.
  • the audio from any audiovisual stream can be modified so as to come from the virtual audio location assigned to the group in which the audiovisual stream is currently assigned/included.
  • playback of the audio from the audiovisual stream shown at 254 can come from the virtual location assigned to group 252 .
  • spatial modulation of sound can be used to indicate group affiliation in the main call. For example, before or instead of breaking away from a larger video conference meeting into sub-meetings (breakout rooms), users can be grouped on the screen in different two-dimensional positions. The sound of users in that group can be modulated in three-dimensional space to come directly from that direction. This allows multiple groups to talk simultaneously, but users can easily distinguish and find their group on screen by following the modulated sound related to the screen-position of their group.
  • FIG. 3 depicts an example of spatial audio modulation based on content type of audiovisual streams according to example embodiments of the present disclosure.
  • audiovisual streams can be assigned (potentially among other possible designations) as a presentation content type and a presenter participant role.
  • audio associated with the audiovisual stream 302 that has been determined to be a presentation content type can be modified so as to come from a certain virtual audio location associated with presentation content (e.g., center-left) while audio associated with an audiovisual stream 304 that has been determined to be a presenter participant role can be modified so as to come from a different virtual audio location associated with the presenter (e.g., top-right).
  • sound coming from another source, such as a live-stream from another event or platform that is being included in the video conference, could be modulated to come from the bottom-right.
  • spatial modulation of sound can be performed based on content type. This may have improved “accessibility” or other benefits for persons with disabilities.
  • the layout can be split: The presentation can be shown on one side (e.g., left), the presenter on the other (e.g., right), and the audience in a different position.
  • example implementations can modulate the sound of the presentation (e.g., a presented video) to come from a different direction than the presenter.
  • presenter and presented content can be separated audibly and can be mixed separately.
  • the video conference system can boost the presenter's voice while filtering harsh sounds from the presented material.
  • the audience can be allocated yet another space in the 3D soundscape.
  • Voice(s) belonging to people in this group can be modulated differently from presentation and presenter and are thus easy to identify.
  • users can identify the content type through sound modulation. Users can focus their attention on a specific type of content, while still being able to listen to other content.
  • FIG. 4 depicts an example of spatial audio modulation based on participant role of audiovisual streams according to example embodiments of the present disclosure.
  • audiovisual streams can be assigned (potentially among other possible designations) as a presenter participant role, an audience participant role, and a translator participant role.
  • audio associated with the audiovisual stream 402 that has been determined to be a presenter role can be modified so as to come from a certain virtual audio location associated with presenters (e.g., top-left) while audio associated with an audiovisual stream 404 that has been determined to be an audience participant role can be modified so as to come from a different virtual audio location associated with the audience (e.g., bottom-left).
  • audio associated with the audiovisual stream 406 that has been determined to be a translator role can be modified so as to come from a certain virtual audio location associated with translators (e.g., bottom-right).
  • spatial modulation of sound can be performed based on participant role. This may have improved “accessibility” or other benefits for persons with disabilities. For example, a specific screen position can be reserved for one or more key persons in the meeting. These streams can then be associated with a two- or three-dimensional virtual sound position. Users with visual impairment will be able to distinguish that person's voice and can tell by the specific two- or three-dimensional audio coordinates that the person is important, for example that they are the presenter or CEO currently talking, without needing to identify them via visual means.
  • Two- or three-dimensional sound modulation can be used to assign distinguishable roles to different types of video conference participants.
  • sounds from the teacher in a classroom may always come from the top of the sound space while sound from the students may come from the bottom of the sound space.
  • sound from people in a panel may always come from the top of the sound space while audience questions always come from the bottom of the sound space.
  • multiple audiovisual streams can be designated as presenter participant roles (e.g., when a “panel” of presenters is present). For example, this scenario is illustrated in FIG. 5 in which four streams (e.g., including stream 502 ) have been designated as an expert panel and all other audiovisual streams (e.g., including stream 504 ) have been assigned to the group audience role.
  • FIG. 6 depicts an example of spatial audio modulation based on content type of audiovisual streams according to example embodiments of the present disclosure.
  • predefined attribute values which can be assigned to audiovisual streams can include at least a captioned content type and a non-captioned content type.
  • audio associated with an audiovisual stream that has been determined to be non-captioned audio can be left unmodified so as to come from the location at which such audio would otherwise have been located, while audio associated with an audiovisual stream that has been determined to be captioned audio can be modified so as to come from a particular virtual audio location associated with captioned audio (e.g., bottom-center).
  • the audiovisual stream depicted in region 602 is being captioned (e.g., as shown by caption 604 ).
  • audio from the audiovisual stream depicted in region 602 can be modified so that playback of the audio comes from a bottom-center virtual location on the soundstage (e.g., regardless of the visual location of the audiovisual stream depicted in region 602 ).
  • alternatively, captioned audio may be modulated to come from the location that it would have come from if it were not captioned.
  • the video conferencing system can cause playback of audio data with an audio playback device while modifying a virtual audio location of the audio data within an audio soundstage.
  • the system can receive input data provided by the participant of a video conference during modification of virtual audio location of the audio data within an audio soundstage and determine, based on the input data provided by the participant of the video conference, a physical location of the participant of the video conference relative to an audio playback device.
  • the system can use the physical location of the participant of the video conference relative to an audio playback device to modify one or more other audio signals from other participants of the video conference to cause playback of the one or more other audio signals by the audio playback device to have a desired virtual location in an audio soundstage generated for the participant of the video conference during the video conference.
  • example implementations can use three-dimensional sound panning to help the user position themselves correctly in front of the screen.
  • the video conference system can pan sound in three-dimensions and the user can indicate where the sound is coming from.
  • the video conference system can spatially manipulate the sound in three-dimensions and then the user can indicate when or where the user felt that the audio was centrally located from the user's perspective.
  • the video conference system can evaluate the response and thus define the position of the user.
  • the video conference system can use this information to modulate three-dimensional sound in the ensuing meeting. Using the three-dimensional sound manipulation, the user's position can be corrected. While the user may still sit in a different place, the video conference system can compensate for this by realigning the soundstage.
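  • A hypothetical version of this calibration flow is sketched below: a test tone is swept across the soundstage, the participant indicates when it sounds centered, and the azimuth at that moment becomes an offset applied to later virtual locations. The sweep range, step count, and callback interface are illustrative assumptions, not details taken from the disclosure.
```python
# Hypothetical calibration sketch; the play/keypress callbacks are supplied by
# the surrounding application and are assumptions for illustration.
import time

def sweep_azimuths(start=-60.0, stop=60.0, steps=25, dwell_seconds=0.2):
    for i in range(steps):
        yield start + (stop - start) * i / (steps - 1)
        time.sleep(dwell_seconds)   # let the participant hear each position

def calibrate(play_at_azimuth, user_pressed_key):
    """play_at_azimuth(az) renders a test tone at azimuth az;
    user_pressed_key() returns True when the participant reports it as centered."""
    centered_at = 0.0
    for az in sweep_azimuths():
        play_at_azimuth(az)
        if user_pressed_key():
            centered_at = az
            break
    # If the tone only sounds centered when rendered at, say, +15 degrees, the
    # listener is sitting off-axis; later targets are shifted by the same amount
    # so they land where intended from that listener's position.
    return centered_at

def apply_offset(target_azimuth, offset):
    """Realign the soundstage for this listener in the ensuing meeting."""
    return max(-90.0, min(90.0, target_azimuth + offset))
```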
  • Another example aspect of the present disclosure is directed to techniques that make use of signal processing techniques to generate at least one feature parameter extracted from the audio data and/or the video data to determine the virtual audio location for the first audiovisual stream.
  • the audio data and/or video data are processed by using signal processing techniques to generate the one or more feature parameters indicative of particular conferencing attributes, e.g. voice recognition techniques and/or image recognition techniques to identify the primary speaker in the respective audiovisual stream.
  • the conferencing attribute for the first audiovisual stream can thus be determined directly from the audio data and/or the video data of the first audiovisual stream.
  • the first virtual audio location for the first audiovisual stream can then be determined by evaluating the at least one feature parameter and using the result of that evaluation.
  • for example, when the feature parameter identifies the stream as the presenter, the first virtual audio location can be determined to be the top-right location shown in FIG. 3 .
  • a location characteristic is provided to the audio data associated with the first audiovisual stream based on the first virtual audio location.
  • the audio data are transformed using signal processing techniques to provide the virtual audio location to the audio data of the first audiovisual stream so that during playback the listener experiences that the first audiovisual stream comes from the first virtual audio location.
  • the technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems.
  • the inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components.
  • processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination.
  • Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.

Abstract

Systems and methods for multi-attendee video conferencing are described. A system can perform spatial audio modulation techniques in video conference calls based on content type or participant role. In particular, by assigning user roles and content types to specific regions in a two- or three-dimensional audio sound space or “soundstage,” users can identify—simply by listening—the source of the audio (e.g., who the current speaker is and/or whether the sound came from a specific type of content). Thus, in example implementations of the present disclosure, each of a number of conference roles and/or content types can be allocated a particular virtual location within the audio soundstage.

Description

    FIELD
  • The present disclosure relates generally to videoconferencing technology. More particularly, the present disclosure relates to spatial audio in video conference calls based on content type or participant role.
  • BACKGROUND
  • Multi-attendee video conferencing systems can provide audiovisual streams to a client device for multiple attendees of a video conference. Often, many participants take part in a video conference and may be visualized on a display screen (e.g., visual data from other participants, presented content, shared content, etc.).
  • However, in existing video conference technologies, the audio portion of each audiovisual stream is consistently placed in the front and center of an audio soundstage associated with the video conference, regardless of the content type, where the participant may appear on screen, or the role of the participant in the conference. This is an unnatural user experience, as humans expect spatial differentiation of sound.
  • As such, participants may struggle to disambiguate the source of an audio stream from the multiple possible sources (e.g., the multiple other participants). This struggle to disambiguate the source of audio within a videoconference can lead to misunderstanding, fatigue, interruption, inability to separate multiple speakers/audio sources, etc. Each of these drawbacks can lead to longer video conferences which can lead to an increased use of computational resources such as processor usage, memory usage, network bandwidth, etc.
  • SUMMARY
  • Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.
  • One example aspect of the present disclosure is directed to a computer-implemented method for providing spatial audio within a videoconferencing application. The method includes receiving, by a computing system comprising one or more computing devices, a plurality of audiovisual streams respectively associated with a plurality of participants in a video conference, wherein each audiovisual stream comprises audio data and visual data. The method includes, for at least a first audiovisual stream of the plurality of audiovisual streams: determining, by the computing system, a conferencing attribute for the first audiovisual stream, wherein the conferencing attribute is descriptive of one or both of: a content type associated with the first audiovisual stream or a participant role associated with the first audiovisual stream; determining, by the computing system, a first virtual audio location for the first audiovisual stream within an audio soundstage based at least in part on the conferencing attribute; and modifying, by the computing system, the audio data associated with the first audiovisual stream to cause playback of the audio data to have the first virtual audio location within the audio soundstage. The method includes providing, by the computing system, the plurality of audiovisual streams including the first audiovisual stream having the modified audio data for audiovisual playback in the video conference.
  • Another example aspect of the present disclosure is directed to a computing system that includes one or more processors and one or more non-transitory, computer-readable media that store instructions that when executed by the one or more processors cause the computing system to perform operations. The operations include receiving, by the computing system, a plurality of audiovisual streams respectively associated with a plurality of participants in a video conference, wherein each audiovisual stream comprises audio data and visual data. The operations include, for at least a first audiovisual stream of the plurality of audiovisual streams: determining, by the computing system, a conferencing attribute for the first audiovisual stream, wherein the conferencing attribute is descriptive of one or both of: a content type associated with the first audiovisual stream or a participant role associated with the first audiovisual stream; determining, by the computing system, a first virtual audio location for the first audiovisual stream within an audio soundstage based at least in part on the conferencing attribute; and modifying, by the computing system, the audio data associated with the first audiovisual stream to cause playback of the audio data to have the first virtual audio location within the audio soundstage. The operations include providing, by the computing system, the plurality of audiovisual streams including the first audiovisual stream having the modified audio data for audiovisual playback in the video conference.
  • Another example aspect of the present disclosure is directed to one or more non-transitory computer-readable media that store instructions that, when executed by one or more processors of a computing system, cause the computing system to perform operations to calibrate audio for a participant of a video conference. The operations include causing playback of audio data with an audio playback device while modifying a virtual audio location of the audio data within an audio soundstage. The operations include receiving input data provided by the participant of a video conference during modification of virtual audio location of the audio data within an audio soundstage. The operations include determining, based on the input data provided by the participant of the video conference, a physical location of the participant of the video conference relative to an audio playback device. The operations include using the physical location of the participant of the video conference relative to an audio playback device to modify one or more other audio signals from other participants of the video conference to cause playback of the one or more other audio signals by the audio playback device to have a desired virtual location in an audio soundstage generated for the participant of the video conference during the video conference.
  • Other aspects of the present disclosure are directed to various systems, apparatuses, non-transitory computer-readable media, user interfaces, and electronic devices.
  • These and other features, aspects, and advantages of various embodiments of the present disclosure will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate example embodiments of the present disclosure and, together with the description, serve to explain the related principles.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Detailed discussion of embodiments directed to one of ordinary skill in the art is set forth in the specification, which makes reference to the appended figures, in which:
  • FIG. 1 depicts a block diagram of an example computing system according to example embodiments of the present disclosure.
  • FIGS. 2A and 2B depict spatial audio modulation based on grouping of audiovisual streams according to example embodiments of the present disclosure.
  • FIG. 3 depicts an example of spatial audio modulation based on content type of audiovisual streams according to example embodiments of the present disclosure.
  • FIG. 4 depicts an example of spatial audio modulation based on participant role of audiovisual streams according to example embodiments of the present disclosure.
  • FIG. 5 depicts an example of spatial audio modulation based on participant role of audiovisual streams according to example embodiments of the present disclosure.
  • FIG. 6 depicts an example of spatial audio modulation based on content type of audiovisual streams according to example embodiments of the present disclosure.
  • Reference numerals that are repeated across plural figures are intended to identify the same features in various implementations.
  • DETAILED DESCRIPTION
  • Example aspects of the present disclosure are directed to systems and methods which perform spatial audio modulation techniques in video conference calls based on content type or participant role. In particular, by assigning user roles and content types to specific regions in a two- or three-dimensional audio sound space or “soundstage,” users can identify—simply by listening—the source of the audio (e.g., who the current speaker is and/or whether the sound came from a specific type of content). Thus, in example implementations of the present disclosure, each of a number of conference roles and/or content types can be allocated a particular virtual location within the audio soundstage. Then, audio data from some or all of the sources (e.g., each audiovisual stream included in the video conference) can be modified so that playback of the audio data has the virtual location within the audio soundstage that corresponds to its conference role or content type. In such fashion, participants of the video conference can easily identify and attribute the source of each audio signal included in the video conference.
  • More particularly, a video conferencing system can receive a plurality of audiovisual streams respectively associated with a plurality of participants in a video conference. Each audiovisual stream can include audio data and visual data. In some implementations, some or all of the participants may be human participants. For example, the visual data can correspond to video that depicts the human participant while the audio data can correspond to audio captured in the environment in which the human participant is located. In some implementations, some of the participants may correspond to content that is being shared among some or all of the other participants. For example, an audiovisual stream can correspond to a shared display or other shared content (e.g., shared by a specific human participant from their device or shared from a third-party source or integration). In another example, one audiovisual stream may correspond to multiple human participants (e.g., multiple humans located in a same room using one set of audiovisual equipment).
  • In some implementations, an audiovisual stream (e.g., a display stream shared by a participant) may include dynamic visual data while the audio data for the stream is null or blank. In other implementations, an audiovisual stream may include dynamic audio data while the visual data for the stream is null or blank (e.g., as in the case of a human participant who has their video “turned off”). Thus, as used herein, the term audiovisual stream generally refers to defined streams of content which can include audio and/or video. Multiple streams of content may originate from the same device (e.g., as in the case of a user having a first audiovisual stream for their video/audio presence and a second audiovisual stream which shares content from their device to the other participants).
  • According to an aspect of the present disclosure, for some or all of the audiovisual streams included in a video conference, the video conferencing system can determine a conference attribute for each audiovisual stream. The conference attribute can describe how the audiovisual stream relates to the other audiovisual streams in the video conference and/or characteristics of how the audiovisual stream should be perceived by various conference participants. As examples, the conference attribute determined for each audiovisual stream can describe or correspond to one or both of: a content type associated with the audiovisual stream or a participant role associated with the audiovisual stream.
  • According to another aspect of the present disclosure, the video conferencing system can determine a virtual audio location for each audiovisual stream within an audio soundstage based at least in part on the conferencing attribute determined for the audiovisual stream. The video conferencing system can modify the audio data associated with each audiovisual stream to cause playback of the audio data to have the virtual audio location within the audio soundstage that was determined for the audiovisual stream. The video conferencing system can then provide the plurality of audiovisual streams having the modified audio data for audiovisual playback in the video conference.
  • Various techniques can be used to modify the audio data associated with each audiovisual stream to cause playback of the audio data to have the virtual audio location within the audio soundstage that was determined for the audiovisual stream. Example techniques include the use of head-related transfer functions, each of which is a response that characterizes how an ear receives a sound from a point in space. Other example techniques include wave field synthesis, surround sound, reverberation, and/or other three-dimensional positional audio techniques. The audio soundstage can be two-dimensional (e.g., with two dimensions that correspond with the axes of an associated display screen) or the audio soundstage can be three-dimensional (e.g., with an added depth dimension).
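  • For illustration only, the following sketch shows one minimal way to give a mono signal a left/right virtual location using constant-power stereo panning, a deliberately simple stand-in for the richer techniques named above (e.g., head-related transfer functions or wave field synthesis). The function and parameter names are illustrative assumptions rather than part of any particular implementation described herein.

```python
import numpy as np

def pan_to_virtual_location(mono: np.ndarray, azimuth: float) -> np.ndarray:
    """Place a mono signal on a stereo soundstage with constant-power panning.

    azimuth: -1.0 (far left) .. 0.0 (center) .. +1.0 (far right).
    Returns an array of shape (num_samples, 2) holding the left and right channels.
    """
    azimuth = float(np.clip(azimuth, -1.0, 1.0))
    # Map azimuth onto [0, pi/2] so the two channel gains remain equal-power.
    theta = (azimuth + 1.0) * np.pi / 4.0
    left_gain = np.cos(theta)
    right_gain = np.sin(theta)
    return np.stack([mono * left_gain, mono * right_gain], axis=-1)
```

An HRTF-based implementation would instead convolve the signal with direction-specific left-ear and right-ear impulse responses, which can additionally convey elevation and depth; the sketch above conveys only the left/right dimension of the soundstage.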
  • Thus, in example implementations of the present disclosure, each of a number of conference roles and/or content types can be allocated a particular virtual location within the audio soundstage. Then, some or all of the audiovisual streams included in the video conference can be assigned to the different conference roles and/or content types. Thereafter, audio data from some or all of the audiovisual streams included in the video conference can be modified so that playback of the audio data from each audiovisual stream has the virtual location within the audio soundstage that corresponds to its conference role or content type. In such fashion, participants of the video conference can easily identify and attribute the source of each audio signal included in the video conference.
  • The framework described above can be used to effectuate a number of different use-cases or example applications or user experiences. In some examples, the conferencing attribute determined for each audiovisual stream can correspond to or be constrained to be one of a plurality of predefined attribute values.
  • As one example, the plurality of predefined attribute values can include at least a presentation content type and a presenter participant role. Thus, audio associated with an audiovisual stream that has been determined to be a presentation content type can be modified so as to come from a certain virtual audio location associated with presentation content (e.g., center-left) while audio associated with an audiovisual stream that has been determined to be a presenter participant role can be modified so as to come from a different virtual audio location associated with the presenter (e.g., top-right).
  • In another example, the plurality of predefined attribute values can include at least a presenter participant role and an audience participant role. Thus, audio associated with an audiovisual stream that has been determined to be a presenter role can be modified so as to come from a certain virtual audio location associated with presenters (e.g., top-center). On the other hand, audio associated with an audiovisual stream that has been determined to be an audience participant role can be modified so as to come from a different virtual audio location associated with the audience (e.g., bottom-center). In some implementations, multiple audiovisual streams can be designated as presenter participant roles (e.g., when a “panel” of presenters is present).
  • In yet another example, the plurality of predefined attribute values can include at least a primary speaker participant role and a translator participant role. For example, a primary speaker participant role can include a primary group speaker participant role (e.g., a panel or fireside chat). Thus, audio associated with an audiovisual stream that has been determined to be a primary speaker role can be modified so as to come from a certain virtual audio location associated with primary speakers (e.g., center-center). On the other hand, audio associated with an audiovisual stream that has been determined to be a translator participant role can be modified so as to come from a different virtual audio location associated with the translator (e.g., bottom-right). In some implementations, multiple audiovisual streams can be designated as primary speaker roles (e.g., when multiple persons are speaking in a common language or in different languages). In some implementations, multiple audiovisual streams can be designated as translator roles (e.g., when multiple persons are translating into different languages). Multiple translators may be located at different virtual audio locations.
  • In another example, the plurality of predefined attribute values can include at least a captioned content type and a non-captioned content type. Thus, audio associated with an audiovisual stream that has been determined to be non-captioned audio can be left unmodified so as to come from the location that such audio would have otherwise been located. On the other hand, audio associated with an audiovisual stream that has been determined to be captioned audio can be modified so as to come from a particular virtual audio location associated with captioned audio (e.g., bottom-center). For example, audio can be determined to correspond to captioned audio based on internal conference settings or parameters and/or based on a comparison of text generated from the audio to the captioned text.
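  • Taken together, the example placements above amount to a lookup from predefined attribute values to soundstage coordinates. A minimal sketch of such a table follows; the attribute names, the coordinate convention (x from -1 at the left to +1 at the right, y from -1 at the bottom to +1 at the top), and the particular placements are illustrative assumptions only.

```python
# Illustrative mapping of predefined conferencing attribute values to
# virtual audio locations (x: -1 left .. +1 right, y: -1 bottom .. +1 top).
SOUNDSTAGE_LOCATIONS = {
    "presentation_content": (-0.5, 0.0),   # center-left
    "presenter_role":       (0.8, 0.8),    # top-right
    "audience_role":        (0.0, -0.8),   # bottom-center
    "primary_speaker_role": (0.0, 0.0),    # center-center
    "translator_role":      (0.8, -0.8),   # bottom-right
    "captioned_content":    (0.0, -0.8),   # bottom-center
}

def virtual_location_for(attribute: str, default=(0.0, 0.0)):
    """Return the soundstage coordinates allocated to a conferencing attribute."""
    return SOUNDSTAGE_LOCATIONS.get(attribute, default)
```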
  • As another example, in some implementations, the conferencing attribute can be descriptive of an assignment of the audiovisual stream to one of a plurality of different groupings of the plurality of audiovisual streams. For example, each audiovisual stream can be assigned (e.g., automatically and/or by a participant or moderator) to one of a number of different groups. Each group may be assigned a different virtual audio location in the audio soundstage. Then, the audio from any audiovisual stream can be modified so as to come from the virtual audio location assigned to the group in which the audiovisual stream is currently assigned/included. In such fashion, breakout rooms or multiple sub-meetings can occur within the same video conference, while different virtual audio locations are used to enable participants to distinguish among the audio (e.g., conversations) occurring in each sub-meeting. This example use may allow interactive events such as networking events or casual get-togethers to occur in the same video conference, with users being able to move in and among different sub-meetings to join different discussions or conversations.
  • In some implementations, the conferencing attributes for the audiovisual streams (e.g., the content types or participant roles) may be preassigned and static (e.g., do not change over the course of the video conference). In other implementations, the conferencing attributes may be dynamic (e.g., change over the course of the video conference). For example, roles can be changed by a moderator or can change automatically based on an automatic determination or analysis.
  • In some implementations, the respective conferencing attributes for the audiovisual streams may be manually controllable. For example, a moderator can control/assign the conferencing attributes for the audiovisual streams so that they are the same for all participants of the video conference (e.g., each video conference participant is receiving the same audio experience). In another example, each conference participant may be able to assign the conferencing attributes for the audiovisual streams as played back at their own device (e.g., each video conference participant can have their own different and individually-controllable audio experience).
  • In some implementations, the respective conferencing attributes for the audiovisual streams may be automatically determined. For example, various algorithms or heuristics can be used to automatically determine one or more conferencing attributes for each audiovisual stream. As one example, the video conferencing system can recognize text in visual data included in one of the audiovisual streams; perform speech-to-text to generate text from audio data included in another of the audiovisual streams; and identify the another of the audiovisual streams as a presenter participant role when the text generated from audio data matches the text in the visual data. Stated differently, the video conference system can use various tools such as speech-to-text tools, optical character recognition (OCR) tools, etc. to detect when a certain audiovisual stream is giving a presentation of content presented in a different audiovisual stream. In another example, various machine learning techniques such as artificial neural networks can be used to automatically determine respective conferencing attributes for the audiovisual streams. For example, machine learning models can be trained using supervised techniques applied to training data collected from manual assignment of respective conferencing attributes to the audiovisual streams. In another example, the models can train themselves using unsupervised techniques, such as observing and self-evaluating meeting dynamics.
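  • As a concrete illustration of the speech-to-text/OCR heuristic described above, the sketch below flags an audiovisual stream as having a presenter participant role when its transcript overlaps strongly with text recognized in another stream's visual data. The transcript and screen text are assumed to come from whatever speech-to-text and OCR tools are available; the token-overlap measure and threshold are arbitrary illustrative assumptions.

```python
def _content_tokens(text: str) -> set:
    # Ignore very short words so the overlap measure focuses on content terms.
    return {token for token in text.lower().split() if len(token) > 3}

def looks_like_presenter(speech_transcript: str, shared_screen_text: str,
                         min_overlap: float = 0.3) -> bool:
    """Heuristic: a participant whose speech closely tracks text visible in a
    presented stream is likely the presenter of that content.

    speech_transcript: text generated from a stream's audio data by a
        speech-to-text tool.
    shared_screen_text: text recognized in another stream's visual data by an
        OCR tool.
    """
    spoken = _content_tokens(speech_transcript)
    shown = _content_tokens(shared_screen_text)
    if not spoken or not shown:
        return False
    overlap = len(spoken & shown) / len(spoken)
    return overlap >= min_overlap
```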
  • In some implementations, the conference attribute for the first audiovisual stream can be assigned or defined within a calendar invitation associated with the video conference. For example, a creator of the calendar invitation may be able to assign the conference attributes to invited attendees within the calendar invitation. Other attendees may or may not have the ability to modify or request modifications to the conference attributes (e.g., depending upon selected settings).
  • In some implementations, the conferencing attributes which are available for use (e.g., the predefined attribute values) can be associated with and a function of a predefined template layout or theme selected for the video conference. For example, a number of template layouts can be predefined. Each template layout may have a number of predefined conferencing attribute values which are associated with the template layout. Audiovisual streams included in the video conference can be assigned to fill the different predefined attribute values included in the layout. As an example, a layout may correspond to a panel of five presenter roles and a group audience role. Five of the audiovisual streams may be assigned to the five panel positions and all other audiovisual streams associated with the group audience role. Example templates may have corresponding visual locations, visual appearance modifications (e.g., a virtual “photo-stand-in” cutout, virtual picture frames, virtual backgrounds, etc.), timing characteristics, group shuffling characteristics, or other characteristics.
  • In some implementations, audio data associated with an audiovisual stream may be assigned to multiple different virtual audio locations. For example, for certain conferencing attributes, multiple virtual audio locations can be assigned. As one example, an example conferencing role may correspond to demonstration or teaching of a musical instrument. In such an instance, audio data from the audiovisual stream that corresponds to speech can be modified to come from a first virtual audio location (e.g., top-right) while audio data from the same audiovisual stream that corresponds to music can be modified to come from a second virtual audio location (e.g., center-center). Thus, in this and other examples, a video conferencing system can perform source separation on the audio data associated with an audiovisual stream to separate the audio data into first source audio data from a first audio source and second source audio data from a second audio source. The first source audio data and the second source audio data can be modified to come from different virtual audio locations. For example, source separation can be performed based on frequency-domain analysis.
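  • A full speech/music separator is beyond the scope of this description, but the sketch below illustrates the idea of splitting one stream's audio in the frequency domain so that each component can be panned to its own virtual location (for example, with the illustrative panner sketched earlier). The crude band-split mask is only a placeholder for a real source-separation model, and the band edges are arbitrary assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def split_speech_and_residual(audio: np.ndarray, sample_rate: int,
                              speech_band=(300.0, 3400.0)):
    """Rough frequency-domain split of a mono signal into a speech-band
    component and a residual component.

    Keeps energy inside the nominal speech band for one output and the rest
    for the other, then resynthesizes both with the inverse STFT. A real
    system would substitute a proper source-separation model here.
    """
    freqs, _, spectrum = stft(audio, fs=sample_rate, nperseg=1024)
    in_band = (freqs >= speech_band[0]) & (freqs <= speech_band[1])
    speech_spec = spectrum * in_band[:, None]
    residual_spec = spectrum * (~in_band)[:, None]
    _, speech = istft(speech_spec, fs=sample_rate, nperseg=1024)
    _, residual = istft(residual_spec, fs=sample_rate, nperseg=1024)
    return speech, residual
```

The speech component could then be routed to the virtual location assigned to the speaking role (e.g., top-right) and the residual to the location assigned to the instrument (e.g., center-center).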
  • In some example implementations, the virtual audio location associated with the audio data of an audiovisual stream may be correlated with the video location of the visual data of the audiovisual stream. For example, the video of a presenter may appear in a top-right of a participant's display screen while the corresponding audio is located in the top-right of the soundstage. However, in other implementations, the virtual audio location associated with the audio data of an audiovisual stream may be decorrelated with the video location of the visual data of the audiovisual stream. For example, the audio of a certain audiovisual stream may be moved around the soundstage, regardless of where the corresponding video is located on display screen(s).
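  • When the correlated behavior described above is desired, the virtual audio location can simply be derived from the position of the stream's video tile in the rendered layout. The sketch below assumes normalized screen coordinates with the origin at the top-left and maps a tile's center to the soundstage convention used in the earlier sketches; the names are illustrative.

```python
def location_from_tile(tile_x: float, tile_y: float,
                       tile_w: float, tile_h: float):
    """Map a video tile's normalized screen rectangle (origin at the top-left,
    all values in [0, 1]) to soundstage coordinates where x runs from -1
    (left) to +1 (right) and y runs from -1 (bottom) to +1 (top)."""
    center_x = tile_x + tile_w / 2.0
    center_y = tile_y + tile_h / 2.0
    return (2.0 * center_x - 1.0, 1.0 - 2.0 * center_y)
```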
  • The techniques described in the present disclosure (e.g., those attributed generally to a video conference system) can be performed at various different devices. As one example, the techniques described herein (e.g., determination of conferencing attribute(s) and resulting modification of audio data) can be performed at a server computing system that is facilitating the video conference. For example, this scenario may be advantageous when the audio modifications are consistent/uniform for all participants. As another example, the techniques described herein (e.g., determination of conferencing attribute(s) and resulting modification of audio data) can be performed at a client computing device (e.g., a device associated with one of the participants). For example, this scenario may be advantageous when the audio modifications are inconsistent and different for different participants, or when the user activates an ‘accessibility mode’ that distributes the sound sources on the soundstage in a manner that facilitates their comprehension, even when visual cues are not available. In one example, modification of audio data can be performed on a client computing device via a plug-in or other computer-readable code executed by a browser application executing a video conferencing web application. In another example, client-side operations can be performed in a dedicated video conferencing application.
  • Another example aspect of the present disclosure is directed to techniques to calibrate audio for a participant of a video conference. In particular, in one example, to calibrate audio for the participant, the video conferencing system can cause playback of audio data with an audio playback device while modifying a virtual audio location of the audio data within an audio soundstage. The system can receive input data provided by the participant of a video conference during modification of virtual audio location of the audio data within an audio soundstage and determine, based on the input data provided by the participant of the video conference, a physical location of the participant of the video conference relative to an audio playback device. The system can use the physical location of the participant of the video conference relative to an audio playback device to modify one or more other audio signals from other participants of the video conference to cause playback of the one or more other audio signals by the audio playback device to have a desired virtual location in an audio soundstage generated for the participant of the video conference during the video conference.
  • The present disclosure provides a number of technical effects and benefits. As one example technical effect and benefit, the systems and methods of the present disclosure enable improved audio understanding by participants in a multi-attendee video conference. More particularly, the present disclosure modifies audio data from some or all of the sources (e.g., each audiovisual stream) included in the video conference so that playback of the audio data has a virtual location within the audio soundstage that corresponds to its conference role or content type. In such fashion, participants of the video conference can easily identify and attribute the source of each audio signal included in the video conference. Improved and intuitive understanding of audio source attribution can reduce fatigue and provide a user experience that is more understandable and consistent, which may be particularly advantageous for users with visual impairments.
  • As another example technical effect, the systems and methods of the present disclosure also result in the conservation of computing resources. In particular, the systems and methods of the present disclosure enable participants of the video conference to identify and attribute each audio signal included in the video conference to a particular source more easily. This can reduce confusion in a video conference, which can reduce the length of the video conference as fewer misunderstandings may need to be clarified. Shorter video conferences can conserve computational resources such as processor usage, memory usage, network bandwidth, etc. Additionally, users may consume a video conference just as ‘listeners’, where this spatial distribution of audio based on content type, user role, and accessibility-settings may suffice to understand and follow the conference, saving bandwidth through the omission of visual data.
  • The techniques proposed herein may be of particular assistance to visually impaired users, who may not be able to visually determine the identity of the current speaker or audio source. Thus, as examples, by assigning virtual locations in audio space to user roles and content types, visually impaired users can tell whether a primary presenter or the audience was speaking and/or whether a sound belonged to shared content or an integrated application.
  • With reference now to the Figures, example embodiments of the present disclosure will be discussed in further detail.
  • FIG. 1 depicts an example client-server environment 100 according to example embodiments of the present disclosure. The client-server environment 100 includes a client computing device 102 and a server computing system 130 that are connected by and communicate through a network 180. Although a single client computing device 102 is depicted, any number of client computing devices 102 can be included in the client-server environment 100 and connect to server computing system 130 over a network 180.
  • In some example embodiments, the client computing device 102 can be any suitable device, including, but not limited to, a smartphone, a tablet, a laptop, a desktop computer, a gaming console, or any other computer device that is configured such that it can allow a user to participate in a video conference. The client computing device 102 can include one or more processor(s) 112, memory 114, an associated display device 120, a video conferencing application 122, a camera 124, a microphone 126, and an audio playback device 128 (e.g., speaker(s) such as stereo speakers).
  • The one or more processor(s) 112 can be any suitable processing device, such as a microprocessor, microcontroller, integrated circuit, or other suitable processing device. The memory 114 can include any suitable computing system or media, including, but not limited to, non-transitory computer-readable media, RAM, ROM, hard drives, flash drives, or other memory devices. The memory 114 can store information accessible by the one or more processor(s) 112, including instructions that can be executed by the one or more processor(s) 112. The instructions can be any set of instructions that when executed by the one or more processor(s) 112, cause the one or more processor(s) 112 to provide the desired functionality.
  • In particular, in some devices, memory 114 can store instructions for video conferencing between the client computing device 102 and the server computing system 130 (e.g., one or more video conferencing applications 122, etc.). The client computing device 102 can implement the instructions to execute aspects of the present disclosure, including directing communications with server computing system 130, providing a video conferencing application 122 and/or video stream to a user, scaling a received video stream to a different resolution display region, and/or generating and sending instructions to the server computing system requesting a new video stream for a display region.
  • It will be appreciated that the term “system” can refer to specialized hardware, computer logic that executes on a more general processor, or some combination thereof. Thus, a system can be implemented in hardware, application specific circuits, firmware, and/or software controlling a general-purpose processor. In one embodiment, the systems can be implemented as program code files stored on a storage device, loaded into memory and executed by a processor or can be provided from computer program products, for example computer executable instructions, that are stored in a tangible computer-readable storage medium such as RAM, hard disk, or optical or magnetic media.
  • Memory 114 can also include data 116, such as video conferencing data (e.g., captured at the client computing device 102 or received from the server computing system 130), that can be retrieved, manipulated, created, or stored by the one or more processor(s) 112. In some example embodiments, such data can be accessed and displayed to one or more users of the client computing device 102 during a video conference or transmitted to the server computing system 130.
  • The client computing device 102 can execute a video conferencing application 122. In one example, the video conferencing application 122 is a dedicated, purpose-built video conferencing application. In another example, the video conferencing application 122 is a browser application that executes computer-readable code locally (e.g., by processor(s) 112) to provide a video conference as a web application.
  • The video conferencing application 122 can capture visual data from a camera 124 and/or audio data from a microphone 126 and transmit that data to the server computing system 130. The client computing device 102 can receive, from the server computing system 130, audiovisual data (e.g., audio data and/or visual data) from one or more other participants of the video conference (e.g., other client computing devices 102). The client computing device 102 can then display the received visual data to users of the client computing device 102 on the associated display device 120 and/or cause playback of the received audio data to users of the client computing device 102 with the audio playback device 128. In some example embodiments, the camera 124 collects visual data from one or more users. The camera 124 can be any device capable of capturing visual data. The microphone 126 can be any device capable of capturing audio data. In one example, a webcam can serve as both a camera and a microphone.
  • In accordance with some example embodiments, the server computing system 130 can include one or more processor(s) 132, memory 134, and a video conferencing system 140. The memory 134 can store information accessible by the one or more processor(s) 132, including data 136 and instructions 138 that can be executed by the one or more processor(s) 132.
  • The server computing system 130 can be in communication with one or more client computing device(s) 102 using a network communication device that is not pictured. The network 180 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof. In general, communication between the client computing device 102 and the server computing system 130 can be carried via network interface using any type of wired and/or wireless connection, using a variety of communication protocols (e.g., TCP/IP, HTTP, RTP, RTCP, etc.), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).
  • The server computing system 130 can include a video conferencing system 140. In some implementations, the video conferencing system 140 can be configured to facilitate operation of the video conferencing application 122 executed by one or more client computing devices 102. As an example, the video conferencing system 140 can receive audiovisual streams from a plurality of client computing devices 102 (e.g., via network 180) respectively associated with a plurality of video conference attendees. The video conferencing system 140 can provide the audiovisual streams to each of the client computing devices 102.
  • The video conferencing application 122 and/or the video conferencing system 140 can operate independently or collaboratively to perform any of the techniques described herein.
  • FIGS. 2A and 2B depict spatial audio modulation based on grouping of audiovisual streams according to example embodiments of the present disclosure. In particular, FIG. 2A shows a base user interface 200 for a video conference application. The user interface 200 displays visual data from a plurality of audiovisual streams respectively associated with a plurality of participants in a video conference. Each audiovisual stream can include audio data and visual data. In some implementations, some or all of the participants may be human participants. For example, the visual data can correspond to video that depicts the human participant while the audio data can correspond to audio captured in the environment in which the human participant is located. For example, regions 202 and 204 of the user interface correspond to video that depicts two different human participants of the video conference.
  • In some implementations, some of the participants (not shown) may correspond to content that is being shared among some or all of the other participants. For example, an audiovisual stream can correspond to a shared display or other shared content (e.g., shared by a specific human participant from their device or shared from a third-party source or integration). In another example, one audiovisual stream may correspond to multiple human participants (e.g., multiple humans located in a same room using one set of audiovisual equipment).
  • In some implementations, an audiovisual stream (e.g., a display stream shared by a participant) may include dynamic visual data while the audio data for the stream is null or blank. In other implementations, an audiovisual stream may include dynamic audio data while the visual data for the stream is null or blank (e.g., as in the case of a human participant who has their video “turned off”). Thus, as used herein, the term audiovisual stream generally refers to defined streams of content which can include audio and/or video. Multiple streams of content may originate from the same device (e.g., as in the case of a user having a first audiovisual stream for their video/audio presence and a second audiovisual stream which shares content from their device to the other participants).
  • In some implementations of the base user interface 200, playback of the audio data associated with the audiovisual streams can consistently come from the same virtual location on a soundstage (e.g., center-center). In other implementations of the base user interface 200, playback of audio data from each audiovisual stream can come from a respective virtual location that is correlated with the location of the corresponding visual data within the user interface 200 (e.g., playback of audio data associated with the visual data contained in region 204 may have a virtual location in the top-right of the soundstage).
  • According to an aspect of the present disclosure, the video conferencing system can determine a virtual audio location for each audiovisual stream within an audio soundstage based at least in part on the conferencing attribute determined for the audiovisual stream. The video conferencing system can modify the audio data associated with each audiovisual stream to cause playback of the audio data to have the virtual audio location within the audio soundstage that was determined for the audiovisual stream. The video conferencing system can then provide the plurality of audiovisual streams having the modified audio data for audiovisual playback in the video conference. This framework can be used to effectuate a number of different use-cases or example applications or user experiences.
  • Specifically, with reference now to FIG. 2B, in some implementations, the conferencing attribute determined for each audiovisual stream can be descriptive of assignment of each audiovisual stream to one of a plurality of different groupings of the plurality of audiovisual streams. For example, each audiovisual stream can be assigned (e.g., automatically and/or by a participant or moderator) to one of a number of different groups. Each group may be assigned a different virtual audio location in the audio soundstage. Then, the audio from any audiovisual stream can be modified so as to come from the virtual audio location assigned to the group in which the audiovisual stream is currently assigned/included. In such fashion, breakout rooms or multiple sub-meetings can occur within the same video conference, while different virtual audio locations are used to enable participants to distinguish among the audio (e.g., conversations) occurring in each sub-meeting. This example use may allow interactive events such as networking events or casual get-togethers to occur in the same video conference, with users being able to move in and among different sub-meetings to join different discussions or conversations.
  • FIG. 2B illustrates an example user interface 250 in which the audiovisual streams have been assigned to groups. Specifically, simply as an example, three groups have been generated, with four audiovisual streams assigned to each group. Each group may be assigned a different virtual audio location in the audio soundstage. For example, group 252 may be assigned a virtual audio location in the top-left of the audio soundstage.
  • Then, the audio from any audiovisual stream can be modified so as to come from the virtual audio location assigned to the group in which the audiovisual stream is currently assigned/included. For example, playback of the audio from the audiovisual stream shown at 254 can come from the virtual location assigned to group 252.
  • Thus, spatial modulation of sound can be used to indicate group affiliation in the main call. For example, before or instead of breaking away from a larger video conference meeting into sub-meetings (breakout rooms), users can be grouped on the screen in different two-dimensional positions. The sound of users in that group can be modulated in three-dimensional space to come directly from that direction. This allows multiple groups to talk simultaneously, but users can easily distinguish and find their group on screen by following the modulated sound related to the screen-position of their group.
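  • One way to realize this group behavior is to mix every stream through the virtual location assigned to its group, as in the sketch below, which reuses the illustrative constant-power panner from the earlier sketch and assumes each stream has already been decoded to a mono buffer of equal length. The group names and placements are arbitrary assumptions.

```python
import numpy as np

# Illustrative group placements on the soundstage (azimuth only).
GROUP_AZIMUTH = {"group_a": -0.8, "group_b": 0.0, "group_c": 0.8}

def mix_grouped_streams(mono_streams: dict, stream_to_group: dict) -> np.ndarray:
    """Mix mono buffers (stream_id -> equal-length np.ndarray) into one stereo
    buffer, panning each stream to its group's virtual audio location.

    Relies on pan_to_virtual_location, the constant-power panner sketched
    earlier in this description.
    """
    length = len(next(iter(mono_streams.values())))
    mix = np.zeros((length, 2))
    for stream_id, mono in mono_streams.items():
        azimuth = GROUP_AZIMUTH.get(stream_to_group.get(stream_id), 0.0)
        mix += pan_to_virtual_location(mono, azimuth)
    return mix
```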
  • As another example application, FIG. 3 depicts an example of spatial audio modulation based on content type of audiovisual streams according to example embodiments of the present disclosure. Specifically, in FIG. 3, audiovisual streams can be assigned (potentially among other possible designations) as a presentation content type and a presenter participant role. Thus, audio associated with the audiovisual stream 302 that has been determined to be a presentation content type can be modified so as to come from a certain virtual audio location associated with presentation content (e.g., center-left) while audio associated with the audiovisual stream 304 that has been determined to be a presenter participant role can be modified so as to come from a different virtual audio location associated with the presenter (e.g., top-right). Additionally, sound coming from another source, such as a live-stream from another event or platform that is being included in the video conference, could be modulated to come from the bottom-right.
  • Thus, spatial modulation of sound can be performed based on content type. This may provide accessibility or other benefits for persons with disabilities. For example, when people present something in a video conference, the layout can be split: the presentation can be shown on one side (e.g., left), the presenter on the other (e.g., right), and the audience in a different position. To allow clear sound distinction, example implementations can modulate the sound of the presentation (e.g., a presented video) to come from a different direction than the presenter. Using spatial modulation, the presenter and the presented content can be separated audibly and can be mixed separately. For example, the video conference system can boost the presenter's voice while filtering harsh sounds from the presented material. The audience can be allocated yet another space in the 3D soundscape. Voices belonging to people in this group can be modulated differently from the presentation and the presenter and are thus easy to identify. Thus, users can identify the content type through sound modulation. Users can focus their attention on a specific type of content, while still being able to listen to other content.
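  • The separate mixing mentioned above can be as simple as applying a different processing chain per content type before spatial placement. The sketch below boosts the presenter's voice slightly and low-passes the presentation audio to soften harsh sounds; the gain, cutoff frequency, and filter choice are arbitrary illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def mix_presenter_and_presentation(presenter: np.ndarray,
                                   presentation: np.ndarray,
                                   sample_rate: int):
    """Apply per-content-type processing before spatialization.

    Boosts the presenter's voice by about 3 dB and low-passes the
    presentation audio at 6 kHz to tame harsh transients. Returns the two
    processed mono signals, which can then be panned to their respective
    virtual audio locations.
    """
    boosted = presenter * 10 ** (3.0 / 20.0)  # +3 dB voice boost
    sos = butter(4, 6000.0, btype="lowpass", fs=sample_rate, output="sos")
    softened = sosfilt(sos, presentation)
    return boosted, softened
```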
  • As another example application, FIG. 4 depicts an example of spatial audio modulation based on participant role of audiovisual streams according to example embodiments of the present disclosure. Specifically, in FIG. 4 audiovisual streams can be assigned (potentially among other possible designations) as a presenter participant role, an audience participant role, and a translator participant role. Thus, audio associated with the audiovisual stream 402 that has been determined to be a presenter role can be modified so as to come from a certain virtual audio location associated with presenters (e.g., top-left) while audio associated with an audiovisual stream 404 that has been determined to be an audience participant role can be modified so as to come from a different virtual audio location associated with the audience (e.g., bottom-left). Likewise, audio associated with the audiovisual stream 406 that has been determined to be a translator role can be modified so as to come from a certain virtual audio location associated with translators (e.g., bottom-right).
  • Thus, spatial modulation of sound can be performed based on participant role. This may provide accessibility or other benefits for persons with disabilities. For example, a specific screen position can be reserved for one or more key persons in the meeting. These streams can then be associated with a two- or three-dimensional virtual sound position. Users with visual impairment will be able to distinguish a key person's voice and can tell from the specific two- or three-dimensional audio coordinates that the person is important, for example that they are the presenter or CEO currently talking, without needing to identify them via visual means.
  • Two- or three-dimensional sound modulation can be used to assign distinguishable roles to different types of video conference participants. As one example, sounds from the teacher in a classroom may always come from the top of the sound space while sound from the students may come from the bottom of the sound space. As another example, sound from people in a panel may always come from the top of the sound space while audience questions always come from the bottom of the sound space.
  • Thus, in some implementations, multiple audiovisual streams can be designated as presenter participant roles (e.g., when a “panel” of presenters is present). This scenario is illustrated in FIG. 5, in which four streams (e.g., including stream 502) have been designated as an expert panel and all other audiovisual streams (e.g., including stream 504) have been associated with the group audience role.
  • As another example application, FIG. 6 depicts an example of spatial audio modulation based on content type of audiovisual streams according to example embodiments of the present disclosure. For example, predefined attribute values which can be assigned to audiovisual streams can include at least a captioned content type and a non-captioned content type. Thus, audio associated with an audiovisual stream that has been determined to be non-captioned audio can be left unmodified so as to come from the location from which such audio would otherwise have come, while audio associated with an audiovisual stream that has been determined to be captioned audio can be modified so as to come from a particular virtual audio location associated with captioned audio (e.g., bottom-center). For example, in FIG. 6, the audiovisual stream depicted in region 602 is being captioned (e.g., as shown by caption 604). Thus, audio from the audiovisual stream depicted in region 602 can be modified so that playback of the audio comes from a bottom-center virtual location on the soundstage (e.g., regardless of the visual location of the audiovisual stream depicted in region 602). In other implementations, captioned audio may be modulated to come from the location that it would have come from if it were not captioned.
  • Another example aspect of the present disclosure is directed to techniques to calibrate audio for a participant of a video conference. In particular, in one example, to calibrate audio for the participant, the video conferencing system can cause playback of audio data with an audio playback device while modifying a virtual audio location of the audio data within an audio soundstage. The system can receive input data provided by the participant of a video conference during modification of virtual audio location of the audio data within an audio soundstage and determine, based on the input data provided by the participant of the video conference, a physical location of the participant of the video conference relative to an audio playback device. The system can use the physical location of the participant of the video conference relative to an audio playback device to modify one or more other audio signals from other participants of the video conference to cause playback of the one or more other audio signals by the audio playback device to have a desired virtual location in an audio soundstage generated for the participant of the video conference during the video conference.
  • More particularly, users often do not sit in the center of their screen, and it may not be safe to assume that the user's audio playback device (e.g., speaker(s)) is placed in the same location as the display. Thus, example implementations can use three-dimensional sound panning to help the user position themselves correctly in front of the screen. The video conference system can pan sound in three dimensions and the user can indicate where the sound is coming from. For example, the video conference system can spatially manipulate the sound in three dimensions and then the user can indicate when or where the user felt that the audio was centrally located from the user's perspective.
  • The video conference system can evaluate the response and thus define the position of the user. The video conference system can use this information to modulate three-dimensional sound in the ensuing meeting. Using the three-dimensional sound manipulation, the user's position can be corrected. While the user may still sit in a different place, the video conference system can compensate for this by realigning the soundstage.
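  • A minimal sketch of this calibration flow follows: the system sweeps a test tone across azimuths, the participant marks the position at which the sound seemed centered, and the resulting offset is applied to every subsequent virtual location. The playback and input helpers (play_panned_tone, user_marked_center) are hypothetical placeholders for whatever audio output and user-input mechanisms the client provides.

```python
import numpy as np

def calibrate_listener_offset(play_panned_tone, user_marked_center,
                              steps: int = 21) -> float:
    """Estimate where the listener actually sits relative to the speakers.

    play_panned_tone(azimuth): hypothetical helper that plays a short tone
        panned to the given azimuth (-1 left .. +1 right).
    user_marked_center(): hypothetical helper returning True if the
        participant indicated that the tone sounded centered.
    Returns the azimuth offset to subtract from future virtual locations.
    """
    marked = []
    for azimuth in np.linspace(-1.0, 1.0, steps):
        play_panned_tone(azimuth)
        if user_marked_center():
            marked.append(azimuth)
    # If the participant never responded, assume they are centered.
    return float(np.mean(marked)) if marked else 0.0

def apply_listener_offset(azimuth: float, offset: float) -> float:
    """Shift a desired virtual location to compensate for the listener's seat."""
    return float(np.clip(azimuth - offset, -1.0, 1.0))
```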
  • Another example aspect of the present disclosure is directed to techniques that use signal processing to generate at least one feature parameter extracted from the audio data and/or the video data and use that feature parameter to determine the virtual audio location for the first audiovisual stream. In particular, in one example, to extract the feature parameter, the audio data and/or video data are processed using signal processing techniques (e.g., voice recognition techniques and/or image recognition techniques that identify the primary speaker in the respective audiovisual stream) to generate one or more feature parameters indicative of particular conferencing attributes. According to this aspect, the conferencing attribute for the first audiovisual stream can thus be determined directly from the audio data and/or the video data of the first audiovisual stream. The first virtual audio location for the first audiovisual stream can then be determined by evaluating the at least one feature parameter. For example, if, according to the extracted feature parameter, the primary speaker is identified as the presenter 304 of the presentation 302, the first virtual audio location will be determined to be the top-right location in FIG. 3. To modify the audio data associated with the first audiovisual stream to cause playback of the audio data to have the first virtual audio location within the audio soundstage, a location characteristic is applied to the audio data associated with the first audiovisual stream based on the first virtual audio location. For example, the audio data are transformed using signal processing techniques to impart the first virtual audio location to the audio data of the first audiovisual stream so that, during playback, the listener perceives the first audiovisual stream as coming from the first virtual audio location.
  • The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.
  • While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation, not limitation of the disclosure. Those skilled in the art, upon attaining an understanding of the foregoing, can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the subject disclosure does not preclude inclusion of such modifications, variations and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that the present disclosure cover such alterations, variations, and equivalents.

Claims (20)

1. A computer-implemented method for providing spatial audio within a videoconferencing application, the method comprising:
receiving, by a computing system comprising one or more computing devices, a plurality of audiovisual streams respectively associated with a plurality of participants in a video conference, wherein each audiovisual stream comprises audio data and visual data;
for at least a first audiovisual stream of the plurality of audiovisual streams:
determining, by the computing system, a conferencing attribute for the first audiovisual stream, wherein the conferencing attribute is descriptive of one or both of: a content type associated with the first audiovisual stream or a participant role associated with the first audiovisual stream;
determining, by the computing system, a first virtual audio location for the first audiovisual stream within an audio soundstage based at least in part on the conferencing attribute; and
modifying, by the computing system, the audio data associated with the first audiovisual stream to cause playback of the audio data to have the first virtual audio location within the audio soundstage; and
providing, by the computing system, the plurality of audiovisual streams including the first audiovisual stream having the modified audio data for audiovisual playback in the video conference.
2. The computer-implemented method of claim 1, wherein:
the conferencing attribute comprises one of a plurality of predefined attribute values; and
the plurality of predefined attribute values comprise at least a presentation content type and a presenter participant role.
3. The computer-implemented method of claim 1, wherein:
the conferencing attribute comprises one of a plurality of predefined attribute values; and
the plurality of predefined attribute values comprise at least a presenter participant role and an audience participant role.
4. The computer-implemented method of claim 1, wherein:
the conferencing attribute comprises one of a plurality of predefined attribute values; and
the plurality of predefined attribute values comprise at least a primary speaker participant role and a translator participant role.
5. The computer-implemented method of claim 1, wherein:
the conferencing attribute comprises one of a plurality of predefined attribute values; and
the plurality of predefined attribute values comprise at least a captioned content type and a non-captioned content type.
6. The computer-implemented method of claim 1, wherein the conferencing attribute is descriptive of assignment of the first audiovisual stream to one of a plurality of different groupings of the plurality of audiovisual streams.
7. The computer-implemented method of claim 1, wherein the conferencing attribute is dynamic and manually controllable by a moderator of the video conference.
8. The computer-implemented method of claim 1, wherein the conferencing attribute for the first audiovisual stream is specific to each participant and manually controllable by each participant of the video conference.
9. The computer-implemented method of claim 1, wherein the conference attribute for the first audiovisual stream is defined within a calendar invitation associated with the video conference.
10. The computer-implemented method of claim 1, wherein:
the conferencing attribute comprises one of a plurality of predefined attribute values; and
the plurality of predefined attribute values are associated with and a function of a predefined template layout selected for the video conference.
11. The computer-implemented method of claim 1, wherein determining, by the computing system, the conferencing attribute for the first audiovisual stream comprises automatically determining, by the computing system, the conferencing attribute for the first audiovisual stream.
12. The computer-implemented method of claim 11, wherein automatically determining, by the computing system, the conferencing attribute for the first audiovisual stream comprises:
recognizing, by the computing system, text in visual data included in one of the audiovisual streams;
performing, by the computing system, speech-to-text to generate text from audio data included in another of the audiovisual streams; and
identifying, by the computing system, the another of the audiovisual streams as a presenter participant role when the text generated from audio data matches the text in the visual data.
13. The computer-implemented method of claim 1, wherein:
determining, by the computing system, the first virtual audio location for the first audiovisual stream within the audio soundstage based at least in part on the conferencing attribute comprises determining, by the computing system, both the first virtual audio location and a second virtual audio location for the first audiovisual stream within the audio soundstage based at least in part on the conferencing attribute; and
modifying, by the computing system, the audio data associated with the first audiovisual stream to cause playback of the audio data to have the first virtual audio location within the audio soundstage comprises:
performing, by the computing system, source separation on the audio data associated with the first audiovisual stream to separate the audio data into first source audio data from a first audio source and second source audio data from a second audio source;
modifying, by the computing system, the first source audio data to cause playback of the first source audio data to have the first virtual audio location within the audio soundstage; and
modifying, by the computing system, the second source audio data to cause playback of the second source audio data to have the second virtual audio location within the audio soundstage.
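A minimal sketch of the per-source placement recited in claim 13 follows, assuming a hypothetical separate_sources helper standing in for any source-separation technique, and simple constant-power stereo panning with NumPy; a real system might instead use HRTF-based binaural rendering. The helper names and the panning approach are assumptions for illustration only.

```python
import numpy as np
from typing import Tuple

def pan_to_azimuth(mono: np.ndarray, azimuth_deg: float) -> np.ndarray:
    """Constant-power pan of a mono signal to a virtual azimuth (-90 = far left, +90 = far right)."""
    # Map azimuth from [-90, 90] degrees onto a pan angle in [0, 90] degrees.
    pan_angle = np.deg2rad((np.clip(azimuth_deg, -90.0, 90.0) + 90.0) / 2.0)
    left = np.cos(pan_angle) * mono
    right = np.sin(pan_angle) * mono
    return np.stack([left, right], axis=-1)  # shape: (num_samples, 2)

def separate_sources(mixed: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Hypothetical placeholder for any source-separation technique (e.g., voice vs. music)."""
    raise NotImplementedError

def place_separated_sources(mixed: np.ndarray,
                            first_azimuth_deg: float,
                            second_azimuth_deg: float) -> np.ndarray:
    """Split a stream's audio into two sources and render each at its own virtual location."""
    first_source, second_source = separate_sources(mixed)
    return (pan_to_azimuth(first_source, first_azimuth_deg)
            + pan_to_azimuth(second_source, second_azimuth_deg))
```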
14. The computer-implemented method of claim 1, wherein the first virtual audio location of the first audiovisual stream is decorrelated with a video location of the first audiovisual stream.
15. The computer-implemented method of claim 1, wherein the computing system consists of a server computing system and the computer-implemented method is performed at the server computing system.
16. The computer-implemented method of claim 1, wherein the computing system consists of a client computing device associated with one of the participants and the computer-implemented method is performed at the client computing device.
17. A computing system, comprising:
one or more processors; and
one or more non-transitory, computer-readable media that store instructions that when executed by the one or more processors cause the computing system to perform operations, the operations comprising:
receiving, by the computing system, a plurality of audiovisual streams respectively associated with a plurality of participants in a video conference, wherein each audiovisual stream comprises audio data and visual data;
for at least a first audiovisual stream of the plurality of audiovisual streams:
determining, by the computing system, a conferencing attribute for the first audiovisual stream, wherein the conferencing attribute is descriptive of one or both of: a content type associated with the first audiovisual stream or a participant role associated with the first audiovisual stream;
determining, by the computing system, a first virtual audio location for the first audiovisual stream within an audio soundstage based at least in part on the conferencing attribute; and
modifying, by the computing system, the audio data associated with the first audiovisual stream to cause playback of the audio data to have the first virtual audio location within the audio soundstage; and
providing, by the computing system, the plurality of audiovisual streams including the first audiovisual stream having the modified audio data for audiovisual playback in the video conference.
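Purely to illustrate how the operations recited in claim 17 could fit together, and not as the claimed system, the following Python skeleton processes each received audiovisual stream in turn. The determine_conferencing_attribute, virtual_location_for, and spatialize helpers are hypothetical placeholders for the attribute determination, location selection, and audio modification steps.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class AudiovisualStream:
    participant_id: str
    audio: Any   # e.g., PCM samples
    video: Any   # e.g., video frames

def determine_conferencing_attribute(stream: AudiovisualStream) -> str:
    """Hypothetical: return a content type or participant role for the stream."""
    raise NotImplementedError

def virtual_location_for(attribute: str):
    """Hypothetical: map a conferencing attribute to a virtual audio location."""
    raise NotImplementedError

def spatialize(audio, location):
    """Hypothetical: modify audio so playback is perceived at the virtual location."""
    raise NotImplementedError

def process_conference(streams: List[AudiovisualStream]) -> List[AudiovisualStream]:
    """Assign each stream a virtual audio location and modify its audio accordingly."""
    processed = []
    for stream in streams:
        attribute = determine_conferencing_attribute(stream)
        location = virtual_location_for(attribute)
        stream.audio = spatialize(stream.audio, location)
        processed.append(stream)
    # The processed streams are then provided for audiovisual playback in the video conference.
    return processed
```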
18. The computing system of claim 17, wherein:
the conferencing attribute comprises one of a plurality of predefined attribute values; and
the plurality of predefined attribute values comprise:
a presentation content type;
a presenter participant role;
an audience participant role;
a primary speaker participant role;
a translator participant role;
a captioned content type; or
a non-captioned content type.
19. The computing system of claim 17, wherein the conferencing attribute is descriptive of assignment of the first audiovisual stream to one of a plurality of different groupings of the plurality of audiovisual streams.
20. One or more non-transitory computer-readable media that store instructions that, when executed by one or more processors of a computing system, cause the computing system to perform operations to calibrate audio for a participant of a video conference, the operations comprising:
causing playback of audio data with an audio playback device while modifying a virtual audio location of the audio data within an audio soundstage;
receiving input data provided by the participant of the video conference during modification of the virtual audio location of the audio data within the audio soundstage;
determining, based on the input data provided by the participant of the video conference, a physical location of the participant of the video conference relative to the audio playback device; and
using the physical location of the participant of the video conference relative to the audio playback device to modify one or more other audio signals from other participants of the video conference to cause playback of the one or more other audio signals by the audio playback device to have a desired virtual location in an audio soundstage generated for the participant of the video conference during the video conference.
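To illustrate the calibration operations of claim 20, the following sketch sweeps a virtual azimuth, records the azimuth at which the participant reports the test sound as centered, and uses the resulting offset to compensate other participants' audio. This is a sketch under stated assumptions rather than the claimed implementation; the playback and input-collection helpers are hypothetical.

```python
from typing import Iterable, Optional

def play_at_azimuth(test_audio, azimuth_deg: float) -> None:
    """Hypothetical: play a test signal with its virtual location at the given azimuth."""
    raise NotImplementedError

def participant_reports_centered() -> bool:
    """Hypothetical: return True when the participant indicates the sound appears centered."""
    raise NotImplementedError

def estimate_listener_offset(test_audio, sweep_deg: Iterable[float]) -> Optional[float]:
    """Sweep the virtual azimuth and return the azimuth the participant perceives as centered.

    A non-zero result suggests the participant is off-axis relative to the audio
    playback device (e.g., seated to one side of a laptop's speakers).
    """
    for azimuth in sweep_deg:
        play_at_azimuth(test_audio, azimuth)
        if participant_reports_centered():
            return azimuth
    return None

def compensate_azimuth(desired_azimuth_deg: float, listener_offset_deg: float) -> float:
    """Shift a desired virtual location so the off-axis listener perceives it as intended."""
    return desired_azimuth_deg - listener_offset_deg
```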
US17/339,226 2021-06-04 2021-06-04 Spatial audio in video conference calls based on content type or participant role Active US11540078B1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US17/339,226 US11540078B1 (en) 2021-06-04 2021-06-04 Spatial audio in video conference calls based on content type or participant role
EP22743959.3A EP4248645A2 (en) 2021-06-04 2022-06-03 Spatial audio in video conference calls based on content type or participant role
PCT/US2022/032040 WO2022256585A2 (en) 2021-06-04 2022-06-03 Spatial audio in video conference calls based on content type or participant role
CN202280018870.4A CN117321984A (en) 2021-06-04 2022-06-03 Spatial audio in video conference calls based on content type or participant roles

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US17/339,226 US11540078B1 (en) 2021-06-04 2021-06-04 Spatial audio in video conference calls based on content type or participant role

Publications (2)

Publication Number Publication Date
US20220394413A1 true US20220394413A1 (en) 2022-12-08
US11540078B1 US11540078B1 (en) 2022-12-27

Family

ID=82608152

Family Applications (1)

Application Number Title Priority Date Filing Date
US17/339,226 Active US11540078B1 (en) 2021-06-04 2021-06-04 Spatial audio in video conference calls based on content type or participant role

Country Status (4)

Country Link
US (1) US11540078B1 (en)
EP (1) EP4248645A2 (en)
CN (1) CN117321984A (en)
WO (1) WO2022256585A2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230008964A1 (en) * 2021-07-06 2023-01-12 Meta Platforms, Inc. User-configurable spatial audio based conferencing system
US20230384914A1 (en) * 2022-05-28 2023-11-30 Microsoft Technology Licensing, Llc Meeting accessibility staging system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100238265A1 (en) * 2004-04-21 2010-09-23 Telepresence Technologies, Llc Telepresence Systems and Methods Therefore
WO2010120303A2 (en) * 2009-04-16 2010-10-21 Hewlett-Packard Development Company, L.P. Managing shared content in virtual collaboration systems
US20110283008A1 (en) * 2010-05-13 2011-11-17 Vladimir Smelyansky Video Class Room
WO2014010472A1 (en) * 2012-07-09 2014-01-16 Nissan Motor Co., Ltd. Automobile hood
US9525830B1 (en) * 2015-11-12 2016-12-20 Captioncall Llc Captioning communication systems
US20170353694A1 (en) * 2016-06-03 2017-12-07 Avaya Inc. Positional controlled muting
US20180341374A1 (en) * 2017-05-26 2018-11-29 Microsoft Technology Licensing, Llc Populating a share-tray with content items that are identified as salient to a conference session
US20190007467A1 (en) * 2017-06-29 2019-01-03 Cisco Technology, Inc. Files automatically shared at conference initiation
US10924709B1 (en) * 2019-12-27 2021-02-16 Microsoft Technology Licensing, Llc Dynamically controlled view states for improved engagement during communication sessions

Family Cites Families (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6850252B1 (en) 1999-10-05 2005-02-01 Steven M. Hoffberg Intelligent electronic appliance system and method
US7012630B2 (en) 1996-02-08 2006-03-14 Verizon Services Corp. Spatial sound conference system and apparatus
GB2349055B (en) 1999-04-16 2004-03-24 Mitel Corp Virtual meeting rooms with spatial audio
US7190775B2 (en) 2003-10-29 2007-03-13 Broadcom Corporation High quality audio conferencing with adaptive beamforming
US7688345B2 (en) 2004-10-15 2010-03-30 Lifesize Communications, Inc. Audio output in video conferencing and speakerphone based on call type
TW200743385A (en) 2006-05-05 2007-11-16 Amtran Technology Co Ltd Method of audio-visual communication using television and television using the same
US20070070177A1 (en) 2005-07-01 2007-03-29 Christensen Dennis G Visual and aural perspective management for enhanced interactive video telepresence
NO20071401L (en) 2007-03-16 2008-09-17 Tandberg Telecom As System and arrangement for lifelike video communication
KR101742256B1 (en) 2007-09-26 2017-05-31 에이큐 미디어 인크 Audio-visual navigation and communication
US8237771B2 (en) 2009-03-26 2012-08-07 Eastman Kodak Company Automated videography based communications
US8351589B2 (en) 2009-06-16 2013-01-08 Microsoft Corporation Spatial audio for audio conferencing
US10326978B2 (en) 2010-06-30 2019-06-18 Warner Bros. Entertainment Inc. Method and apparatus for generating virtual or augmented reality presentations with 3D audio positioning
US8755432B2 (en) 2010-06-30 2014-06-17 Warner Bros. Entertainment Inc. Method and apparatus for generating 3D audio positioning using dynamically optimized audio 3D space perception cues
US8848028B2 (en) 2010-10-25 2014-09-30 Dell Products L.P. Audio cues for multi-party videoconferencing on an information handling system
US20120216129A1 (en) 2011-02-17 2012-08-23 Ng Hock M Method and apparatus for providing an immersive meeting experience for remote meeting participants
US8681203B1 (en) 2012-08-20 2014-03-25 Google Inc. Automatic mute control for video conferencing
US9368117B2 (en) 2012-11-14 2016-06-14 Qualcomm Incorporated Device and system having smart directional conferencing
CN104469256B (en) 2013-09-22 2019-04-23 思科技术公司 Immersion and interactive video conference room environment
US9318121B2 (en) 2014-04-21 2016-04-19 Sony Corporation Method and system for processing audio data of video content
US9402054B2 (en) 2014-12-08 2016-07-26 Blue Jeans Network Provision of video conference services
US10522151B2 (en) 2015-02-03 2019-12-31 Dolby Laboratories Licensing Corporation Conference segmentation based on conversational dynamics
US10497382B2 (en) 2016-12-16 2019-12-03 Google Llc Associating faces with voices for speaker diarization within videos
US11539844B2 (en) 2018-09-21 2022-12-27 Dolby Laboratories Licensing Corporation Audio conferencing using a distributed array of smartphones
US10986301B1 (en) 2019-03-26 2021-04-20 Holger Schanz Participant overlay and audio placement collaboration system platform and method for overlaying representations of participants collaborating by way of a user interface and representational placement of distinct audio sources as isolated participants
US11849196B2 (en) 2019-09-11 2023-12-19 Educational Vision Technologies, Inc. Automatic data extraction and conversion of video/images/sound information from a slide presentation into an editable notetaking resource with optional overlay of the presenter

Also Published As

Publication number Publication date
WO2022256585A2 (en) 2022-12-08
EP4248645A2 (en) 2023-09-27
CN117321984A (en) 2023-12-29
US11540078B1 (en) 2022-12-27
WO2022256585A3 (en) 2023-01-12

Similar Documents

Publication Publication Date Title
Vertegaal The GAZE groupware system: mediating joint attention in multiparty communication and collaboration
CN110113316B (en) Conference access method, device, equipment and computer readable storage medium
WO2021143315A1 (en) Scene interaction method and apparatus, electronic device, and computer storage medium
US8243116B2 (en) Method and system for modifying non-verbal behavior for social appropriateness in video conferencing and other computer mediated communications
CA2757847C (en) System and method for hybrid course instruction
WO2022256585A2 (en) Spatial audio in video conference calls based on content type or participant role
WO2004010414A1 (en) Method and apparatus for improving listener differentiation of talkers during a conference call
JP2018036690A (en) One-versus-many communication system, and program
US9438859B2 (en) Method and device for controlling a conference
Chen Conveying conversational cues through video
US11849257B2 (en) Video conferencing systems featuring multiple spatial interaction modes
Wong et al. Shared-space: Spatial audio and video layouts for videoconferencing in a virtual room
JP2005055846A (en) Remote educational communication system
JP2000231644A (en) Speaker, specifying method for virtual space and recording medium where program thereof is recorded
WO2022253856A2 (en) Virtual interaction system
US11637991B2 (en) Video conferencing systems featuring multiple spatial interaction modes
Vinnikov et al. Gaze-contingent auditory displays for improved spatial attention in virtual reality
US20230362571A1 (en) Information processing device, information processing terminal, information processing method, and program
KR20150087017A (en) Audio control device based on eye-tracking and method for visual communications using the device
WO2021006303A1 (en) Translation system, translation device, translation method, and translation program
Aguilera et al. Spatial audio for audioconferencing in mobile devices: Investigating the importance of virtual mobility and private communication and optimizations
WO2023249005A1 (en) Screen synthesis method using web conference system
JP7292343B2 (en) Information processing device, information processing method and information processing program
Davat et al. Integrating Socio-Affective Information in Physical Perception aimed to Telepresence Robots
Kilgore et al. The Vocal Village: enhancing collaboration with spatialized audio

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

AS Assignment

Owner name: GOOGLE LLC, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SEIPP, KARSTEN;VOLKOV, ANTON;PARK, JAE PUM;SIGNING DATES FROM 20210903 TO 20210921;REEL/FRAME:057548/0873

STCF Information on status: patent grant

Free format text: PATENTED CASE