WO2022173684A1 - Echo reference generation and echo reference metric estimation according to rendering information - Google Patents
Echo reference generation and echo reference metric estimation according to rendering information
- Publication number
- WO2022173684A1 (PCT/US2022/015436)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- echo
- audio
- examples
- references
- echo reference
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K11/00—Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/16—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/175—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
- G10K11/178—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound by electro-acoustically regenerating the original acoustic waves in anti-phase
- G10K11/1781—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound by electro-acoustically regenerating the original acoustic waves in anti-phase characterised by the analysis of input or output signals, e.g. frequency range, modes, transfer functions
- G10K11/17821—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound by electro-acoustically regenerating the original acoustic waves in anti-phase characterised by the analysis of input or output signals, e.g. frequency range, modes, transfer functions characterised by the analysis of the input signals only
- G10K11/17823—Reference signals, e.g. ambient acoustic environment
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M9/00—Arrangements for interconnection not involving centralised switching
- H04M9/08—Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
- H04M9/082—Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic using echo cancellers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K11/00—Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/16—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/175—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
- G10K11/178—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound by electro-acoustically regenerating the original acoustic waves in anti-phase
- G10K11/1787—General system configurations
- G10K11/17873—General system configurations using a reference signal without an error signal, e.g. pure feedforward
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K2210/00—Details of active noise control [ANC] covered by G10K11/178 but not provided for in any of its subgroups
- G10K2210/30—Means
- G10K2210/301—Computational
- G10K2210/3023—Estimation of noise, e.g. on error signals
- G10K2210/30231—Sources, e.g. identifying noisy processes or components
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K2210/00—Details of active noise control [ANC] covered by G10K11/178 but not provided for in any of its subgroups
- G10K2210/30—Means
- G10K2210/301—Computational
- G10K2210/3027—Feedforward
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K2210/00—Details of active noise control [ANC] covered by G10K11/178 but not provided for in any of its subgroups
- G10K2210/50—Miscellaneous
- G10K2210/505—Echo cancellation, e.g. multipath-, ghost- or reverberation-cancellation
Definitions
- This disclosure pertains to devices, systems and methods for implementing echo management.
- Audio devices having acoustic echo management systems are widely deployed.
- An acoustic echo management system may include an acoustic echo canceller and/or an acoustic echo suppressor.
- the terms “speaker,” “loudspeaker” and “audio reproduction transducer” are used synonymously to denote any sound-emitting transducer (or set of transducers).
- a typical set of headphones includes two speakers.
- a speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), which may be driven by a single, common speaker feed or multiple speaker feeds.
- the speaker feed(s) may undergo different processing in different circuitry branches coupled to the different transducers.
- performing an operation “on” a signal or data (e.g., filtering, scaling, transforming, or applying gain to, the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., on a version of the signal that has undergone preliminary filtering or pre-processing prior to performance of the operation thereon).
- system is used in a broad sense to denote a device, system, or subsystem.
- a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X - M inputs are received from an external source) may also be referred to as a decoder system.
- processor is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, or video or other image data).
- Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.
- Coupled is used to mean either a direct or indirect connection.
- that connection may be through a direct connection, or through an indirect connection via other devices and connections.
- a “smart device” is an electronic device, generally configured for communication with one or more other devices (or networks) via various wireless protocols such as Bluetooth, Zigbee, near-field communication, Wi-Fi, light fidelity (Li-Fi), 3G, 4G, 5G, etc., that can operate to some extent interactively and/or autonomously.
- Examples of smart devices include smartphones, smart cars, smart thermostats, smart doorbells, smart locks, smart refrigerators, phablets and tablets, smartwatches, smart bands, smart key chains and smart audio devices.
- the term “smart device” may also refer to a device that exhibits some properties of ubiquitous computing, such as artificial intelligence.
- a single-purpose audio device is a device (e.g., a television (TV)) including or coupled to at least one microphone (and optionally also including or coupled to at least one speaker and/or at least one camera), and which is designed largely or primarily to achieve a single purpose.
- a modern TV runs some operating system on which applications run locally, including the application of watching television.
- a single-purpose audio device having speaker(s) and microphone(s) is often configured to run a local application and/or service to use the speaker(s) and microphone(s) directly.
- Some single-purpose audio devices may be configured to group together to achieve playing of audio over a zone or user configured area.
- multi-purpose audio device is an audio device that implements at least some aspects of virtual assistant functionality, although other aspects of virtual assistant functionality may be implemented by one or more other devices, such as one or more servers with which the multi-purpose audio device is configured for communication.
- a multi-purpose audio device may be referred to herein as a “virtual assistant.”
- a virtual assistant is a device (e.g., a smart speaker or voice assistant integrated device) including or coupled to at least one microphone (and optionally also including or coupled to at least one speaker and/or at least one camera).
- a virtual assistant may provide an ability to utilize multiple devices (distinct from the virtual assistant) for applications that are in a sense cloud-enabled or otherwise not completely implemented in or on the virtual assistant itself.
- virtual assistant functionality (e.g., speech recognition functionality) may be implemented (at least in part) by one or more servers or other devices with which a virtual assistant may communicate via a network, such as the Internet.
- Virtual assistants may sometimes work together, e.g., in a discrete and conditionally defined way. For example, two or more virtual assistants may work together in the sense that one of them, e.g., the one which is most confident that it has heard a wakeword, responds to the wakeword.
- the connected virtual assistants may, in some implementations, form a sort of constellation, which may be managed by one main application which may be (or implement) a virtual assistant.
- wakeword is used in a broad sense to denote any sound (e.g., a word uttered by a human, or some other sound), where a smart audio device is configured to awake in response to detection of (“hearing”) the sound (using at least one microphone included in or coupled to the smart audio device, or at least one other microphone).
- to “awake” denotes that the device enters a state in which it awaits (in other words, is listening for) a sound command.
- a “wakeword” may include more than one word, e.g., a phrase.
- wakeword detector denotes a device configured (or software that includes instructions for configuring a device) to search continuously for alignment between real-time sound (e.g., speech) features and a trained model.
- a wakeword event is triggered whenever it is determined by a wakeword detector that the probability that a wakeword has been detected exceeds a predefined threshold.
- the threshold may be a predetermined threshold which is tuned to give a reasonable compromise between rates of false acceptance and false rejection.
- a device Following a wakeword event, a device might enter a state (which may be referred to as an “awakened” state or a state of “attentiveness”) in which it listens for a command and passes on a received command to a larger, more computationally-intensive recognizer.
- the terms “program stream” and “content stream” refer to a collection of one or more audio signals, and in some instances video signals, at least portions of which are meant to be heard together. Examples include a selection of music, a movie soundtrack, a movie, a television program, the audio portion of a television program, a podcast, a live voice call, a synthesized voice response from a smart assistant, etc.
- the content stream may include multiple versions of at least a portion of the audio signals, e.g., the same dialogue in more than one language. In such instances, only one version of the audio data or portion thereof (e.g., a version corresponding to a single language) is intended to be reproduced at one time.
- At least some aspects of the present disclosure may be implemented via one or more audio processing methods.
- the method(s) may be implemented, at least in part, by a control system and/or via instructions (e.g., software) stored on one or more non-transitory media.
- Some such methods may involve receiving, by a control system, location information for each of a plurality of audio devices in an audio environment.
- Some such methods may involve generating, by the control system and based at least in part on the location information, rendering information for a plurality of audio devices in an audio environment.
- Some such methods may involve determining, by the control system and based at least in part on the rendering information, a plurality of echo reference metrics.
- each echo reference metric of the plurality of echo reference metrics may correspond to audio data reproduced by one or more audio devices of the plurality of audio devices.
- the rendering information may be, or may include, a matrix of loudspeaker activations.
- at least one echo reference metric may correspond to a level of a corresponding echo reference, a uniqueness of the corresponding echo reference, a temporal persistence of the corresponding echo reference, an audibility of the corresponding echo reference, or one or more combinations thereof.
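- As a minimal illustrative sketch (not taken from this publication), the metric types named above might be computed from time-domain echo reference signals as follows; the function name, frame length, thresholds and the assumed -60 dB audibility floor are all assumptions:

```python
import numpy as np

def echo_reference_metrics(ref, other_refs, frame_len=512, eps=1e-12):
    """Illustrative metrics for one echo reference `ref` (a 1-D
    time-domain array), compared against the other references."""
    # Level: RMS of the reference signal.
    level = float(np.sqrt(np.mean(ref ** 2)))

    # Uniqueness: one minus the largest normalized cross-correlation
    # (at zero lag) with any other reference; highly correlated
    # content adds little beyond the references already in use.
    corrs = [abs(np.dot(ref, o))
             / (np.linalg.norm(ref) * np.linalg.norm(o) + eps)
             for o in other_refs]
    uniqueness = 1.0 - (max(corrs) if corrs else 0.0)

    # Temporal persistence: fraction of frames carrying significant
    # energy relative to the mean frame energy (threshold assumed).
    n = len(ref) // frame_len
    frames = ref[: n * frame_len].reshape(n, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    persistence = float(np.mean(energy > 0.1 * (np.mean(energy) + eps)))

    # Audibility proxy: level in dB above an assumed -60 dBFS floor.
    audibility = max(0.0, 20 * np.log10(level + eps) + 60.0)

    return {"level": level, "uniqueness": uniqueness,
            "persistence": persistence, "audibility": audibility}
```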
- the method also may involve receiving, by the control system, a content stream that includes audio data and corresponding metadata.
- determining the at least one echo reference metric may be based, at least in part, on one or more of loudspeaker metadata, metadata corresponding to received audio data or an upmixing matrix.
- the control system may be, or may include, an audio device control system.
- the method may involve making, by the control system and based at least in part on the echo reference metrics, an importance estimation for each echo reference of a plurality of echo references.
- making the importance estimation may involve determining an expected contribution of each echo reference to mitigation of echo by at least one echo management system of at least one audio device of the audio environment.
- the at least one echo management system may be, or may include, an acoustic echo canceller (AEC), an acoustic echo suppressor (AES), or both an AEC and an AES.
- AEC acoustic echo canceller
- AES acoustic echo suppressor
- the method may involve selecting, by the control system and based at least in part on the importance estimation, one or more selected echo references. In some such implementations, the method may involve providing, by the control system, the one or more selected echo references to the at least one echo management system.
- the method also may involve causing the at least one echo management system to cancel or suppress echoes based, at least in part, on the one or more selected echo references.
- making the importance estimation may involve determining an importance metric for a corresponding echo reference.
- determining the importance metric may be based, at least in part, on a current listening objective, a current ambient noise estimate, or both a current listening objective and a current ambient noise estimate.
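- A hedged sketch of how such an importance metric might combine per-reference metrics with a current listening objective and a current ambient noise estimate follows; the weights and objective names are illustrative assumptions, not values from this disclosure:

```python
import numpy as np

def estimate_importance(metrics, listening_objective="wakeword",
                        ambient_noise_db=-50.0):
    """Illustrative scalar importance for one echo reference, using
    the metrics dictionary from echo_reference_metrics() above."""
    # References masked by ambient noise contribute little residual
    # echo, so discount anything near or below the noise estimate.
    level_db = 20 * np.log10(metrics["level"] + 1e-12)
    headroom = max(0.0, level_db - ambient_noise_db)

    # During continuous listening (e.g., for a wakeword), residual
    # echo is costly, so weight unique, persistent references higher.
    w_unique = 2.0 if listening_objective == "wakeword" else 1.0
    return headroom * (w_unique * metrics["uniqueness"]
                       + metrics["persistence"])
```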
- the method also may involve making, by the control system, a cost determination.
- the cost determination may involve determining a cost for at least one echo reference of the plurality of echo references.
- selecting the one or more selected echo references may be based, at least in part, on the cost determination.
- the cost determination may be based on the network bandwidth required for transmitting the at least one echo reference, an encoding computational requirement for encoding the at least one echo reference, a decoding computational requirement for decoding the at least one echo reference, an echo management system computational requirement for use of the at least one echo reference by the at least one echo management system, or one or more combinations thereof.
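- As a rough illustration, such a cost determination might be reduced to a single scalar per echo reference; the weighting of network terms against compute terms below is an assumption:

```python
def echo_reference_cost(bitrate_kbps, encode_mips=0.0, decode_mips=0.0,
                        ems_mips=0.0, w_network=1.0, w_compute=0.5):
    """Illustrative scalar cost of carrying one echo reference:
    a weighted sum of network bandwidth and compute terms."""
    return (w_network * bitrate_kbps
            + w_compute * (encode_mips + decode_mips + ems_mips))

# For instance, a full-fidelity 48 kHz / 16-bit mono replica costs
# roughly 768 kbps on the network before coding, while banded power
# information might cost only a few kbps, at the price of a less
# capable echo model at the receiving device.
```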
- the method also may involve determining a current echo management system performance level. According to some such examples, the importance estimation may be based, at least in part, on the current echo management system performance level.
- the method also may involve receiving, by the control system, scene change metadata.
- the importance estimation may be based, at least in part, on the scene change metadata.
- the method also may involve rendering the audio data, based at least in part on the rendering information, to produce rendered audio data.
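- When the rendering information is a matrix of loudspeaker activations (as noted above), rendering can be sketched as a single matrix multiplication; the shapes and function name below are illustrative assumptions:

```python
import numpy as np

def render_speaker_feeds(activations: np.ndarray,
                         audio_objects: np.ndarray) -> np.ndarray:
    """Apply a rendering matrix of loudspeaker activations.
    `activations` is (n_loudspeakers, n_objects); `audio_objects`
    is (n_objects, n_samples). Row i of the result is the speaker
    feed for audio device i, which can also serve directly as that
    device's local echo reference."""
    return activations @ audio_objects
```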
- the control system may be, or may include, an orchestrating device control system.
- the method also may involve providing at least a portion of the rendered audio data to each audio device of the plurality of audio devices.
- the method also may involve providing at least one echo reference metric to each audio device of the plurality of audio devices.
- the method also may involve generating, by the control system, at least one virtual echo reference corresponding to two or more audio devices of the plurality of audio devices.
- the method also may involve determining, by the control system, a weighted summation of echo references over a range of low frequencies. According to some such examples, the method may involve providing the weighted summation to at least one echo management system.
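- A minimal sketch of such a weighted low-frequency summation, assuming time-aligned reference signals and a fourth-order Butterworth low-pass; the 200 Hz cutoff and the weights are assumptions, not values from this disclosure:

```python
import numpy as np
from scipy.signal import butter, lfilter

def low_frequency_echo_reference(references, weights,
                                 sample_rate=48000, cutoff_hz=200.0):
    """Illustrative weighted summation of several echo references
    into a single low-frequency reference. Low frequencies from
    different devices are hard to separate acoustically, so one
    combined reference can stand in for several per-device ones."""
    mix = sum(w * np.asarray(r) for w, r in zip(weights, references))
    b, a = butter(4, cutoff_hz / (sample_rate / 2), btype="low")
    return lfilter(b, a, mix)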
- Non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. Accordingly, some innovative aspects of the subject matter described in this disclosure can be implemented via one or more non-transitory media having software stored thereon. At least some aspects of the present disclosure may be implemented via apparatus.
- an apparatus is, or includes, an audio processing system having an interface system and a control system.
- the control system may include one or more general purpose single- or multi-chip processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or combinations thereof.
- Figure 1A is a block diagram that shows examples of components of an apparatus capable of implementing various aspects of this disclosure.
- Figure 1B shows an example of an audio environment.
- Figures 1C and 1D show examples of how playback channels may be received by the audio devices 110A-110C.
- Figure 1E shows another example of an audio environment.
- Figure 2A presents a block diagram of an audio device that is capable of performing at least some disclosed implementations.
- Figures 2B and 2C show additional examples of audio devices in an audio environment.
- Figure 3A presents a block diagram that shows components of an audio device according to one example.
- Figures 3B and 3C are graphs that show examples of the expected echo management performance versus the number of echo references used for echo management.
- Figure 4 presents a block diagram that shows components of an echo reference orchestrator according to one example.
- Figure 5A is a flow diagram that outlines one example of a disclosed method.
- Figure 5B is a flow diagram that outlines another example of a disclosed method.
- Figure 6 is a flow diagram that outlines one example of a disclosed method.
- Figures 7 and 8 show block diagrams that include components of echo reference orchestrators according to some alternative examples.
- Figure 9A shows an example of a graph that shows locations of a listener and audio devices in an audio environment.
- Figure 9B shows examples of graphs corresponding to a rendering matrix for each of the audio devices shown in Figure 9A.
- Figures 10A and 10B show examples of graphs indicating spatial audio object counts for a single song.
- Figures 11A and 11B show examples of a spatially informed correlation matrix and an uninformed rendering correlation matrix.
- Figures 12A, 12B and 12C show examples of echo reference importance rankings based on a PCM-based correlation matrix, a spatially informed correlation matrix and an uninformed correlation matrix, respectively.
- Figure 13 illustrates a simplified example of determining a virtual echo reference.
- Figure 14 shows an example of a low-frequency management module.
- Figures 15A and 15B show examples of low-frequency management for implementations with and without a subwoofer.
- Figure 15C illustrates elements that may be used to implement a higher-frequency management method according to one example.
- Figure 16 is a block diagram that outlines another example of a disclosed method.
- Figure 17 is a flow diagram that outlines another example of a disclosed method.
- Figure 18 shows an example of a floor plan of an audio environment, which is a living space in this example.
- Figure 1A is a block diagram that shows examples of components of an apparatus capable of implementing various aspects of this disclosure.
- the types and numbers of elements shown in Figure 1A are merely provided by way of example. Other implementations may include more, fewer and/or different types and numbers of elements.
- the apparatus 50 may be configured for performing at least some of the methods disclosed herein.
- the apparatus 50 may be, or may include, one or more components of an audio system.
- the apparatus 50 may be an audio device, such as a smart audio device, in some implementations.
- In some examples, the apparatus 50 may be a mobile device (such as a cellular telephone), a laptop computer, a tablet device, a television or another type of device.
- the apparatus 50 may be, or may include, a server.
- the apparatus 50 may be, or may include, an encoder.
- the apparatus 50 may be a device that is configured for use within an audio environment, such as a home audio environment, whereas in other instances the apparatus 50 may be a device that is configured for use in “the cloud,” e.g., a server.
- the apparatus 50 includes an interface system 55 and a control system 60.
- the interface system 55 may, in some implementations, be configured for communication with one or more other devices of an audio environment.
- the audio environment may, in some examples, be a home audio environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, etc.
- the interface system 55 may, in some implementations, be configured for exchanging control information and associated data with audio devices of the audio environment.
- the control information and associated data may, in some examples, pertain to one or more software applications that the apparatus 50 is executing.
- the interface system 55 may, in some implementations, be configured for receiving, or for providing, a content stream.
- the content stream may include audio data.
- the audio data may include, but may not be limited to, audio signals.
- the audio data may include spatial data, such as channel data and/or spatial metadata. Metadata may, for example, have been provided by what may be referred to herein as an “encoder.”
- the content stream may include video data and audio data corresponding to the video data.
- the interface system 55 may include one or more network interfaces and/or one or more external device interfaces (such as one or more universal serial bus (USB) interfaces). According to some implementations, the interface system 55 may include one or more wireless interfaces.
- the interface system 55 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system and/or a gesture sensor system.
- the interface system 55 may include one or more interfaces between the control system 60 and a memory system, such as the optional memory system 65 shown in Figure 1A. However, the control system 60 may include a memory system in some instances.
- the interface system 55 may, in some implementations, be configured for receiving input from one or more microphones in an environment.
- the control system 60 may, for example, include a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.
- control system 60 may reside in more than one device.
- a portion of the control system 60 may reside in a device within one of the environments depicted herein and another portion of the control system 60 may reside in a device that is outside the environment, such as a server, a mobile device (e.g., a smartphone or a tablet computer), etc.
- a portion of the control system 60 may reside in a device within one of the environments depicted herein and another portion of the control system 60 may reside in one or more other devices of the environment.
- control system functionality may be distributed across multiple smart audio devices of an environment, or may be shared by an orchestrating device (such as what may be referred to herein as a smart home hub) and one or more other devices of the environment.
- a portion of the control system 60 may reside in a device that is implementing a cloud-based service, such as a server, and another portion of the control system 60 may reside in another device that is implementing the cloud-based service, such as another server, a memory device, etc.
- the interface system 55 also may, in some examples, reside in more than one device.
- control system 60 may be configured for performing, at least in part, the methods disclosed herein. According to some examples, the control system 60 may be configured to obtain a plurality of echo references.
- the plurality of echo references may include at least one echo reference for each audio device of a plurality of audio devices in an audio environment. Each echo reference may, for example, correspond to audio data being played back by one or more loudspeakers of one audio device of the plurality of audio devices.
- control system 60 may be configured to make an importance estimation for each echo reference of the plurality of echo references. In some examples, making the importance estimation may involve determining an expected contribution of each echo reference to mitigation of echo by at least one echo management system of at least one audio device of the audio environment.
- the echo management system(s) may include an acoustic echo canceller (AEC) and/or an acoustic echo suppressor (AES).
- control system 60 may be configured to select based at least in part on the importance estimation, one or more selected echo references. In some examples, the control system 60 may be configured to provide the one or more selected echo references to the at least one echo management system.
- Non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc.
- the one or more non-transitory media may, for example, reside in the optional memory system 65 shown in Figure 1A and/or in the control system 60.
- the software may, for example, include instructions for controlling at least one device to perform some or all of the methods disclosed herein.
- the software may, for example, be executable by one or more components of a control system such as the control system 60 of Figure 1 A.
- the apparatus 50 may include the optional microphone system 70 shown in Figure 1A.
- the optional microphone system 70 may include one or more microphones.
- the optional microphone system 70 may include an array of microphones.
- the array of microphones may be configured to determine direction of arrival (DOA) and/or time of arrival (TOA) information, e.g., according to instructions from the control system 60.
- the array of microphones may, in some instances, be configured for receive-side beamforming, e.g., according to instructions from the control system 60.
- one or more of the microphones may be part of, or associated with, another device, such as a speaker of the speaker system, a smart audio device, etc.
- the apparatus 50 may not include a microphone system 70.
- the apparatus 50 may nonetheless be configured to receive microphone data for one or more microphones in an audio environment via the interface system 55.
- a cloud-based implementation of the apparatus 50 may be configured to receive microphone data, or data corresponding to the microphone data, from one or more microphones in an audio environment via the interface system 55.
- the apparatus 50 may include the optional loudspeaker system 75 shown in Figure 1A.
- the optional loudspeaker system 75 may include one or more loudspeakers, which also may be referred to herein as “speakers” or, more generally, as “audio reproduction transducers.”
- the apparatus 50 may not include a loudspeaker system 75.
- the apparatus 50 may include the optional sensor system 80 shown in Figure 1A.
- the optional sensor system 80 may include one or more touch sensors, gesture sensors, motion detectors, etc.
- the optional sensor system 80 may include one or more cameras.
- the cameras may be free-standing cameras.
- one or more cameras of the optional sensor system 80 may reside in a smart audio device, which may be a single purpose audio device or a virtual assistant.
- one or more cameras of the optional sensor system 80 may reside in a television, a mobile phone or a smart speaker.
- the apparatus 50 may not include a sensor system 80. However, in some such implementations the apparatus 50 may nonetheless be configured to receive sensor data for one or more sensors in an audio environment via the interface system 55.
- the apparatus 50 may include the optional display system 85 shown in Figure 1A.
- the optional display system 85 may include one or more displays, such as one or more light-emitting diode (LED) displays.
- the optional display system 85 may include one or more organic light-emitting diode (OLED) displays.
- the optional display system 85 may include one or more displays of a smart audio device.
- the optional display system 85 may include a television display, a laptop display, a mobile device display, or another type of display.
- the sensor system 80 may include a touch sensor system and/or a gesture sensor system proximate one or more displays of the display system 85.
- the control system 60 may be configured for controlling the display system 85 to present one or more graphical user interfaces (GUIs).
- the apparatus 50 may be, or may include, a smart audio device.
- the apparatus 50 may be, or may include, a wakeword detector.
- the apparatus 50 may be, or may include, a virtual assistant.
- For playback media that is stereo or mono, traditionally it has been rendered into an audio environment (e.g., a living space, automobile, office space, etc.) via a pair of speakers physically wired to an audio player (e.g., a CD/DVD player, a television (TV), etc.).
- As smart speakers have become popular, users often have more than two audio devices configured for wireless communication (which may include, but are not limited to, smart speakers or other smart audio devices) in their homes (or other audio environments) that are capable of playing back audio.
- Smart speakers are often configured to operate according to voice commands. Accordingly, such smart speakers are generally configured to listen continuously for a wakeword, which will normally be followed by a voice command. Any continuous listening task such as waiting for a wakeword, or performing any kind of “continuous calibration,” will preferably continue to function during the playback of content (such as the playback of music, the playback of sound tracks for movies and television programs, etc.) and while device interactions take place (e.g., during telephone calls). Audio devices that need to listen during the playback of content will typically need to employ some form of echo management, e.g., echo cancellation and/or echo suppression, to remove the “echo” (content played by the devices) from microphone signals.
- Figure 1B shows an example of an audio environment.
- the types, numbers and arrangement of elements shown in Figure 1B are merely provided by way of example.
- Other implementations may include more, fewer and/or different types, numbers and/or arrangements of elements.
- the audio environment 100 includes audio devices 110A, 110B and 110C.
- each of the audio devices 110A-110C is an instance of the apparatus 50 of Figure 1A and includes an instance of the microphone system 70 and the loudspeaker system 75, though these are not shown in Figure 1B.
- each of the audio devices 110A-110C may be a smart audio device, such as a smart speaker.
- the audio devices 110A-110C are playing back audio content while a person 130 is talking.
- the microphones of audio device 110B detect not only the audio content played back by its own speaker, but also the speech sounds 131 of the person 130 and the audio content played back by the audio devices 110A and 110C.
- a typical approach is for all of the audio devices in an audio environment to play back the same content, with some timing mechanism to keep the playback media in synchronization. This has the advantage of making distribution simple, because all the devices receive the same copy of the playback media either downloaded or streamed to each audio device, or broadcast by one device and multicast to all the audio devices.
- a spatial effect may be achieved by adding more playback channels (e.g., one per speaker), e.g., through upmixing.
- a spatial effect may be achieved via a flexible rendering process such as Center of Mass Amplitude Panning (CMAP), Flexible Virtualization (FV), or a combination of CMAP and FV.
- Relevant examples of CMAP, FV and combinations thereof are described in International Patent Publication No. WO 2021/021707 A1 (e.g., on pages 25-41), which is hereby incorporated by reference.
- Figures 1C and 1D show additional examples of audio devices in an audio environment.
- the audio environments 100 include a smart home hub 105 and audio devices 110A, 110B and 110C.
- the smart home hub 105 and the audio devices 110A-110C are instances of the apparatus 50 of Figure 1A.
- each of the audio devices 110A-110C includes a corresponding one of the loudspeakers 121A, 121B and 121C.
- each of the audio devices 110A-110C may be a smart audio device, such as a smart speaker.
- Figures 1C and 1D show examples of how playback channels may be received by the audio devices 110A-110C.
- an encoded audio bitstream is multicast to all of the audio devices 110A-110C.
- each of the audio devices 110A-110C receives only the channel that the audio device needs for playback.
- the choice of bitstream distribution may vary according to the individual implementation and may, for example, be based on the available system bandwidth, the coding efficiency of the audio codec used, the capabilities of the audio devices 110A-110C and/or other factors.
- the exact topologies of the audio environments shown in Figures 1C and 1D are not important. However, these examples illustrate the fact that distributing audio channels to audio devices will incur some cost. The cost may be assessed in terms of the required network bandwidth, the added computational cost of encoding and decoding the channels of audio, etc.
- Figure 1E shows another example of an audio environment.
- the audio environment 100 includes audio devices 110A, 110B, 110C and 110D.
- each of the audio devices 110A-110D is an instance of the apparatus 50 of Figure 1A and includes at least one microphone (see microphones 120A, 120B, 120C and 120D) and at least one loudspeaker (see loudspeakers 121A, 121B, 121C and 121D).
- each of the audio devices 110A-110D may be a smart audio device, such as a smart speaker.
- the audio devices 110A-110D are rendering content 122A, 122B, 122C and 122D via the loudspeakers 121A-121D.
- the “echo” corresponding to the content 122A-122D played back by each of the audio devices 110A-110D is detected by each of the microphones 120A-120D.
- the audio devices 110A-110D are configured to listen for a command or wakeword in the speech 131 from the person 130 within the audio environment 100.
- Figure 2A presents a block diagram of an audio device that is capable of performing at least some disclosed implementations.
- the types, numbers and arrangement of elements shown in Figure 2A are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and/or arrangements of elements.
- the audio device 110A is an instance of the audio device 110A of Figure 1E.
- the audio device 110A includes a control system 60a, which is an instance of the control system 60 of Figure 1A.
- the control system 60a is capable of listening to speech 131 of the person 130 in the presence of echo corresponding to the content 122A, 122B, 122C and 122D played back by each audio device in the audio environment 100.
- the control system 60a is implementing a renderer 201A, a multi-channel acoustic echo management system (MC-EMS) 203A and a speech processing block 240A.
- the MC-EMS 203A may include an acoustic echo canceller (AEC), an acoustic echo suppressor (AES), or both an AEC and an AES, depending on the particular implementation.
- the speech processing block 240A is configured to detect user wakewords and commands.
- the speech processing block 240A may be configured for supporting a communications session, such as a telephone call.
- the renderer 201A is configured to provide a local echo reference 220A to the MC-EMS 203A.
- the local echo reference 220A corresponds to (and in this example is identical to) the speaker feed signals provided to the loudspeaker 121A for playback by the audio device 110A.
- the renderer 201A is also configured to provide non-local echo references 221A (corresponding to the content 122B, 122C and 122D played back by the other audio devices in the audio environment 100) to the MC-EMS 203A.
- the audio device 110A receives a combined bitstream (e.g., as shown in Figure 1C) that includes audio data for all of the audio devices 110A-110D of Figure 1E.
- the renderer 201A may be configured to isolate the local echo reference 220A from the non-local echo references 221A, to provide the local echo reference 220A to the loudspeaker 121A and to provide the local echo reference 220A and the non-local echo references 221A to the MC-EMS 203A.
- the audio device 110A may receive a bitstream that is only intended for playback on the audio device 110A, e.g., as shown in Figure 1D.
- the smart home hub 105 (or the other audio devices 110B-110D) may provide the non-local echo references 221A to the audio device 110A, as suggested by the dashed arrow next to reference number 221A in Figure 2A.
- the local echo reference 220A and/or the non-local echo references 221A may be full-fidelity replicas of the speaker feed signals provided to the loudspeakers 121A-121D for playback.
- the local echo reference 220A and/or the non-local echo references 221A may be lower-fidelity representations of the speaker feed signals provided to the loudspeakers 121A-121D for playback.
- the non-local echo references 221A may be downsampled versions of the speaker feed signals provided to the loudspeakers 121B-121D for playback.
- the non-local echo references 221A may be lossy compressions of the speaker feed signals provided to the loudspeakers 121B-121D for playback.
- the non-local echo references 221A may be banded power information corresponding to the speaker feed signals provided to the loudspeakers 121B-121D for playback.
- the MC-EMS 203A is configured to use the local echo reference 220A and the non-local echo references 221A to predict and cancel and/or suppress the echo from microphone signals 223A, thereby producing the residual signal 224A in which the speech to echo ratio (SER) may have been improved with respect to that in the microphone signals 223A.
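- One conventional way an MC-EMS can use the local and non-local references to predict the echo is a bank of adaptive filters. The following generic multi-channel NLMS sketch illustrates the idea; it is not the specific algorithm of this disclosure, and the tap count and step size are assumptions:

```python
import numpy as np

def mc_nlms_cancel(mic, refs, taps=256, mu=0.1, eps=1e-8):
    """Generic multi-channel NLMS echo canceller: one adaptive FIR
    filter per echo reference. `mic` is (n_samples,); `refs` is
    (n_refs, n_samples). Returns the residual signal."""
    refs = np.atleast_2d(refs)
    n_refs = refs.shape[0]
    w = np.zeros((n_refs, taps))          # adaptive filter per reference
    bufs = np.zeros((n_refs, taps))       # per-reference delay lines
    residual = np.zeros(len(mic))
    for n in range(len(mic)):
        bufs = np.roll(bufs, 1, axis=1)   # shift delay lines one sample
        bufs[:, 0] = refs[:, n]           # newest sample of each reference
        echo_hat = np.sum(w * bufs)       # echo predicted from all refs
        e = mic[n] - echo_hat             # residual after cancellation
        residual[n] = e
        w += mu * e * bufs / (np.sum(bufs ** 2) + eps)   # NLMS update
    return residual
```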
- This residual signal 224A may enable the speech processing block 240A to detect user wakewords and commands.
- the speech processing block 240A may be configured for supporting a communications session, such as a telephone call.
- Some aspects of this disclosure involve making an importance estimation for each echo reference of a plurality of echo references (e.g., for the local echo reference 220A and the non-local echo references 221A). Making the importance estimation may involve determining an expected contribution of each echo reference to mitigation of echo by at least one echo management system of at least one audio device of the audio environment (e.g., the mitigation of echo by the MC-EMS 203A of audio device 110A). Various examples are provided below.
- each audio device may obtain the echo references corresponding to what is played back by one or more other audio devices in an audio environment, in addition to its own echo reference.
- the impact of including a particular echo reference in a local echo management system or “EMS” may vary according to a multitude of parameters, such as the diversity of the audio content being played out, the network bandwidth required for transmitting the echo reference, the encoding computational requirement for encoding an echo reference if an encoded echo reference is transmitted, the decoding computational requirement for decoding the echo reference, the echo management system computational requirement for using the echo reference by the echo management system, the relative audibility of the audio devices, etc.
- some implementations may provide a distributed and orchestrated EMS (DOEMS), wherein echo references are prioritized and transmitted (or not) accordingly.
- Some such examples may implement a tradeoff between the cost (e.g., network bandwidth required and/or computational overhead required) and the benefit (e.g., the expected echo mitigation improvement, which may be measured according to the signal-to-echo ratio (SER) and/or echo return loss enhancement (ERLE)) of each additional echo reference.
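- Such a tradeoff might be implemented, for example, as a greedy selection under a network budget; the benefit-per-cost heuristic below is an illustrative assumption, not the publication's specific method:

```python
def select_echo_references(candidates, budget_kbps):
    """Greedy cost/benefit selection sketch. `candidates` is a list
    of (reference_id, importance, cost_kbps) tuples; references are
    taken in order of importance per unit cost until the assumed
    network budget is spent."""
    ranked = sorted(candidates,
                    key=lambda c: c[1] / (c[2] + 1e-9), reverse=True)
    selected, spent = [], 0.0
    for ref_id, importance, cost in ranked:
        if spent + cost <= budget_kbps:
            selected.append(ref_id)
            spent += cost
    return selected
```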
- Figures 2B and 2C show additional examples of audio devices in an audio environment.
- the audio environments 100 include a smart home hub 105 and audio devices 110A, 110B and 110C.
- the smart home hub 105 and the audio devices 110A-110C are instances of the apparatus 50 of Figure 1A.
- each of the audio devices 110A-110C includes a corresponding one of the microphones 120A, 120B and 120C and a corresponding one of the loudspeakers 121A, 121B and 121C.
- each of the audio devices 110A-110C may be a smart audio device, such as a smart speaker.
- the smart home hub 105 sends the same encoded audio bitstream to all of the audio devices 110A-110C.
- the smart home hub 105 sends only the audio channel that each of the audio devices 110A-110C needs for playback.
- audio channel 0 is intended for playback on audio device 110A
- audio channel 1 is intended for playback on audio device 110B
- audio channel 2 is intended for playback on audio device 110C.
- Figures 2B and 2C show examples of echo reference data being shared across a local network.
- the audio device 110A is sending echo reference 220A’, which is an echo reference corresponding to the loudspeaker playback of the audio device 110A, over the local network to the audio devices 110B and 110C.
- the echo reference 220A’ is different from the channel 0 audio found in the bitstream.
- the echo reference 220A’ may be different from the channel 0 audio because of playback post-processing being implemented on the audio device 110A.
- the combined bitstream is not provided to all of the audio devices 110A-110C, so another device (such as the audio device 110A or the smart home hub 105) provides the echo reference 220A’.
- the echo reference 220A’ may nonetheless need to be transmitted in some such instances.
- the echo reference 220A’ may be different from the channel 0 audio because the echo reference 220A’ may not be a full-fidelity replica of the audio data being played back on the audio device 110A.
- the echo reference 220A’ may correspond to the audio data being played back on the audio device 110A, but may require relatively less data than the complete replica and therefore may consume relatively less bandwidth of the local network when the echo reference 220A’ is transmitted.
- the audio device 110A may be configured for making a downsampled version of the local echo reference 220A that is described above with reference to Figure 2A.
- the echo reference 220A’ may be, or may include, the downsampled version.
- the audio device 110A may be configured for making a lossy compression of the local echo reference 220A.
- the echo reference 220A’ may be a result of the control system 60a applying a lossy compression algorithm to the local echo reference 220A.
- audio device 110A may be configured for providing banded power information corresponding to the local echo reference 220A to the audio devices 110B and 110C.
- instead of transmitting a full-fidelity replica of the audio data being played back on the audio device 110A, the control system 60a may be configured to determine a power level of the audio data being played back on the audio device 110A in each of a plurality of frequency bands and to transmit the corresponding banded power information to the audio devices 110B and 110C.
- the echo reference 220A’ may be, or may include, the banded power information.
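- A hedged sketch of producing such banded power information from a time-domain echo reference follows; the frame length, band count and logarithmic band spacing are assumptions, not parameters from this disclosure:

```python
import numpy as np

def banded_power_info(signal, sample_rate=48000, frame_len=1024,
                      n_bands=24, eps=1e-12):
    """Illustrative banded power representation of an echo reference:
    per-frame log power in logarithmically spaced frequency bands."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.abs(np.fft.rfft(frames * np.hanning(frame_len),
                                 axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, 1.0 / sample_rate)
    edges = np.geomspace(50.0, sample_rate / 2, n_bands + 1)
    powers = np.empty((n_frames, n_bands))
    for b in range(n_bands):
        in_band = (freqs >= edges[b]) & (freqs < edges[b + 1])
        powers[:, b] = 10 * np.log10(spectra[:, in_band].sum(axis=1) + eps)
    return powers   # far fewer bits than the PCM replica
```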
- Figure 3A presents a block diagram that shows components of an audio device according to one example.
- the types, numbers and arrangement of elements shown in Figure 3A are merely provided by way of example.
- Other implementations may include more, fewer and/or different types, numbers and/or arrangements of elements.
- some implementations may be configured to send and/or receive either “raw” echo references (which may be complete, full-fidelity replicas of audio being reproduced on an audio device) or a lower-fidelity version or representation of the audio being reproduced on an audio device (such as a downsampled version, a version produced by lossy compression, or banded power information corresponding to the audio being reproduced on an audio device), but not both the raw and lower-fidelity versions.
- the audio device 110A is an instance of the audio device 110A of Figure 1E and includes a control system 60a, which is an instance of the control system 60 of Figure 1A.
- the control system 60a is configured to implement a renderer 201A, a multi-channel acoustic echo management system (MC-EMS) 203A, a speech processing block 240A, an echo reference orchestrator 302A, a decoder 303A and a noise estimator 304A.
- the MC-EMS 203A and the speech processing block 240A function as described above with reference to Figure 2A unless the following description of Figure 3A indicates otherwise.
- the network interface 301A is an instance of the interface system 55 that is described above with reference to Figure 1A.
- 110A: an audio device;
- the audio device 110A may have more than one microphone in some implementations;
- the audio device 110A may have more than one loudspeaker in some implementations;
- 201A: a renderer that produces references for local playback and echo references to model the audio that is played back by the other audio devices in the audio environment;
- 203A: a multi-channel acoustic echo management system (MC-EMS), which may include an acoustic echo canceller (AEC) and/or an acoustic echo suppressor (AES);
- 220A: a local echo reference for playback and cancellation;
- 224A: a plurality of residual signals (the microphone signal after the MC-EMS 203A has cancelled and/or suppressed the predicted echo);
- 240A: a speech processing block configured for wakeword detection, voice command detection and/or providing telephonic communication;
- 301A: a network interface configured for communication between audio devices, which also may be configured for communication via the Internet and/or via one or more cellular networks;
- 302A: an echo reference orchestrator configured to rank echo references and select an appropriate set of one or more echo references;
- 304A: a noise estimator block;
- 310A: one or more decoded echo references received by audio device 110A from one or more other devices in the audio environment;
- 311A: a request for echo references to be sent over the local network from one or more other devices, such as a smart home hub or one or more of the audio devices 110B-110D;
- 312A: metadata, which may be, or may include, metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmixing matrix and/or a matrix of loudspeaker activations;
- 314A: echo references received by device 110A from one or more other devices;
- 315A: echo references sent from device 110A to other devices;
- 317A: lower-fidelity (e.g., coded) versions of echo references received by device 110A from one or more other devices of the audio environment;
- 318A: an audio environment noise estimate;
- 350A: one or more metrics indicating the current performance of the MC-EMS 203A, which may be, or may include, adaptive filter coefficient data or other AEC statistics, speech-to-echo ratio (SER) data, etc.
- the echo reference orchestrator 302A may function in various ways, depending on the particular implementation. Many examples are disclosed herein.
- the echo reference orchestrator 302A may be configured for making an importance estimation for each echo reference of a plurality of echo references (e.g., for the local echo reference 220A and the non-local echo references 221A). Making the importance estimation may involve determining an expected contribution of each echo reference to mitigation of echo by at least one echo management system of at least one audio device of the audio environment (e.g., the mitigation of echo by the MC-EMS 203A of audio device 110A).
- the importance metric may be based, at least in part, on one or more characteristics of each echo reference, such as level, uniqueness, temporal persistence, audibility, or one or more combinations thereof.
- the importance metric may be based, at least in part, on metadata (e.g., the metadata 312A), such as metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmixing matrix, a matrix of loudspeaker activations, or one or more combinations thereof.
- the importance metric may be based, at least in part, on a current listening objective, a current ambient noise estimate, an estimate of a current performance of at least one echo management system, or one or more combinations thereof.
- the echo reference orchestrator 302A may be configured for selecting a set of one or more echo references based, at least in part, on a cost determination.
- the echo reference orchestrator 302A may be configured to make the cost determination, whereas in other examples another block of the control system 60a may be configured to make the cost determination.
- the cost determination may involve determining a cost for at least one echo reference of a plurality of echo references, or in some cases for each of the plurality of echo references.
- the cost determination may be based on network bandwidth required for transmitting the echo reference, an encoding computational requirement for encoding the at least one echo reference, a decoding computational requirement for decoding the at least one echo reference, a downsampling cost of making a downsampled version of the echo reference, an echo management system computational requirement for use of the at least one echo reference by the echo management system, or one or more combinations thereof.
- the cost determination may be based on a replica of the at least one echo reference in a time domain or a frequency domain, on a downsampled version of the at least one echo reference, on a lossy compression of the at least one echo reference, on banded power information for the at least one echo reference, or one or more combinations thereof. In some instances, the cost determination may be based on a method of compressing a relatively more important echo reference less than a relatively less important echo reference.
- the echo reference orchestrator 302A (or another block of the control system 60a) may be configured for determining a current echo management system performance level (e.g., based at least in part on the metric(s) 350A). In some such examples, selecting the one or more selected echo references may be based, at least in part, on the current echo management system performance level.
- the rate at which the importance of each echo reference is estimated and the rate at which the set of echo references is evaluated may differ.
- the rate at which the importance is estimated need not be equal to the rate at which the echo reference selection process makes decisions. If the two are not synchronized, in some examples the importance calculation would be more frequent.
- the echo reference selection may be a discrete process wherein binary decisions are made either to include or not include particular echo references.
- Figures 3B and 3C are graphs that show examples of the expected echo management performance versus the number of echo references used for echo management.
- In Figure 3B, one may see that as additional references are added, the expected echo management performance increases. However, in this example there are only a few discrete points at which the system can operate. In some examples, the points shown in Figure 3B may correspond to processing complete, full-fidelity replicas of each echo reference.
- point 301 may correspond to an instance of processing a local echo reference (e.g., the local reference 220A of Figure 2A or Figure 3A) and point 310 may correspond to an instance of receiving a complete replica of a first non-local echo reference (e.g., a full-fidelity version of one of the received echo references 314A of Figure 3A, which may have been selected as the most important non-local echo reference) and processing both the local echo reference and the complete replica of the first non-local echo reference.
- Figure 3C illustrates one example of operating between any two of the discrete operating points that are shown in Figure 3B.
- the lines connecting the points in Figure 3B may, for example, correspond to a range of echo reference fidelities, including lower-fidelity versions or representations of each echo reference.
- points 303, 305 and 307 may correspond to copies, or representations, of the first non-local echo reference at increasing levels of fidelity, with point 303 corresponding to the lowest-fidelity representation and point 307 corresponding to the highest-fidelity representation other than the full-fidelity replica.
- point 303 may correspond to banded power information for the first non-local echo reference.
- points 305 and 307 may correspond to a relatively more lossy compression of the first non-local echo reference and a relatively less lossy compression of the first non-local echo reference, respectively.
- the fidelity of the copies, or representations, of the echo references will generally increase with the number of bits required for each such copy or representation. Accordingly, the fidelity of the copies, or representations, of the echo references provides an indication of the tradeoff between network cost (due to the varying number of bits required for transmission) and the expected echo management performance (because the performance should improve as the fidelity increases). Note that the straight lines used to connect the points in Figure 3C merely represent one of many different possible trajectories, in part because the incremental change from one echo reference to the next depends on which echo reference would be selected as the next echo reference and in part because there may not be a linear relationship between the expected echo management performance and fidelity.
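- The tradeoff just described lends itself to a simple tabular representation. The following is a hypothetical sketch (not taken from the present disclosure) of how a control system might represent operating points like those of Figures 3B and 3C as pairs of network cost and modeled EMS performance, and pick the best point that fits a bit budget; all names and numbers are illustrative assumptions.

```python
# Hypothetical sketch: operating points of Figures 3B/3C represented as
# (bits required, modeled EMS performance) pairs; pick the best point that
# fits a network bit budget. Values are illustrative only.
from dataclasses import dataclass

@dataclass
class OperatingPoint:
    description: str              # e.g., "banded power", "full replica"
    bits_per_frame: int           # network cost proxy for this fidelity level
    expected_performance: float   # modeled EMS performance (illustrative)

# A real system would obtain these values from an EMS performance model
# such as the MC-EMS performance model 405A.
points = [
    OperatingPoint("local reference only", 0, 0.40),
    OperatingPoint("banded power (303)", 512, 0.55),
    OperatingPoint("more lossy codec (305)", 4096, 0.65),
    OperatingPoint("less lossy codec (307)", 16384, 0.72),
    OperatingPoint("full-fidelity replica (310)", 65536, 0.80),
]

def best_point_within_budget(points, bit_budget):
    """Pick the highest expected performance whose cost fits the budget."""
    feasible = [p for p in points if p.bits_per_frame <= bit_budget]
    return max(feasible, key=lambda p: p.expected_performance) if feasible else None

print(best_point_within_budget(points, bit_budget=8000).description)
```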
- Figure 4 presents a block diagram that shows components of an echo reference orchestrator according to one example.
- the types, numbers and arrangement of elements shown in Figure 4 are merely provided by way of example.
- Other implementations may include more, fewer and/or different types, numbers and/or arrangements of elements.
- some implementations may be configured to send and/or receive either “raw” echo references (which may be full-fidelity replicas of audio being reproduced on an audio device) or lower-fidelity versions or representations of the audio being reproduced on an audio device (such as downsampled versions, versions produced by lossy compression, or banded power information corresponding to the audio being reproduced on an audio device), but not both the raw and lower-fidelity versions.
- some implementations of the echo reference orchestrator 302A may include a metadata-based metric computation module such as the metadata-based metric computation module 705 that is described herein with reference to Figures 7 et seq.
- the metadata-based metric computation module may generate EMS look-ahead statistics, based at least in part on scene change message(s) from a scene change analyzer, and may provide the EMS look-ahead statistics to the MC-EMS performance model 405A.
- the metadata-based metric computation module may generate echo reference characteristics from which the importance metrics 420 may be determined.
- the echo reference characteristics may be based, at least in part, on the metadata 312.
- the echo reference characteristics may be based, at least in part, on the audio scene change messages.
- the metadata-based metric computation module may provide the echo reference characteristics to the echo reference importance estimator 401 A.
- the metadata-based metric computation module may provide the echo reference characteristics to the echo reference selector 402A.
- the echo reference orchestrator 302A is an instance of the echo reference orchestrator 302A of Figure 3A and is implemented by an instance of the control system 60a of Figure 3A.
- the elements of Figure 4 are as follows:
- 220A: a local echo reference for playback and cancellation;
- 221A: a locally-produced copy of a non-local echo reference that another audio device of the audio environment is playing;
- 302A: the echo reference orchestrator, a module that is configured to rank and select a set of one or more echo references;
- 310A: one or more decoded echo references received by audio device 110A from one or more other devices in the audio environment;
- 311A: a request for echo references to be sent over the local network from one or more other devices of the audio environment;
- 312A: metadata, which may be, or may include, metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmixing matrix and/or a matrix of loudspeaker activations;
- 313A: a set of one or more echo references selected by the echo reference orchestrator 302A and sent to the MC-EMS 203A in this example;
- 317A: lower-fidelity (e.g., coded) versions of echo references received by device 110A from one or more other devices of the audio environment;
- 318A: an audio environment noise estimate;
- 350A: one or more metrics indicating the current performance of the MC-EMS 203A, which may be, or may include, adaptive filter coefficient data or other AEC statistics, speech-to-echo ratio (SER) data, etc.;
- 401A: an echo reference importance estimator, which is configured to estimate the expected importance of each echo reference and, in this example, to generate corresponding importance metrics 420A;
- 402A: an echo reference selector that is configured to select the set of echo references 313A, in this example based at least in part on the current listening objective (as indicated by 421A), the cost of each echo reference (as indicated by 422A), the current state/performance of the EMS (as indicated by 350A) and the estimated importance of each candidate echo reference (as indicated by importance metrics 420A);
- 403A: a cost estimation module that is configured to determine the cost(s) (e.g., the computational and/or network costs) of including an echo reference in the set of echo references 313A;
- 404A: an optional module that determines or estimates the current listening objective of the audio device 110A;
- 405A: a module configured to implement one or more MC-EMS performance models, which may in some examples produce data such as shown in Figure 3B or Figure 3C;
- 423A: information produced by the MC-EMS performance model 405A, which may in some examples be, or include, data such as shown in Figure 3B or Figure 3C; the information 423A may be referred to herein as “EMS health data.”
- the echo reference importance estimator 401A may function in various ways, depending on the particular implementation. Various examples are provided in this disclosure. In some examples, the echo reference importance estimator 401A may be configured for making an importance estimation for each echo reference of a plurality of echo references (e.g., for the local echo reference 220A and the non-local echo references 221A). Making the importance estimation may involve determining an expected contribution of each echo reference to mitigation of echo by at least one echo management system of at least one audio device of the audio environment (e.g., the mitigation of echo by the MC-EMS 203A of audio device 110A).
- making the importance estimation involves determining importance metrics 420A.
- the importance metrics 420A may be based, at least in part, on one or more characteristics of each echo reference, such as level, uniqueness, temporal persistence, audibility, or one or more combinations thereof.
- an importance metric may be based, at least in part, on metadata (e.g., the metadata 312A), which may include metadata corresponding to an audio device layout, loudspeaker metadata (e.g., the sound pressure level (SPL) ratings, frequency ranges, whether the loudspeaker is an upwards-firing loudspeaker, etc.), metadata corresponding to received audio data (e.g., positional metadata, metadata indicating vocals or other speech, etc.), an upmixing matrix, a matrix of loudspeaker activations, or one or more combinations thereof.
- the echo reference importance estimator 401A may provide importance metrics 420A to the MC-EMS performance model 405A.
- the importance metrics 420A are based, at least in part, on a current listening objective, as indicated by the information 421A.
- the current listening objective may significantly change how factors such as level, uniqueness, temporal persistence, audibility, etc., are evaluated.
- the importance analysis may be very different during a telephone call than when awaiting a wake word.
- the importance metrics 420A are based, at least in part, on the current ambient noise estimate 318A, the metric(s) 350A indicating the current performance of the MC-EMS 203A, information 423A produced by the MC-EMS performance model 405A, or one or more combinations thereof.
- the echo reference importance estimator 401A may determine that a relatively higher room noise level (as indicated by the current ambient noise estimate 318A) will make it less likely that adding an echo reference will help mitigate echo significantly.
- information 423A may correspond to the type of information that is described above with reference to Figures 3B and 3C, which may provide a direct correlation between the use of an echo reference and the expected increase in performance by the MC-EMS 203A.
- the performance of an EMS may be based in part on the robustness of the EMS when perturbed by noise in the audio environment.
- the echo reference selector 402A selects a set of one or more echo references based, at least in part, on one or more metrics 350A indicating the current performance of the MC-EMS 203A, the importance metrics 420A, the current listening objective 421A, information 422A indicating the cost(s) of including an echo reference in the set of echo references 313A and information 423A produced by the MC-EMS performance model 405A.
- the cost estimation module 403A is configured to determine the computational and/or network costs of including an echo reference in the set of echo references 313A.
- the computational cost may, for example, include the additional computational cost of use, by the MC-EMS 203A, of a particular echo reference. This computational cost may depend, in turn, on the number of bits required to represent the echo reference.
- the computational cost may include the computational cost of a lossy echo reference encoding process and/or the computational cost of a corresponding echo reference decoding process. Determining the network costs may involve determining the amount of data required to send a complete replica of an echo reference or a copy or representation of the echo reference across a local data network (e.g., a local wireless data network).
- the echo reference selection block 402A may generate and transmit a request 311A for another device in the audio environment to send one or more echo references to it over the network.
- element 314A of Figure 3A indicates one or more echo references being received by the audio device 110A, which may in some instances have been sent responsive to a request 311A.
- the request 311A may specify the fidelity of the requested echo reference, e.g., whether a “raw” copy (a full-fidelity replica) of the echo reference should be sent, whether an encoded version of the echo reference should be sent and, if an encoded version is to be sent, whether a relatively more or relatively less lossy compression algorithm should be applied to the echo reference, whether banded power information corresponding to the echo reference should be sent, etc.
- a request for an encoded echo reference not only introduces a network cost due to sending the request and the reference, but also adds a computational cost for the responding device(s) (e.g., the smart home hub 105 or one or more of the audio devices 110B-110D) that must encode the reference, as well as the computational cost for the audio device 110A to decode the received reference.
- this encoding cost may be a one-time cost. Accordingly, the request from one audio device to another to send an encoded reference over the network changes the potential performance/cost tradeoff being performed in other devices of the audio environment.
- one or more of the blocks of the echo reference orchestrator 302A may be performed by an orchestrating device, e.g., the smart home hub 105 or one of the audio devices 110A-110D.
- at least some functionality of the echo reference importance estimator 401A and/or the echo reference selection block 402A may be performed by the orchestrating device.
- Some such implementations may be capable of determining cost/benefit trade-offs on a systemwide basis, taking into account the performance enhancements of all instances of the MC-EMS in the audio environment, the overall computational demands for all instances of the MC-EMS, the overall demands on the local network and/or the overall computational demands for all encoders and decoders.
- the importance metric (which may be referred to herein as “Importance” or “I”) may be a measure of the expected improvement in performance of an EMS due to the inclusion of a particular echo reference.
- Importance may depend on the present state of the EMS, particularly on the set of echo references already in use and at what level of fidelity they are being received. Importance may be available at different timescales, depending on the particular implementation. On one extreme, Importance may be implemented on a frame-by-frame basis (e.g., according to an Importance signal for each frame).
- Importance may be implemented as a constant value for the duration of a content segment, or as a constant value for the time during which a particular configuration of audio devices is in use.
- the configuration of audio devices may correspond to audio device positions and/or audio device orientations.
- the Importance metric may be calculated on a variety of timescales depending on the particular implementation, e.g.:
- a track corresponds to a content segment such as a song or other musical content segment that may, for example, persist on a time scale of minutes;
- a control system may be configured to determine an Importance matrix, which may include all the importance information for a present system of audio devices.
- The Importance matrix may have dimension N×M, including an entry for each audio device and an entry for each potential echo reference channel.
- N represents the number of audio devices and M represents the number of potential echo references. Because some audio devices may play back more than one channel, this type of Importance matrix will not always be square.
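- As an illustration of the Importance matrix described above, the following minimal sketch builds an N×M matrix with placeholder values and ranks the candidate references for each device; the values and the ranking step are assumptions for illustration only.

```python
import numpy as np

N_DEVICES = 3      # N: audio devices in the present system
M_REFERENCES = 4   # M: potential echo reference channels (M may exceed N
                   # when some devices play back more than one channel)

# importance[n, m]: estimated importance of echo reference m to the EMS of
# device n. Placeholder values for illustration only.
importance = np.array([
    [0.90, 0.30, 0.20, 0.10],
    [0.25, 0.85, 0.15, 0.40],
    [0.20, 0.35, 0.80, 0.30],
])

# For each device, rank candidate references in decreasing importance.
ranking = np.argsort(-importance, axis=1)
print(ranking)  # e.g., device 0 would consider reference 0 first
```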
- the importance metric I may be based on one or more of the following:
- LUPA refers generally to echo reference characteristics from which the importance metric may be determined, including but not limited to one or more of L, U, P and/or A.
- the term “level” refers to the level within the digital representation of an audio signal, and not necessarily to the actual sound pressure level of the audio signal after being reproduced via a loudspeaker.
- the loudness of a single channel of echo reference may be based on a root mean square (RMS) metric or an LKFS (loudness, k-weighted, relative to full scale) metric.
- Such metrics are easily computed on the echo references in real-time, or may be present as metadata in a bitstream.
- L may be determined according to a volume setting, such as an audio system volume setting or a volume setting within a media application.
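- The level metric lends itself to a brief sketch. The following computes a per-frame RMS level in dB relative to full scale, one of the level metrics mentioned above; a full LKFS measurement (ITU-R BS.1770) would add K-weighting and gating, which are omitted here. The function name and frame size are illustrative assumptions.

```python
# Minimal sketch of a level (L) metric: per-frame RMS level in dBFS for one
# echo reference channel.
import numpy as np

def rms_level_dbfs(frame: np.ndarray, eps: float = 1e-12) -> float:
    """RMS level of one audio frame, in dB relative to full scale."""
    rms = np.sqrt(np.mean(np.square(frame)))
    return 20.0 * np.log10(rms + eps)

# Example: a 1 kHz sine frame at -12 dBFS RMS (48 kHz, 10 ms frame)
t = np.arange(480) / 48_000.0
frame = 10 ** (-12 / 20) * np.sqrt(2) * np.sin(2 * np.pi * 1000 * t)
print(round(rms_level_dbfs(frame), 1))  # approximately -12.0
```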
- the uniqueness aspect is intended to capture the amount of new information that a particular echo reference provides about an overall audio presentation.
- multichannel audio presentations often contain redundancy across channels. This redundancy may, for example, occur because instruments and other sound sources are replicated across channels on the left and right sides of a room, or as signals are panned and thus further replicated in multiple active loudspeakers at the same time. Even though such scenarios result in an over-specified problem for an EMS to solve (where echo filters may infer observations from multiple echo paths), some benefits and higher performance can nonetheless be observed in practice.
- U may be computed or estimated in various ways. In some examples U may be based, at least in part, on the correlation coefficient between each pair of echo references. In one such example, U may be estimated as follows:
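- The formula referenced above is not reproduced in this excerpt. As a minimal sketch, assuming that the uniqueness of each reference is taken as one minus its largest absolute correlation coefficient with any other reference (one plausible reading of the correlation-based approach, not necessarily the disclosed formula):

```python
import numpy as np

def uniqueness(refs: np.ndarray) -> np.ndarray:
    """refs: (num_references, num_samples). Returns U in [0, 1] per reference."""
    corr = np.corrcoef(refs)      # pairwise correlation coefficients
    np.fill_diagonal(corr, 0.0)   # ignore self-correlation
    return 1.0 - np.max(np.abs(corr), axis=1)

rng = np.random.default_rng(0)
a = rng.standard_normal(48_000)
refs = np.stack([
    a,
    0.9 * a + 0.1 * rng.standard_normal(48_000),  # near-duplicate of a
    rng.standard_normal(48_000),                  # independent reference
])
print(uniqueness(refs).round(2))  # the independent reference scores highest
```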
- U may be based, at least in part, on decomposition of audio signals to find redundancies.
- Some such examples may involve instantaneous frequency estimation, fundamental frequency (F0) estimation, spectrogram inversion and/or nonnegative matrix factorization (NMF).
- U may be based, at least in part, on data used for matrix decoding.
- Matrix decoding is an audio technology in which a small number of discrete audio channels (e.g., 2) are decoded into a larger number of channels on playback (e.g., 4 or 5).
- the channels are generally arranged for transmission or recording by an encoder, and decoded for playback by a decoder.
- Matrix decoding allows multichannel audio, such as surround sound, to be encoded in a stereo signal, to be played back as stereo on stereo equipment, and to be played back as surround on surround equipment.
- a static upmixing matrix could be applied to the stereo audio data in order to provide properly rendered audio for each of the loudspeakers in a Dolby 5.1 system.
- U may be based, at least in part, on the coefficients of an up-mixing or down-mixing matrix used to address each of the loudspeakers of an audio environment (e.g., each of the audio devices 110A-110D) with audio.
- U may be based, at least in part, on a standard canonical loudspeaker layout used in the audio environment (e.g., Dolby 5.1, Dolby 7.1, etc.)
- Some such examples may involve leveraging the way media content is traditionally mixed and presented in such a canonical loudspeaker layout.
- In a Dolby 5.1 or a Dolby 7.1 system, artists typically put vocals in the center channel, but not in the surround channels.
- audio corresponding to musical instruments and other sound sources is typically replicated across channels on the left and right sides of a room.
- vocals, dialogue, instrumental music, etc. may be identified via metadata received with the corresponding audio data.
- the persistence metric is intended to capture the aspect that different types of played-back media may have a wide range of temporal persistence, with different types of content having varying degrees of silence and loudspeaker activation.
- a continuous stream of spectrally dense content (such as music or the audio output of a video game console) may have a high level of temporal persistence, whereas podcasts may have a lower level of temporal persistence.
- Infrequent system notifications will have a very low level of temporal persistence.
- Echo references corresponding to media with a low degree of persistence may be less important for an EMS, depending on the specific listening task at hand. For instance, an occasional system notification is less likely to collide with a wake-word or barge-in request, and thus the relative importance of managing this echo is low.
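- A temporal persistence metric of the kind described above might, for example, be approximated as the fraction of recent frames in which a reference is active above a silence floor. The threshold and window in the following sketch are illustrative assumptions, not values from the disclosure.

```python
# Hypothetical sketch of a temporal persistence (P) metric: the fraction of
# recent frames in which the echo reference is active above a silence floor.
import numpy as np

def persistence(frame_levels_db: np.ndarray, silence_floor_db: float = -60.0) -> float:
    """frame_levels_db: per-frame levels (dBFS) over a recent window.
    Returns the fraction of frames with activity above the silence floor."""
    active = frame_levels_db > silence_floor_db
    return float(np.mean(active))

music = np.full(500, -20.0)                  # dense, continuous content
notification = np.full(500, -90.0)
notification[100:110] = -20.0                # one brief chime
print(persistence(music), persistence(notification))  # 1.0 vs 0.02
```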
- the audio content type may affect estimates of L, U and/or P. For example, knowing that the audio content is stereo music would allow the ranking of all of the echo references using just the channel assignment mentioned above. Alternatively, knowing that the audio content is Atmos could alter default L, U and/or P assumptions if the control system were not to analyze the audio content but instead to rely on the channel assignment.
- the audibility metric is directed to the facts that audio devices have different playback characteristics and may be located at varying distances from one another in any given audio environment. The following are examples of metrics that may be used to measure or estimate audio device audibility:
- a data structure that includes characteristics of one or more loudspeakers of the audio device, such as the rated SPL, frequency response and directivity (e.g., whether a loudspeaker is omnidirectional, front-firing, upward-firing, etc.);
- the listening objective may define the context and desired performance characteristics of the EMS.
- the listening objective may modify the parameters and/or the domain over which LUPA is evaluated.
- the following discussion will consider 3 potential contexts in which the listening objective changes. In these different contexts, we will see how Probability and Criticality can affect LUPA.
- Because the command recognition module may be relatively less robust than the wakeword detector, the criticality of echo leakage will generally be high.
- the way LUPA is evaluated may change.
- the temporal range over which a control system evaluates LUPA may be quite long in order to obtain better estimates of those parameters.
- the time interval over which a control system evaluates LUPA may be set to look relatively far into the future (e.g., over a time frame of minutes).
- LUPA may be evaluated over much shorter timescales than in the barge-in context, e.g., on the order of seconds.
- references that are temporally sparse and which have content playing within the next few seconds after wakeword detection will be considered much more important during this time interval, now that the likelihood of a collision is high.
- Figure 5A is a flow diagram that outlines one example of a disclosed method.
- The blocks of method 500, like those of other methods described herein, are not necessarily performed in the order indicated. In some examples, one or more blocks may be performed concurrently. Moreover, such methods may include more or fewer blocks than shown and/or described.
- some implementations may not include block 501.
- method 500 is an echo reference selection method.
- the blocks of method 500 may, for example, be performed by a control system, such as the control system 60a of Figure 2A or Figure 3A.
- the blocks of method 500 may be performed by an echo reference selector module, such as the echo reference selector 402A that is described above with reference to Figure 4.
- the reference selection method of Figure 5A is an example of what may be referred to herein as a “greedy” echo reference selection method, which involves evaluating the cost and expected performance increase only at the MC-EMS’s current operating point (in other words, how many references the MC-EMS is currently using, including the echo references that have been selected), and evaluating the results of adding each additional echo reference, e.g., in decreasing order of importance. Accordingly, this example involves a process of determining whether to add new echo references.
- the echo reference(s) being evaluated in method 500 may already have been ranked (e.g., by the echo reference importance estimator 401 A) according to estimated importance.
- block 501 involves determining whether or not a current performance level of an EMS is greater than or equal to a desired performance level. If so, the process terminates (block 510). However, if it is determined that the current performance level is less than a desired performance level, in this example the process continues to block 502.
- the determination of block 501 is based, at least in part, on one or more metrics indicating the current performance of the EMS, such as adaptive filter coefficient data or other AEC statistics, speech-to-echo (SER) ratio data, etc.
- this determination may be based, at least in part, on the one or more metrics 350A from the MC-EMS 203A. As noted above, some implementations may not include block 501.
- block 502 involves ranking the remaining unselected echo references by importance and estimating the potential EMS performance increase to be gained by including the most important echo reference that is not yet being used by the EMS.
- this process may be based, at least in part, on information 423A produced by the MC-EMS performance model 405A, which may in some examples be, or include, data such as shown in Figure 3B or Figure 3C.
- the ranking and predicting processes described above may be performed at an earlier phase of the method 500, e.g., when a previous echo reference was being evaluated. In some examples, the ranking and predicting processes described above may be performed before the method 500 is performed.
- block 502 may simply involve selecting the highest-ranking unselected echo reference as determined by such a previous process.
- block 503 involves comparing the performance and cost of adding the echo reference selected in block 502.
- block 503 may be based, at least in part, on information 422A from the cost estimation module 403A indicating the cost(s) of including an echo reference in the set of echo references 313A.
- Because performance and cost may be variables having different ranges and/or domains, it may be challenging to compare these variables directly. Therefore, in some implementations the evaluation of block 503 may be facilitated by mapping the performance and cost variables to a similar scale, such as a range between predefined minimum and maximum values.
- the cost of adding the echo reference being evaluated may simply be set to zero if adding the echo reference would not cause a predetermined network bandwidth and/or computational cost budget to be exceeded. In some such examples, the cost of adding the echo reference being evaluated may be set to be infinite if adding the echo reference would cause a predetermined network bandwidth and/or computational cost budget to be exceeded. Such examples have the benefits of simplicity and efficiency. In this manner, the control system may simply add the maximum number of echo references that the predetermined network bandwidth and/or computational cost budget will allow.
- the estimated performance increase corresponding with adding an echo reference may be set to zero if the estimated performance increase is not above a predetermined threshold (e.g., 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, etc.).
- block 504 involves determining whether or not the new echo reference will be added, given the performance/cost evaluation of block 503.
- blocks 503 and 504 may be combined into a single block.
- block 504 involves determining whether the cost of adding the echo reference being evaluated would be less than the EMS performance increase that is estimated to be caused by adding the echo reference. In this example, if the estimated cost would not be less than the estimated performance increase, the process continues to block 511 and method 500 terminates. However, in this implementation, if the estimated cost would be less than the estimated performance increase, the process continues to block 505.
- block 505 involves adding the new echo reference to the set of selected echo references.
- block 505 may include informing the renderer 201 to output the relevant echo reference.
- block 505 may involve sending the echo reference over the local network or sending a command 311 to another device to send the echo reference over the local network.
- the echo references evaluated in method 500 may be either local or non-local echo references, the latter of which may be determined locally (e.g., by a local renderer as described above) or received over a local network. Accordingly, the cost estimation for some echo references may involve evaluating both computational and network costs.
- the control system may simply reset the selected and unselected echo references and revert to a previous block of Figure 5A, such as block 501, block 502 or block 503.
- more elaborate methods also may involve evaluating references that have already been chosen, e.g., ranking all of the references that have already been chosen and deciding whether or not to drop the echo reference with the lowest estimated importance.
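- For illustration, the following is a minimal sketch of the greedy selection loop of Figure 5A under the simplified budget rule described above (a candidate's cost treated as zero while a budget holds and as infinite once it would be exceeded). The class, function names and numbers are illustrative assumptions, not the disclosed implementation.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    importance: float   # estimated importance (ranking key)
    perf_gain: float    # modeled EMS performance increase if added
    cost: float         # network + computational cost units

def greedy_select(candidates, cost_budget, min_gain=0.01):
    selected, spent = [], 0.0
    # Block 502: consider unselected references in decreasing importance.
    for cand in sorted(candidates, key=lambda c: -c.importance):
        gain = cand.perf_gain if cand.perf_gain > min_gain else 0.0
        cost = 0.0 if spent + cand.cost <= cost_budget else float("inf")
        # Blocks 503/504: add only if the (mapped) cost is below the gain.
        if cost < gain:
            selected.append(cand.name)      # block 505: add the reference
            spent += cand.cost
        else:
            break                           # block 511: terminate
    return selected

cands = [Candidate("local", 0.9, 0.40, 0.0),
         Candidate("tv", 0.6, 0.20, 3.0),
         Candidate("far speaker", 0.2, 0.005, 2.0)]
print(greedy_select(cands, cost_budget=5.0))  # ['local', 'tv']
```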
- An echo reference may be transmitted (or used locally within a device, such as a device that produces all of the echo references) in a number of forms or variants, which may alter the cost/benefit ratio of that particular echo reference. For example, it is possible to reduce the cost of sending an echo reference across the local network if we transform the echo reference into a banded power form (in other words, determining the power in each of a plurality of frequency bands and transmitting banded power information about the power in each frequency band). However, the potential improvement that could be obtained by an EMS using a lower-fidelity variant of an echo reference will generally also be lower. The choice to make any particular variant of the echo reference available can be accounted for by making it a potential candidate for selection.
- an echo reference may be in one of the following forms, which are listed below (the first four of which are in an estimated order of decreasing performance):
- A full-fidelity (original, exact) echo reference, which will incur full computational and network (if transported over the network) costs;
- A downsampled echo reference, for which computational and network costs will be proportionately decreased according to the downsampling factor, but which will incur a computational cost for the downsampling process;
- An encoded echo reference produced via a lossy encoding process, for which network costs will be decreased according to the compression ratio, but which will incur computational costs for encoding and decoding; or
- Banded power information corresponding to an echo reference, for which network cost may be decreased significantly because the number of bands may be much lower than the number of subbands of the full-fidelity echo reference and for which the computational cost may be decreased significantly because the cost of implementing a banded AES is much less than the cost of implementing a subband AEC (a minimal sketch of this banded power form appears after this list).
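- As noted in the final list item above, banded power information can stand in for a full echo reference at much lower cost. The following is a minimal sketch of computing such a banded power form; the band edges and frame size are illustrative assumptions.

```python
# Hypothetical sketch of the banded power form: power in a small number of
# frequency bands per frame, instead of the full waveform.
import numpy as np

def banded_power(frame: np.ndarray, sample_rate: int, band_edges_hz) -> np.ndarray:
    """Power of one frame in each band defined by consecutive edge pairs."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    power = np.abs(spectrum) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return np.array([power[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:])])

frame = np.random.default_rng(0).standard_normal(1024)
edges = [0, 200, 500, 1000, 2000, 4000, 8000, 16000, 24000]  # 8 bands
bp = banded_power(frame, 48_000, edges)
print(len(bp), "band powers replace", len(frame), "samples per frame")
```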
- Figure 5B is a flow diagram that outlines another example of a disclosed method.
- The blocks of method 550, like those of other methods described herein, are not necessarily performed in the order indicated. In some examples, one or more blocks may be performed concurrently. Moreover, such methods may include more or fewer blocks than shown and/or described.
- the blocks of method 550 may, for example, be performed by a control system, such as the control system 60a of Figure 2A or Figure 3 A.
- the blocks of method 550 may be performed by an echo reference selector module, such as the echo reference selector 402A that is described above with reference to Figure 4.
- Method 550 takes into account the fact that echo references may not necessarily be transmitted or used in a full-fidelity form, but instead may be in one of the above-described alternative partial-fidelity forms. Therefore, in method 550 the evaluation of performance and cost does not involve a binary decision as to whether an echo reference in a full-fidelity form will or will not be used. Instead, method 550 involves determining whether to include one or more lower-fidelity versions of an echo reference, which may provide a smaller increase in EMS performance, but at a lower cost. Methods such as method 550 provide additional flexibility in the potential set of echo references to be used by the echo management system.
- method 550 is an extension of the echo reference selection method 500 that is described above with reference to Figure 5A. Accordingly, blocks 501 (if included), 502, 503, 504 and 505 may be performed as described above with reference to Figure 5A, unless noted to the contrary below.
- Method 550 adds the potentially iterative loop that includes blocks 506 and 507 to method 500.
- If it is determined (here, in block 504) that the estimated cost of adding one version of an echo reference will not be less than the estimated EMS performance increase, it is determined in block 506 whether there is another version of the echo reference. In some examples, a full-fidelity version of the echo reference may be evaluated before a lower-fidelity version (if any are available). According to this implementation, if it is determined in block 506 that another version of the echo reference is available, another version of the echo reference (e.g., the highest-fidelity version that is not the full-fidelity version) will be selected in block 507 and evaluated in block 503.
- method 550 involves evaluating lower-fidelity versions of an echo reference, if any are available.
- Such lower-fidelity versions may include a downsampled version of the echo reference, an encoded version of the echo reference produced via a lossy encoding process and/or banded power information corresponding to the echo reference.
- The cost of an echo reference refers to the resources required to utilize the reference for the purposes of echo management, whether that be with an AEC or an AES. Some disclosed implementations may involve estimating one or more of the following types of costs:
- Computational cost, which may be determined with reference to the use of a limited amount of processing power available on one or more of the devices in an audio environment. Computational cost may refer to one or more of the following:
- o The cost required to perform echo management on a particular listening device using the reference. This may refer to the use of the reference in an AEC or an AES. An AEC operates on bins or subbands (which are complex numbers) and requires significantly more CPU operations than an AES, which operates on bands (of which there are fewer than the bins/subbands used by an AEC, and for which the band powers are real numbers, not complex numbers);
- o The cost required to encode or decode the echo reference if coded references are being used;
- o The cost required to band a signal (in other words, transforming the signal from a simple linear frequency domain representation to a banded frequency domain representation); and/or
- o The cost required to produce the echo reference (e.g., by a renderer).
- Network cost, which refers to the use of a limited amount of network resources, such as the bandwidth available in the local network (e.g., the local wireless network in the audio environment) used for sharing echo references amongst devices.
- the total cost of a particular set of echo references may be determined as the sum of the cost of each echo reference in the set. Some disclosed examples involve combining both the network and computational costs. According to some examples, the total cost $C_{total}$ may be determined as follows:

$$C_{total} = \max\left(\frac{1}{R_{comp}}\sum_{m=1}^{M} C_{m}^{comp},\ \frac{1}{R_{network}}\sum_{m=1}^{M} C_{m}^{network}\right)$$

- where $R_{comp}$ represents the total amount of computational resources available for the purposes of echo management, $R_{network}$ represents the total amount of network resources available for the purposes of echo management, $C_{m}^{comp}$ represents the computational cost associated with using the m-th reference, and $C_{m}^{network}$ represents the network cost associated with using the m-th reference (where there are a total of M references used in the EMS).
- In this formulation, $C_{total}$ includes only the cost components that are closest to becoming bounded by the resources available to the system.
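- The max() form of the equation above is reconstructed from the surrounding description (only the cost components closest to their resource bounds contribute). Under that assumption, a minimal sketch:

```python
# Minimal sketch of the reconstructed total-cost rule: normalize each cost
# type by its available resource and keep the component closest to its bound.
def total_cost(comp_costs, network_costs, r_comp, r_network):
    """comp_costs, network_costs: per-reference costs for the M references."""
    comp_utilization = sum(comp_costs) / r_comp
    network_utilization = sum(network_costs) / r_network
    return max(comp_utilization, network_utilization)

# Illustrative values: here the network is the binding resource (0.9 > 0.3).
print(total_cost(comp_costs=[1.0, 2.0], network_costs=[0.5, 4.0],
                 r_comp=10.0, r_network=5.0))  # 0.9
```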
- the “performance” of an echo management system may refer to the following:
- the amount of echo that is removed from the microphone feed, which may be measured in echo return loss enhancement (ERLE); ERLE is measured in decibels and is the ratio of send-in power to the power of a residual error signal.
- This metric can be normalized, e.g., according to an application-based metric such as the minimal ERLE required in order to support an Automatic Speech Recognition (ASR) processor performing a wakeword detection task, wherein a particular keyword uttered in the presence of echo is detected;
- The ability of the EMS to track changes in the rendered audio scene, which may refer to shifts in an echo reference covariance matrix and the robustness of the EMS to a non-stationary non-uniqueness problem.
- Some examples may involve determining a single performance metric P. Some such examples use the ERLE and the robustness estimated from adaptive filter coefficient data or other AEC statistics obtained from the EMS. According to some such examples, a performance robustness metric $P_{Rob}$ may be determined using the “microphone probability” extracted from an AEC, e.g., as follows:

$$P_{Rob} = 1 - M_{prob}$$

- where $0 \le P_{Rob} \le 1$, $0 \le M_{prob} \le 1$, and $M_{prob}$ represents the microphone probability, which is the proportion of the subband adaptive filters in the AEC that produce poor echo predictions that do not provide substantial (or any) echo cancellation in their respective subbands.
- the performance of a wakeword (WW) detector is strongly dependent on the speech to echo ratio (SER), which is proportionately improved by the ERLE of the EMS.
- the WW detector is more likely to both trigger falsely (a false alarm) and miss keywords uttered by the user (a missed detection) due to the echo corrupting the microphone signal and decreasing the accuracy of the system.
- Improving the SER of the residual signal (e.g., the residual signal 224A of Figure 2A) will generally improve the performance of the ASR processor (e.g., the speech processing block 240A of Figure 2A).
- some disclosed examples involve mapping a desired WW performance level to a nominal SER level, which in turn, in conjunction with knowledge of the typical playback levels of the devices in a system, allows a control system to map this desired WW performance level to a nominal ERLE directly.
- this method may be extended to map the WW performance of a system at various SER levels to the ERLE.
- the receiver operating characteristic (ROC) curve of a particular WW detector can be produced using input data with a range of SER values.
- Some examples involve choosing a particular false alarm rate (FAR) of interest and taking the accuracy of the WW detector as a function of the SER for this particular FAR as our application basis. In some such examples,

$$Acc(SER_{res}) = ROC(SER_{res}, FAR_{I})$$

- where $Acc(SER_{res})$ represents the accuracy of the WW detector as a function of $SER_{res}$, which represents the SER of the residual signal output by the EMS, $ROC()$ represents a collection of ROC curves for multiple SERs, and $FAR_{I}$ represents the false alarm rate of interest, of which typical values may be 3 per 24 hours or 1 per 10 hours.
- the accuracy $Acc(SER_{res})$ may be represented as a percentage or normalized such that it is in the range from 0 to 1, which may be expressed as follows:

$$0 \le Acc(SER_{res}) \le 1$$
- LUPA components for, e.g., the actual echo level and speech levels typical of the target audio environments can be combined to determine typical SER values in the microphone signal (e.g., the microphone signal 223A of Figure 2A), e.g., as follows:

$$SER_{mic} = \frac{Speech_{pwr}}{Echo_{pwr}}$$

- where $Speech_{pwr}$ and $Echo_{pwr}$ represent the expected baseline speech power level and the echo power level of the targeted audio environment, respectively.
- the $SER_{mic}$ can be improved to $SER_{res}$ proportionately to the ERLE, e.g., as follows:

$$SER_{res,dB} = SER_{mic,dB} + ERLE_{dB}$$

- where the subscript dB indicates that the variables are represented in decibels in this example. For example, an $SER_{mic}$ of -10 dB combined with 20 dB of ERLE would yield an $SER_{res}$ of 10 dB.
- some implementations may define the ERLE of the EMS as follows:

$$ERLE_{dB} = 10\log_{10}\left(\frac{P_{mic}}{P_{res}}\right)$$

- where $P_{mic}$ represents the send-in (microphone) signal power and $P_{res}$ represents the power of the residual error signal output by the EMS.
- some implementations may define a WW application-based EMS performance metric as follows:

$$P_{WW} = Acc\left(SER_{mic,dB} + ERLE_{dB}\right)$$

- where $SER_{mic}$ is representative of the SER in the target environment. In some examples $SER_{mic}$ may be a static default number, whereas in other examples $SER_{mic}$ may be estimated, e.g., as a function of one or more LUPA components. Some implementations may involve defining a net performance metric P as a vector containing each element, e.g., as follows:

$$P = [P_{WW}, P_{Rob}]$$
- one or more additional performance components may be added by increasing the size of the net performance vector.
- one or more additional performance components may be combined into a single scalar metric by weighting them, e.g., as follows:

$$P = K \cdot P_{WW} + (1 - K) \cdot P_{Rob}$$

- where K represents a weighting factor, chosen by the system designer, which is used to determine how much each component contributes to the net performance.
- Some alternative examples may use another method, e.g., simply averaging individual performance metrics. However, it may be advantageous to combine the individual performance metrics into a single scalar one.
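- As a brief illustration of the weighted combination described above, with illustrative values assumed for $P_{WW}$, $P_{Rob}$ and K:

```python
# Minimal sketch of combining the performance components into one scalar,
# per the weighting described above (K chosen by the system designer).
def net_performance(p_ww: float, p_rob: float, k: float = 0.7) -> float:
    """Weighted combination of WW-application and robustness metrics."""
    return k * p_ww + (1.0 - k) * p_rob

print(net_performance(p_ww=0.92, p_rob=0.80))  # 0.884
```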
- When comparing the estimated cost and the estimated EMS performance enhancement for an echo reference, a method needs to somehow compare these two parameters, which will not normally be in the same domain.
- One such method involves evaluating the cost and performance estimates individually and taking the lowest-cost solution that meets a predefined minimum performance criterion, $P_{min}$.
- This predefined EMS performance criterion may, for example, be determined according to the requirements of a specific downstream application (e.g., providing a telephone call, music playback, awaiting a WW, etc.).
- the performance may relate to a WW performance metric $P_{WW}$. For example, there may be some minimum level of WW detector accuracy that is deemed sufficient (e.g., an 80% level of WW detector accuracy, an 85% level of WW detector accuracy, a 90% level of WW detector accuracy, a 95% level of WW detector accuracy, etc.), which would have a corresponding $ERLE_{dB}$ as per the previous section.
- the ERLE of the EMS may be estimated using the EMS performance model (e.g., the MC-EMS performance model 405A of Figure 4).
- some implementations may involve using both a performance metric P and a cost metric C.
- Some such examples may involve using a tradeoff parameter $\lambda$ (e.g., a Lagrange multiplier), and formulating the cost/performance evaluation process as an optimization problem which seeks to maximize a quantity, such as the variable F in the following expression:

$$F = P - \lambda\, C_{total}$$

- a relatively larger value of F corresponds with a relatively larger difference between the performance metric P and the product of $\lambda$ and the total cost $C_{total}$.
- the tradeoff parameter $\lambda$ may be chosen (e.g., by the system designer) in order to directly trade off cost and performance.
- the solution for the set of echo references used by the EMS may then be found using an optimization algorithm wherein a set of echo references (which may include all available echo reference fidelity levels) determines the search space.
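- For illustration, the following sketch exhaustively searches small sets of candidate references (each at one fidelity level) and keeps the set maximizing $F = P - \lambda C_{total}$. The additive performance model, the candidate values and the exhaustive search are illustrative assumptions; a real system would use the MC-EMS performance model and a more efficient search.

```python
from itertools import combinations

# (name, modeled performance contribution, total cost contribution) —
# illustrative values; "name/variant" encodes the fidelity level.
candidates = [("local/full", 0.40, 0.10),
              ("tv/full", 0.25, 0.50),
              ("tv/banded", 0.15, 0.05),
              ("soundbar/full", 0.20, 0.40)]

def evaluate(subset, lam):
    perf = sum(p for _, p, _ in subset)   # stand-in additive performance model
    cost = sum(c for _, _, c in subset)
    return perf - lam * cost              # F = P - lambda * C_total

def best_set(candidates, lam=0.5):
    best, best_f = (), float("-inf")
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            # Skip sets containing two fidelity variants of one reference.
            names = [n.split("/")[0] for n, _, _ in subset]
            if len(names) != len(set(names)):
                continue
            f = evaluate(subset, lam)
            if f > best_f:
                best, best_f = subset, f
    return [n for n, _, _ in best], best_f

print(best_set(candidates))
```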
- Figure 6 is a flow diagram that outlines one example of a disclosed method.
- The blocks of method 600, like those of other methods described herein, are not necessarily performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described. In some examples, two or more blocks may be performed concurrently.
- method 600 is an audio processing method.
- the method 600 may be performed by an apparatus or system, such as the apparatus 50 that is shown in Figure 1A and described above.
- blocks of method 600 may be performed by one or more devices within an audio environment, e.g., by an audio system controller (such as what is referred to herein as a smart home hub) or by another component of an audio system, such as a smart speaker, a television, a television control module, a laptop computer, a mobile device (such as a cellular telephone), etc.
- the audio environment may include one or more rooms of a home environment.
- the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, etc.
- at least some blocks of the method 600 may be performed by a device that implements a cloud- based service, such as a server.
- block 605 involves obtaining, by a control system, a plurality of echo references.
- the plurality of echo references includes at least one echo reference for each audio device of a plurality of audio devices in an audio environment.
- each echo reference corresponds to audio data being played back by one or more loudspeakers of one audio device of the plurality of audio devices.
- block 610 involves making, by the control system, an importance estimation for each echo reference of the plurality of echo references.
- making the importance estimation involves determining an expected contribution of each echo reference to mitigation of echo by at least one echo management system of at least one audio device of the audio environment.
- the at least one echo management system includes an acoustic echo canceller (AEC) and/or an acoustic echo suppressor (AES).
- block 615 involves selecting, by the control system and based at least in part on the importance estimation, one or more selected echo references.
- block 620 involves providing, by the control system, the one or more selected echo references to the at least one echo management system.
- method 600 may involve causing the at least one echo management system to cancel or suppress echoes based, at least in part, on the one or more selected echo references.
- obtaining the plurality of echo references may involve receiving a content stream that includes audio data and determining one or more echo references of the plurality of echo references based on the audio data.
- control system may include an audio device control system of an audio device of the plurality of audio devices in the audio environment.
- the method may involve rendering, by the audio device control system, the audio data for reproduction on the audio device to produce local speaker feed signals.
- the method may involve determining a local echo reference that corresponds with the local speaker feed signals.
- obtaining the plurality of echo references may involve determining one or more non-local echo references based on the audio data.
- Each of the non-local echo references may, for example, correspond to non-local speaker feed signals for playback on another audio device of the audio environment.
- obtaining the plurality of echo references may involve receiving one or more non-local echo references.
- Each of the non-local echo references may, for example, correspond to non-local speaker feed signals for playback on another audio device of the audio environment.
- receiving the one or more non-local echo references may involve receiving the one or more non-local echo references from one or more other audio devices of the audio environment.
- receiving the one or more non-local echo references may involve receiving each of the one or more non-local echo references from a single other device of the audio environment.
- the method may involve a cost determination. According to some such examples, the cost determination may involve determining a cost for at least one echo reference of the plurality of echo references.
- selecting the one or more selected echo references may be based, at least in part, on the cost determination.
- the cost determination may be based, at least in part, on the network bandwidth required for transmitting the at least one echo reference, an encoding computational requirement for encoding the at least one echo reference, a decoding computational requirement for decoding the at least one echo reference, an echo management system computational requirement for use of the at least one echo reference by the echo management system, or one or more combinations thereof.
- the cost determination may be based, at least in part, on a full-fidelity replica of the at least one echo reference in a time domain or a frequency domain, on a downsampled version of the at least one echo reference, on a lossy compression of the at least one echo reference, on banded power information for the at least one echo reference, or one or more combinations thereof.
- the cost determination may be based, at least in part, on a method of compressing a relatively more important echo reference less than a relatively less important echo reference.
- the method may involve determining a current echo management system performance level.
- selecting the one or more selected echo references may be based, at least in part, on the current echo management system performance level.
- making the importance estimation may involve determining an importance metric for a corresponding echo reference.
- determining the importance metric may involve determining a level of the corresponding echo reference, determining a uniqueness of the corresponding echo reference, determining a temporal persistence of the corresponding echo reference, determining an audibility of the corresponding echo reference, or one or more combinations thereof.
- determining the importance metric may be based, at least in part, on metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmixing matrix, a matrix of loudspeaker activations, or one or more combinations thereof.
- determining the importance metric may be based, at least in part, on a current listening objective, a current ambient noise estimate, an estimate of a current performance of the at least one echo management system, or one or more combinations thereof.
- Some disclosed implementations address the challenge of requiring the other (“non-local”) devices’ playback references for each local echo management system (EMS).
- the bandwidth required for transmitting echo references to all the participating audio devices in an audio environment can be significant. Such bandwidth requirements may be prohibitive if the number of audio devices is large and if the transmitted echo references are full-fidelity replicas of the speaker feed signals provided to the loudspeakers.
- the computational resources required to implement such methods and systems, including but not limited to computational resources for implementing the non-local devices’ postprocessing, may also be significant.
- transmitting all the playback streams to all the participating audio devices in an audio environment may not be necessary or even desirable for some implementations. This is true in part because the amount of echo in audio devices heavily depends on the content, the listening objective(s) and the audio device configurations.
- the importance metric may be based, at least in part, on metadata (e.g., one or more components of the metadata 312 described above), such as metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data (such as a spatiality index), an upmixing matrix, a matrix of loudspeaker activations (which may also be referred to herein as a “rendering matrix”), or one or more combinations thereof.
- U, the “uniqueness” aspect of the “LUPA” echo reference characteristics from which the importance metric may be determined, may be based at least in part on data used for matrix decoding, such as a static upmixing matrix.
- Figures 7 et seq. and the corresponding descriptions elaborate on such alternative approaches to computing importance metrics based on metadata, including but not limited to rendering information.
- Some disclosed examples describe echo references that are generated using such metadata. Such implementations may significantly reduce the computational and bandwidth requirements of echo management, at least in part because many related metrics can be precomputed and encoded in an efficient manner.
- Figures 7 and 8 show block diagrams that include components of echo reference orchestrators according to some alternative examples. As with other figures provided herein, the types, numbers and arrangement of elements shown in Figures 7 and 8 are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and/or arrangements of elements.
- The cost estimation modules 403 themselves are optional in the implementations of Figures 7 and 8.
- Audio data, which may include audio signals corresponding to audio objects (such as pulse code modulated (PCM) data), audio bed signals corresponding to loudspeaker locations, etc.;
- 702: Audio object metadata, which may include audio object spatial metadata, audio object size metadata, etc.
- the audio object metadata 702 may be received as a component of the metadata 312 that is disclosed elsewhere herein;
- 703: Audio scene change metadata (such as spatial renderer scene change metadata) indicating upcoming changes in an audio scene, such as changes that will take place within a determined time interval (e.g., within the next second, within the next 100 milliseconds, etc.), which may be used by the audio scene change analyzer 755 to estimate sound field changes.
- the audio scene change metadata 703 may include aggregated audio object statistics computed from the audio object metadata 702.
- the scene change metadata may include one or more indications (e.g., in a designated part of an audio data structure, such as a header portion) that are selectable (e.g., by a content creator) for indicating changes in an audio scene.
- the audio scene change metadata 703 may be received as a component of the metadata 312 that is disclosed elsewhere herein;
- a metadata-based metric computation module which in this example is configured to compute echo reference characteristics 733 from which the importance metrics 420 may be determined, based on the metadata 312 and the audio scene change messages 715.
- the metadata 312 may include the audio object metadata 702 and/or rendering information, such as information regarding the rendering matrix 722 (or the rendering matrix 722 itself).
- the echo reference characteristics 733 may, in some examples, include approximations of L, U, P and/or A as described above;
- An echo reference generator which in this example is configured to generate one or more local audio device echo references 220 and non-local audio device echo references 721 for the MC-EMS 203, based on the rendered audio streams 720, the metadata 312, the selected echo references 313, the EMS statistics 350, and/or the information 423 produced by the MC-EMS performance model 405.
- the echo reference generator 710 may be configured to produce virtual echo references 742.
- the renderer 201 may be configured to produce virtual echo references 742.
- the echo reference generator 710 is configured to generate subspace-based non-local device echo references 723 and/or low-frequency device echo references 723LF.
- low-frequency non-local device echo references 723LF may be regarded as a subset of the subspace-based non-local device echo references 723.
- the echo reference generator 710 may be configured to customize the echo references for each audio device.
- the echo reference generator 710 may also use EMS look-ahead statistics 732 and/or the audio scene change messages 715 as input.
- Although Figures 7 and 8 indicate that the MC-EMS 203 receives all of the outputs of the echo reference generator 710, in some implementations the MC-EMS 203 only receives the selected echo references 313, as shown in Figure 3A. According to some such implementations, all echo references generated by the echo reference generator 710 may be provided to blocks 401 and 402 (and in some instances, to block 705) and only the selected echo references 313 may be provided to the MC-EMS 203;
- • 715 One or more scene change messages from the scene change analyzer 755;
- Rendering information which in these examples includes one or more rendering matrices
- the illustrated blocks are implemented by one or more instances of the control system 60 that is disclosed herein (see, e.g., Figure 1A and the associated description).
- all of the blocks of the control system 60 that are shown in Figure 7 may be implemented via a single device (such as an audio device).
- each of a plurality of audio devices in an audio environment may implement all of the blocks of the control system 60 that are shown in Figure 7, as well as other features (such as a loudspeaker system, a microphone system, a noise estimator and/or a speech processing block).
- the non-local device references may be locally generated and added to the rendered audio stream 720.
- the non-local device references may take the form of one or more virtual audio device references, or a set of device-specific, non-local device streams for the devices selected by the reference selection block.
- each audio device may render its local reference only, and use a local network to exchange echo references (e.g., as described above with reference to Figures 1-6).
- the optional elements 311 and 314 of Figure 7 may be included in the signals transmitted and received by each audio device.
- such elements would be necessary only if the audio device is severely limited on computational power, or if the audio device does not have non-local device signal chain parameters available for the local generation of non-local echo references.
- the blocks that are shown in Figure 7 may be implemented via two or more devices.
- the echo reference orchestrator 302 of Figure 7 may be implemented (at least in part) by an orchestrating device, such as the smart home hub 105 disclosed herein, or via an audio device 110 that is configured to function as an orchestrating device.
- the echo reference orchestrator 302 of Figure 7 may be implemented (at least in part) via a cloud-based service, e.g., via one or more servers.
- the renderer 201 and a portion of the echo reference orchestrator 302 are implemented by a hub device 805, whereas the MC-EMS 203 and other portions of the echo reference orchestrator 302 are implemented by audio devices 110 (only one of which is illustrated in Figure 8).
- This type of implementation may be referred to herein as a “hub and spoke” model.
- the “hub” may be a smart television (TV) and the audio devices 110 may be a set of wireless loudspeakers that are configured for communication with the smart TV.
- the hub device 805 may be a smart home hub 105 as disclosed elsewhere herein.
- the hub device 805 may be one of the audio devices 110, such as an audio device 110 that has greater computational abilities than the other audio devices 110.
- the portion of the echo reference orchestrator 302 that is implemented by the hub device 805 may be implemented by one or more servers.
- both the renderer 201 and the echo reference generator 710 reside in the hub device 805.
- each audio device 110 receives rendered audio data for playback from the hub device 805.
- rendered audio data for playback includes the local echo reference 220.
- the non-local audio device references may be rendered at the hub device 805 as one or more single virtual non-local device references, as device-specific echo references (e.g., as described above with reference to Figures 1-6), or combinations thereof.
- the hub device 805 is provided with the required information for producing echo references, such as rendering information (which may include rendering matrix information), audio device-specific information (such as audio device capability information), spatial metadata, etc.
- the local echo reference 220 may, in some alternative examples, be created in each audio device 110.
- a main component of the rendering metadata set is the rendering matrix (722) for the given audio device configuration.
- the rendering matrix defines the audio device configuration’s spatial-frequency response to any audio object in the encoded audio-stream.
- the audio environment (e.g., a room within which the audio devices reside) is first discretized to [n_x, n_y, n_z] points, and a rendering filter is designed for each spatial point, for each device.
- the rendering filter may be defined in a frequency bin domain, with all filters having n_bin taps.
- the rendering matrix is an N × n_x × n_y × n_z set of n_bin-length filters (with N corresponding to the number of audio devices).
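- As a concrete illustration of this data layout, the following sketch represents the rendering matrix as a five-dimensional array and looks up one rendering filter; all of the dimension values are hypothetical and not prescribed by this disclosure:

```python
import numpy as np

# Hypothetical dimensions (illustrative only): 5 audio devices, an
# 8 x 8 x 4 spatial grid, and rendering filters with 64 frequency bins.
N, nx, ny, nz, n_bin = 5, 8, 8, 4, 64

# One n_bin-length rendering filter per (device, grid point) pair.
rendering_matrix = np.zeros((N, nx, ny, nz, n_bin))

# The rendering filter that device 2 applies for a sound source at
# grid point (3, 4, 1):
filt = rendering_matrix[2, 3, 4, 1]  # shape (n_bin,)
```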
- this information (702) is available for each audio object in an audio object metadata file provided to the renderer (for example, as an Atmos .prm file).
- An ideal rendering system would achieve this with high accuracy.
- the audio object vector may be approximated by a weighted average of values corresponding to the closest grid points of a rendering matrix, and the subset of rendering filters activated for these points may be used to render the sound source at that location.
- the rendering matrix may act as a spatial transfer function, defined on each device and each point on a spatial grid.
- the rendering matrix includes information regarding how audible each audio device is to each other audio device (which may be referred to herein as “mutual audibility”).
- Although the rendering matrix 722 contains this information, it is desirable to compute an audibility metric that can be readily consumed by the echo reference importance estimator 401.
- the metadata-based metric computation block 705 is configured to perform this computation, and the echo reference characteristics 733 include these metrics.
- Figure 9A shows an example of a graph that shows locations of a listener and audio devices in an audio environment.
- the audio environment is a room.
- the vertical axis of graph 900 indicates the y coordinates (width) of the room in meters and the horizontal axis indicates the x coordinates (length) of the room in meters.
- the listener L is positioned at the center of the audio environment, at the origin of graph 900 (location (0,0)) and audio devices 1, 2, 3, 4 and 5, as well as a subwoofer S, are positioned at various points along a circle that is one meter from listener L.
- the audio devices 1-5 are all of the same type and have identical, or substantially identical, audio device characteristics (e.g., loudspeaker numbers, types and capabilities).
- audio device characteristics e.g., loudspeaker numbers, types and capabilities.
- Other audio environments may include different numbers, types and/or arrangements of audio devices, listener(s), etc.
- Figure 9B shows examples of graphs corresponding to a rendering matrix for each of the audio devices shown in Figure 9A.
- graph 905a corresponds to audio device 1
- graph 905b corresponds to audio device 2
- graph 905c corresponds to audio device 3
- graph 905d corresponds to audio device 4
- graph 905e corresponds to audio device 5.
- graphs 905a-905e show each audio device’s rendering matrix cross section after averaging across the z and frequency dimensions.
- the x,y plane of the audio environment has been divided into 64 equal areas, each having a side length of 0.5 meters, and only one loudspeaker activation value is represented for each of the 64 areas. Such areas may be referred to herein as “spatial tiles.”
- a loudspeaker activation value is analogous to a total broadband gain for playback of each corresponding audio device.
- the rendering matrix for each device contains all information needed to estimate the spatial realization of an audio object (such as an Atmos audio object) for the audio device configuration shown in Figure 9A. In other words, one can use the rendering matrix information to estimate what percentage of an audio object will be rendered in each device, and how similar the device channels will be.
- an audio object such as an Atmos audio object
- One simple implementation involves computing a device-wise rendering matrix covariance matrix and using the covariance matrix as a proxy for covariance of the resultant speaker feeds. We refer to this herein as an “uninformed rendering covariance matrix” or an “uninformed rendering correlation matrix.”
- the rendering matrix itself contains spatial information from which the inter-device audibility can be estimated. Even in its simplest form, one can use the uninformed rendering correlation matrix to obtain audibility rankings of each device as heard from every other device. Moreover, a complete uninformed rendering correlation matrix will also contain information about how this audibility varies in frequency.
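- A minimal sketch of one way such an uninformed rendering correlation matrix could be computed follows, assuming the rendering matrix is stored as the (N × n_x × n_y × n_z × n_bin) array sketched above; the function name and array layout are illustrative assumptions:

```python
import numpy as np

def uninformed_rendering_correlation(rendering_matrix):
    """Correlate each device's flattened rendering filters with every
    other device's, as a proxy for the covariance/correlation of the
    resulting speaker feeds.

    rendering_matrix: array of shape (N, nx, ny, nz, n_bin).
    Returns an (N, N) correlation matrix.
    """
    N = rendering_matrix.shape[0]
    # Flatten each device's spatial-frequency response into one vector;
    # np.corrcoef treats each row as one variable's observations.
    return np.corrcoef(rendering_matrix.reshape(N, -1))
```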
- some implementations involve transforming audio object spatial metadata (which may be a component of the audio object metadata 702) into a metric that may be readily consumed by the echo reference importance estimator 401.
- the metadata-based metric computation module 705 may be configured to make such transformations.
- the importance metric I may be based on one or more of the following:
- the acronym “LUPA” refers generally to echo reference characteristics from which the importance metric may be determined, including but not limited to one or more of L, U, P and/or A.
- the rendering matrix includes audibility information, which is the “A” component of LUPA.
- Other LUPA parameters may be estimated based on the rendering matrix and spatial data. Some implementations estimate LUPA parameters by determining a statistic based on aggregate spatial data that is highly correlated with one or more LUPA parameters.
- the audio object spatial metadata indicates the spatio-temporal distribution of each audio source in the received audio data bit stream. Some implementations involve computing the amount of time an audio object is present in each spatial grid tile. Some such implementations involve producing 3-D heatmaps of “counts” for each audio object channel.
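- A minimal sketch of such a count computation follows, assuming object coordinates normalized to the unit cube and a hypothetical grid shape; the names and parameterization are illustrative:

```python
import numpy as np

def object_count_heatmap(positions, grid_shape=(8, 8, 4)):
    """Count the time instants an audio object spends in each spatial
    grid tile, producing a 3-D "counts" heatmap.

    positions: (T, 3) array of (x, y, z) object coordinates per time
    step, assumed normalized to [0, 1) in each dimension.
    """
    dims = np.array(grid_shape)
    # Map normalized coordinates to integer tile indices.
    idx = np.clip(np.floor(positions * dims).astype(int), 0, dims - 1)
    counts = np.zeros(grid_shape, dtype=int)
    np.add.at(counts, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    return counts
```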
- Figures 10A and 10B show examples of graphs indicating spatial audio object counts for a single song.
- the song was in the Atmos format and the audio objects are Atmos audio objects.
- graph 1005a corresponds to audio object 1
- graph 1005b corresponds to audio object 2
- graph 1005c corresponds to audio object 3
- graph 1005d corresponds to audio object 4
- graph 1005e corresponds to audio object 5
- graph 1005f corresponds to audio object 6
- graph 1005g corresponds to audio object 7
- graph 1005h corresponds to audio object 8
- graph 1005i corresponds to audio object 9
- graph 1005j corresponds to audio object 10
- graph 1005k corresponds to audio object 11
- graph 1005l corresponds to audio object 12
- graph 1005m corresponds to audio object 14
- graph 1005o corresponds to audio object 15.
- the coordinates x, y and z denote the length, width and height of Atmos bins for an acoustic space, which is an example of a cubic audio environment.
- a sphere in a particular location indicates a “count,” an instance of time during which a corresponding audio object was in the corresponding (x,y,z) location during the song.
- audio object counts may be used as a basis for estimating P, the temporal persistence of an echo reference.
- Some implementations use audio object counts as the basis for a spatial importance weighting.
- the spatial importance weighting may, in some examples, be used along with various other types of importance metrics, such as audibility metrics. For example, if a spatial importance weighting is used in conjunction with an “uninformed rendering correlation matrix” such as those described with reference to Figure 9B, some implementations involve producing a “spatially informed correlation matrix,” in which spatial locations having more audio object presence are given more prominence.
- U may be based, at least in part, on a metric of correlation between each echo reference.
- the spatially informed correlation matrix may be used as a proxy for an audio data-based correlation metric (for example, a correlation matrix based on PCM data for each echo reference) to produce an importance metric for input to the echo reference importance estimator 401.
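- One possible realization of such a spatially informed correlation matrix is sketched below, reusing the rendering-matrix layout and count heatmap from the earlier sketches; the per-tile count normalization is an illustrative choice, not prescribed here:

```python
import numpy as np

def spatially_informed_correlation(rendering_matrix, counts):
    """Like the uninformed rendering correlation, but spatial tiles
    with more audio object presence are given more prominence.

    rendering_matrix: (N, nx, ny, nz, n_bin) rendering filters.
    counts: (nx, ny, nz) audio object count heatmap.
    """
    N = rendering_matrix.shape[0]
    # Normalize counts into per-tile importance weights.
    weights = counts / max(counts.sum(), 1)
    # Emphasize each device's response in frequently occupied tiles.
    weighted = rendering_matrix * weights[None, ..., None]
    return np.corrcoef(weighted.reshape(N, -1))
```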
- Figures 11A and 11B show examples of an uninformed rendering correlation matrix and a spatially informed correlation matrix, respectively.
- Both the uninformed rendering correlation matrix of Figure 11A and the spatially informed correlation matrix of Figure 11B correspond with the arrangement of audio devices shown in Figure 9A and the same audio content that was used to produce the spatial audio object counts shown in Figures 10A and 10B.
- the highest possible rank is 1.0 and the lowest possible rank is zero.
- the effects of the subwoofer have been omitted.
- the values corresponding to each audio device’s correlation with itself have been omitted.
- the rankings of the spatially informed correlation matrix differ from those of the uninformed rendering correlation matrix.
- the highest-ranked non-local echo references according to the uninformed rendering correlation matrix differ from those of the spatially informed rendering correlation matrix.
- the audio played back by audio device 2 is the highest-ranked non-local echo reference for audio device 1 according to the uninformed rendering correlation matrix
- the audio played back by audio device 5 is the highest-ranked non-local echo reference for audio device 1 according to the spatially informed rendering correlation matrix of Figure 11B.
- One way to compare the utility of the approximation of a PCM-based correlation matrix via a spatially informed correlation matrix would be to evaluate the resulting non-local reference management schemes implemented by a local device based on each of these metrics.
- a simple indicator of how close the approximation is would be a comparison of the echo reference ranks produced by the echo reference importance estimator 401 based on each type of metric.
- Figures 12A, 12B and 12C show examples of echo reference importance rankings produced by the echo reference importance estimator 401 based on a PCM-based correlation matrix, a spatially informed correlation matrix and an uninformed correlation matrix, respectively, using the same audio content that was used to produce the spatial audio object counts shown in Figures 10A and 10B.
- the highest possible rank is 1.0 and the lowest possible rank is zero.
- the echo reference importance rankings shown in Figure 12A, which correspond to those produced by the echo reference importance estimator 401 based on the PCM-based correlation matrix, are used in some implementations as a “ground truth” by which the other rankings may be evaluated.
- a comparison of Figures 12A and 12C reveals that the importance rankings based on the uninformed correlation matrix provide a very rough approximation of the importance rankings based on the PCM-based correlation matrix: for example, the highest-ranked non-local echo references based on the uninformed correlation matrix do not match any of the highest-ranked non-local echo references of the PCM-based correlation matrix.
- Figures 12A, 12B and 12C show that the importance rankings based on the spatially informed correlation matrix provide a better approximation of the importance rankings based on the PCM-based correlation matrix than those based on the uninformed correlation matrix.
- the highest-ranked non-local echo references for audio devices 1 and 5 match the highest-ranked non-local echo references for audio devices 1 and 5 according to the PCM-based correlation matrix.
- the highest-ranked non-local echo references for audio devices 3 and 4 are the second-highest ranked non-local echo references according to the PCM-based correlation matrix.
- the LUPA estimates are based on an assumption that the spatial scene is stationary within an estimation time window.
- the LUPA estimates will eventually reflect any notable change in the spatially rendered scene after some (variable) time delay. This means that during significant audio scene changes, the echo references selected using these estimates, as well as the virtual echo references generated, may be incorrect.
- the echo return loss enhancement (ERLE) may decrease beyond operating limits, which could lead to echo management system instabilities. Such conditions can also trigger fast reference switching that might not be actually needed, but which is an artifact of the changing scene dynamics. To guard against these potential negative outcomes, we disclose herein two additions to the upstream data processing:
- Audio scene change metadata 703 which may include information regarding: a) Significant spatial change events of each audio object and bed sources, and/or b) Instances of audio object overlap/interaction;
- a scene change analyzer 755 which may be configured to analyze the audio scene change metadata 703 and/or corresponding audio data (such as PCM data), compare the spatial energy distribution of the current audio scene to the spatial energy distribution of an upcoming audio scene and generate audio content look-ahead based audio scene change messages 715.
- the look-ahead time window may vary according to the particular implementation.
- the audio scene change messages 715 may enable the echo reference importance estimator 401 and the metadata-based metric computation module 705 to dump their histories and reset their memory buffers, thereby enabling a fast response to an audio scene change.
- a “fast” response may be on the order of hundreds of milliseconds (such as 300 to 500 milliseconds). Such fast responses may, for example, avoid the risk of AEC divergence.
- audio object metadata files contain spatial coordinates for each time interval.
- This spatial metadata may be used as input to, for example, the scene change analyzer 755 of Figures 7 and 8.
- the scene change analyzer 755 may, in some examples, calculate audio object density in each tile (a unit of area or volume) of a spatial grid that is used to represent the audio environment. In its most fundamental form, this could be the count of all audio objects in a given tile, such as shown in the examples of Figures 10A and 10B.
- the audio scene change messages 715 are device-specific, because an audio device only needs information regarding audio scene changes within that audio device’s audible spatial grid subset (the grid subset that significantly affects the operation of the MC-EMS 203)
- an example importance metric (I(t)) at time t could be expressed as follows:
- i a spatial grid index
- n a look-ahead window
- C_i(t + k) represents the audio object count at look-ahead time k
- α_ik and β_ik represent predefined coefficients per spatial grid point, depending on the audio device configuration. In most cases α_ik and β_ik are less than 1.
- Such an importance metric can be designed to approximate a weighted object density, or the cumulative object persistence within the spatial and temporal region of interest.
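- Because the exact equation for I(t) is not reproduced in this text, the sketch below assumes one plausible form, I(t) = Σ_i Σ_k α_ik · C_i(t + k)^β_ik, chosen to be consistent with the stated goal of approximating a weighted object density or cumulative object persistence; the functional form itself is an assumption:

```python
import numpy as np

def importance_metric(counts_lookahead, alpha, beta):
    """Sketch of a look-ahead importance metric I(t), assuming the
    form I(t) = sum_i sum_k alpha_ik * C_i(t + k) ** beta_ik.

    counts_lookahead: (num_tiles, n) array whose [i, k] entry is the
    audio object count C_i(t + k) for grid tile i at look-ahead k.
    alpha, beta: (num_tiles, n) coefficient arrays, typically < 1.
    """
    return float(np.sum(alpha * counts_lookahead ** beta))
```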
- Metadata-Based EMS Health Prediction: An integral part of the metadata-based scene analysis for the purpose of echo management is the EMS health data 423 determined by the MC-EMS performance model 405.
- the EMS health data 423 is highly sensitive to significant audio scene changes and may, for example, indicate EMS divergence caused by such audio scene changes.
- some implementations of the echo reference orchestrator 302 may be configured to use such audio scene change information to predict the EMS health data 423 (e.g., via the MC-EMS performance model 405).
- if the MC-EMS performance model 405 predicts, for example, a possible EMS filter divergence based on one or more EMS look-ahead statistics 732 and/or audio scene change messages 715, according to some disclosed examples the MC-EMS performance model 405 may be configured to provide corresponding EMS health data 423 to the echo reference importance estimator 401 and the echo reference selector 402, which can reset their algorithms accordingly.
- the MC-EMS performance model 405 may be configured to implement an embodiment of EMS health prediction based on a regression model that uses scene change importance look-ahead data, e.g., as follows:
- A = f({I_ik}), where A represents EMS health data, f represents a regression function (which may be linear or non-linear) and the set {I_ik} represents the set of importance values for the total look-ahead window and spatial grid set.
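- A minimal sketch of such a regression-based prediction follows, assuming a linear regression function f with offline-fitted coefficients; the linearity, the bias term and the parameter names are illustrative assumptions (f may equally be non-linear):

```python
import numpy as np

def predict_ems_health(importance_lookahead, coeffs, bias=0.0):
    """Sketch of EMS health prediction A = f({I_ik}), assuming a
    linear regression function f whose coefficients were fitted
    offline against observed EMS health data.

    importance_lookahead: (num_tiles, n) importance values I_ik.
    coeffs: (num_tiles, n) fitted regression weights.
    """
    return bias + float(np.sum(coeffs * importance_lookahead))
```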
- Figure 13 illustrates a simplified example of determining a virtual echo reference.
- the echo reference generator 710 may be configured to produce one or more virtual echo references according to the methods disclosed in this section.
- the renderer 201 may be configured to produce one or more virtual echo references according to such methods.
- audio device A is a local audio device and audio devices B and C are non-local audio devices.
- the virtual echo reference corresponds with a virtual sound source D at position O.
- the position O may, for example, be obtained using room mapping data (such as audio device location data) available to the renderer 201 or the echo reference generator 710 via an initial and/or periodic calibration step. For example, if all speakers have no occlusions, one may determine the position O according to the centroid position of the cumulative far device heatmap, which may be generated by adding the rendering matrix slices (e.g., as shown in Figure 9B) for each non-local or “far” audio device. For example, if we let the i-th far device’s broadband gain at spatial tile j be w_ij, then this position vector o can be found as the weighted centroid o = (Σ_i Σ_j w_ij p_j) / (Σ_i Σ_j w_ij), where p_j denotes the position of spatial tile j.
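- A minimal sketch of this centroid computation follows; the tile_positions array (tile-center coordinates) is an assumed representation of the spatial grid:

```python
import numpy as np

def virtual_source_position(far_gains, tile_positions):
    """Centroid of the cumulative far-device heatmap.

    far_gains: (num_far_devices, num_tiles) broadband gains w_ij.
    tile_positions: (num_tiles, 3) tile-center coordinates (an assumed
    representation of the spatial grid).
    Returns the position vector o of the virtual sound source D.
    """
    # Sum gains over the far devices to form the cumulative heatmap.
    heatmap = far_gains.sum(axis=0)                  # (num_tiles,)
    return heatmap @ tile_positions / heatmap.sum()  # weighted centroid
```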
- the virtual sound source D corresponds to the playback of audio devices B and C from the perspective of audio device A.
- Virtual sound source D is the equivalent sound source at position O that creates the same non-local audio device playback sound field that the separate played-back audio from audio devices B and C would create at the location of audio device A.
- the virtual source D need not approximate the full sound field that the separate played-back audio from audio devices B and C would create in all parts of the audio environment.
- D for far device echo references can be realized using different approaches, a few of which are described herein.
- the renderer produces references differently due to the differing capabilities of loudspeakers regarding playback of content at these frequencies.
- the particular low frequency range may depend on details of the particular implementation, such as loudspeaker capabilities. In some implementations in which the capability of one or more loudspeakers in the audio environment for reproducing sound in the bass range is minimal, the range of low frequencies may be 400 Hz or less, whereas for other implementations the range of low frequencies may be 350 Hz or less, 300 Hz or less, 250 Hz or less, 200 Hz or less, etc.
- the reference signals used for cancellation in the low frequencies can be determined using the renderer configuration.
- a weighted summation of the echo references over a proportion of low frequencies can be used.
- the amount of crossover with cancellation of higher frequencies may also be considered.
- In Equation A, which may, for example, take a form such as x̂(k) = Σ_{i=1}^{n} w_i x_i(k) for each low-frequency bin k, the superscript n represents the total number of echo references.
- the weighting, the range of low frequencies to use this summation over and the proportion of crossover with higher frequency cancellation may be extracted from rendering information in some examples. Examples of weighting and low frequency ranges are described below.
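- A minimal sketch of such a weighted low-frequency summation, in the Equation A form assumed above, follows; the frequency-domain representation and cutoff-bin parameterization are illustrative assumptions:

```python
import numpy as np

def low_frequency_reference(refs_freq, weights, cutoff_bin):
    """Weighted summation of echo references over low frequencies,
    in the Equation A form assumed above.

    refs_freq: (n, n_bins) frequency-domain echo references, one row
    per reference.
    weights: length-n weights, e.g., derived from rendering information.
    cutoff_bin: frequency bin below which the summed reference is used.
    """
    summed = weights @ refs_freq  # weighted sum across references
    lf_ref = np.zeros_like(refs_freq[0])
    lf_ref[:cutoff_bin] = summed[:cutoff_bin]
    return lf_ref
```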
- the weighting and low frequency ranges may be based, at least in part, on individual loudspeaker capability and limitations, and how content may be rendered for each device.
- One motivation for implementing low-frequency management methods is to avoid the non-uniqueness problem and high cross-correlation between echo references at low frequencies.
- Figure 14 shows an example of a low-frequency management module.
- the low-frequency management module 1410 is a component of the echo reference generator 710.
- the low-frequency management module 1410 may be configured to determine the weighting, the range of low frequencies for summation and the proportion of crossover with higher frequency cancellation (if any) referenced in Equation A.
- the elements of Figure 14 are as follows:
- Metadata which may be, or may include, metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmixing matrix and/or a matrix of loudspeaker activations;
- 220 Local echo reference for playback and cancellation
- 721 Locally produced copy of an echo reference a non-local device is playing
- a frequency selector module configured to choose low frequencies and a crossover to apply (if any).
- the frequency selector module 1402 may, for example, choose a threshold for k in Equation A;
- a weight generation module configured to generate weights for each echo reference based on the loudspeaker metadata 312;
- a summation module configured to compute the weighted sum of echo references
- the summation module 1404 may produce one or more weighted sums of low-frequency device echo references 723LF.
- the low-frequency management module 1410 may be configured to select frequencies and/or generate weights based, at least in part, on rendering information, such as information regarding the rendering matrix 722 (or the rendering matrix 722 itself).
- the frequencies to perform low frequency management over could be based on a hard cut-off frequency or on a range of frequencies in a crossover frequency range.
- a crossover frequency range may be desirable to account for differing loudspeaker capabilities in an overlapping frequency region where certain audio device echo references have lower frequency content than the summed reference.
- a crossover frequency range may be desirable when a subwoofer is present and may be considered the dominant or only reference at the majority of lower frequencies.
- the cut-off frequency or range of frequencies in a crossover frequency range may be included in rendering information, which may take into account the loudspeaker capabilities of audio devices in the audio environment.
- the cut-off frequency may have a value of a few hundred Hz, such as 200 Hz, 250 Hz, 300 Hz, 350 Hz, 400 Hz, etc.
- the crossover frequency range may have a low end of 100 Hz, 150 Hz, 200 Hz, etc., and may include frequencies up to 200 Hz, 250 Hz, 300 Hz, 350 Hz, 400 Hz, etc.
- weights may be applied according to audio device configuration and capabilities, such as only using a local or subwoofer reference for low-frequency playback if a subwoofer is present. If a subwoofer is present, it will generally be desirable for most low-frequency audio content to be played back by the subwoofer rather than for the low-frequency audio content to be provided to audio devices that may be unable to play back audible lower frequencies without distortion. According to some implementations that do not include a subwoofer and in which the audio devices have the same (or similar) capabilities for low-frequency audio reproduction, in order to obtain any audible lower-frequency performance the low frequencies reproduced by all the audio devices may be the same in order to maximize power.
- the reproduced low-frequency audio may be monophonic/non-directional.
- the weighting may be 1.0 for a single reference. This is equivalent to monophonic-only echo cancellation below a certain frequency, which may be referred to herein as “max_mono_hz.”
- Figures 15A and 15B show examples of low-frequency management for implementations with and without a subwoofer.
- multi-channel echo cancellation occurs between frequencies min_multi_hz and max_cancell_hz.
- Figure 15A shows an example of low-frequency management for implementations with a subwoofer.
- max_mono_hz is in a frequency range in which multi-channel echo cancellation occurs (above min_multi_hz).
- This example is appropriate for a subwoofer reference (“Sub Ref” in Figure 15A) where it is desirable to perform echo cancellation corresponding to the subwoofer reference up to a max_mono_hz value of a few hundred Hz, such as 200 Hz, 300 Hz, 400 Hz, etc.
- min_multi_hz may be 100 Hz, 150 Hz, 200 Hz, etc. This allows there to be some crossover between the mono-only and multi-channel echo cancellation frequency ranges.
- Figure 15B shows an example of low-frequency management for implementations without a subwoofer.
- the local reference is used from 0 Hz to max_mono_hz for mono-cancellation.
- max_mono_hz and min_multi_hz are set to be the same frequency.
- max_mono_hz and min_multi_hz may not be the same frequency for implementations without a subwoofer.
- max_mono_hz and min_multi_hz may be the same frequency.
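- One way such a crossover could be parameterized is sketched below, assuming a simple linear fade of the mono (or subwoofer) reference weight between min_multi_hz and max_mono_hz; the linear shape and default values are illustrative assumptions:

```python
import numpy as np

def mono_reference_weight(freqs_hz, min_multi_hz=150.0, max_mono_hz=300.0):
    """Per-frequency weight for the mono (or subwoofer) reference,
    fading linearly from 1 at min_multi_hz to 0 at max_mono_hz.
    The multi-channel references receive the complementary weight.
    """
    span = max(max_mono_hz - min_multi_hz, 1e-9)
    return np.clip((max_mono_hz - np.asarray(freqs_hz)) / span, 0.0, 1.0)
```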
- a “higher frequency” may refer to any audible frequency above one of the low-frequency ranges described with reference to Figure 14.
- the difference in propagation characteristics at higher frequencies as compared to lower frequencies, and the differences in audio driver (speaker) beamforming for different frequencies, indicate that reference generation, importance estimation and selection would be highly frequency-sensitive at higher frequencies.
- low-frequency audio content is normally much less directional than higher-frequency audio content.
- typical rendered audio scenes contain more information in high frequencies than in lower frequencies.
- rendering of echo references that have substantial high-frequency components may be relatively more complicated as compared to rendering of virtual references that have mainly lower-frequency components.
- some high-frequency management implementations may involve multiple instances of Equation A, each for a different portion of a high-frequency range and each with potentially different weighting factors. Non-uniqueness and the associated AEC divergence are lower risks in higher-frequency bands.
- some examples exploit frequency sparsity of some audio content to manage multi-band reference generation. Creating a mix that is based at least in part on frequency-dependent audibility differences can eliminate the need for having multiple echo references without degrading the quality of the AEC health.
- the rendering implementation may be similar to the frequency management implementations for echo references.
- only the weight generator and frequency selector parameters may be different.
- Figure 15C illustrates elements that may be used to implement a higher-frequency management method according to one example.
- Figure 15C illustrates blocks configured for performing a multi-band higher-frequency management method, which is implemented via frequency management modules 1410A-1410K in this example.
- each of the frequency management modules 1410A-1410K is configured to implement frequency management for one frequency band of frequency bands A through K.
- frequency band A is the frequency band adjacent to the low-frequency band that is processed by the low-frequency management module 1410 of Figure 14.
- frequency band A may overlap with the low-frequency band that is processed by the low- frequency management module 1410.
- frequency bands A through K may also overlap.
- K represents an integer of three or more.
- Frequency bands A through K may be selected according to any convenient method, such as according to a linear frequency scale, according to a logarithmic frequency scale, according to a mel scale, etc.
- each of the frequency management modules 1410A-1410K is configured to function generally as described above with reference to the low-frequency management module 1410 of Figure 14, except that each of the frequency selectors 1402A-1402K is configured to select a different range of frequencies.
- the weights that are generated by the weight generation modules 1403A-1403K also may vary according to frequency.
- the frequency management modules 1410A-1410K are configured to output frequency banded non-local device echo references 723BA-723BK, each of which corresponds to one of the frequency bands A through K, to the bands-to-PCM converter 1515.
- the bands-to-PCM converter 1515 is configured to combine the frequency banded non-local device echo references 723BA-723BK and to output non-local higher-frequency device echo references 723HF.
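- A minimal sketch of such multi-band reference generation follows, assuming frequency-domain references and per-band weights; the band edges and weights are illustrative parameters that would, in practice, come from the frequency selectors and weight generators described above:

```python
import numpy as np

def higher_frequency_reference(refs_freq, band_edges, band_weights):
    """Apply a (possibly different) weighted summation per frequency
    band, then recombine the banded results into a single
    higher-frequency reference.

    refs_freq: (n, n_bins) frequency-domain echo references.
    band_edges: list of (lo_bin, hi_bin) pairs for bands A..K.
    band_weights: (K, n) per-band weights for the n references.
    """
    out = np.zeros_like(refs_freq[0])
    for (lo, hi), w in zip(band_edges, band_weights):
        out[lo:hi] = w @ refs_freq[:, lo:hi]  # band-specific mix
    return out
```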
- Some disclosed subspace-based examples involve defining lower-dimensional embedding via statistical properties. Some subspace-based examples involve using methods such as independent component analysis or principal component analysis. By implementing such methods, a control system may be configured to find the K statistically independent audio streams that approximate the non-local references.
- Figure 16 is a block diagram that outlines an example of another disclosed method.
- the blocks of method 1600, like those of other methods described herein, are not necessarily performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described. In some examples, two or more blocks may be performed concurrently.
- the method 1600 may be performed by an apparatus or system, such as the apparatus 50 that is shown in Figure 1A and described above.
- the method 1600 may, for example, be performed by the control system 60 of Figure 1A.
- the echo reference generator 710 works in tandem with the echo reference importance estimator 401 and the echo reference selector 402: in this example, blocks 1620, 1655 and 1660 are implemented by the echo reference importance estimator 401 and/or the echo reference selector 402, and blocks 1625-1650 are implemented by the echo reference generator 710.
- method 1600 starts with block 1601, after which an initial local audio device and an initial non-local (“far”) audio device are selected in block 1605.
- in block 1610, it is determined whether all local audio devices have been processed. If so, the process stops (block 1615). However, if it is determined in block 1610 that the current local audio device has not been processed, the process continues to block 1620.
- block 1620 involves determining whether each far device has been evaluated for the current local audio device. If not, the process continues to block 1655, in which it is determined whether the echo reference characteristics (e.g., LUPA values) for the current far device’s audio stream exceed a threshold value.
- the threshold may be a long-term function of the audio device configuration (such as the audio device layout and audio device capabilities), characteristics of the audio environment and playback content. According to some such examples, this threshold may be approximated as the long-term mean of the echo reference characteristics for the current audio device configuration, audio environment and content type. In this context, “long-term” may be hours or days. In some examples, playback may not be continuous during the “long term” time interval.
- this example involves selecting a subset of far devices based on the echo reference characteristics 733 that are output by the metadata-based metric computation module 705 (for example, LUPA scores).
- the current playback frames of the selected far devices form the PCM matrix P for the local device currently being evaluated. Accordingly, if it is determined in block 1655 that the echo reference characteristics for the current far device’s audio stream exceed a threshold value, in block 1660 the far device’s audio frame is added as a column of the PCM matrix P. In this example, the next far device (if any) is selected in block 1662 and then the process continues to block 1620.
- the process continues to blocks that are implemented by the echo reference generator 710.
- the process continues to block 1625, which involves obtaining the PCM matrix P (e.g., from a memory).
- a dimension reduction is done to reduce any feature redundancy.
- the dimension reduction may, for example, be achieved by a method such as Principal Component Analysis (PCA).
- Other examples may implement other methods of dimension reduction.
- the PCM matrix columns are made zero mean in block 1630, e.g., as P0 = P − mean(P), where the mean is taken per column (so that each column of P0 has zero mean)
- the covariance matrix C is computed in block 1635, e.g., as C = P0ᵀP0 / (M − 1), where M is the number of rows (time samples) of P0
- block 1640 involves performing eigendecomposition to determine the eigenvalue matrix D and the eigenvector matrix V such that CV = VD
- eigenvalues that are greater than a threshold T are retained and the redundant features are discarded.
- An example realization of such a threshold could be constructed using an energy-based approximation.
- Because D is a diagonal matrix with values decreasing along the main diagonal, we can define D_T by retaining the most significant eigenvalues, i.e., those that together contain a given percentage (in this example, 90%) of the signal energy.
- in this example, block 1645 involves determining a truncated eigenvalue matrix D_T and a truncated eigenvector matrix V_T.
- the truncated eigenvalue matrix D_T is one example of the weight matrix in equation 0, and the corresponding eigenvectors in the truncated matrix V_T are, collectively, an example of the input matrix A in equation 0. Therefore, in this example, block 1650 involves determining an echo reference by multiplying D_T by V_T.
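- The following sketch walks through blocks 1630-1650 using standard PCA operations; the final step, projecting the zero-mean frames onto the retained eigenvectors, is one common realization of combining the truncated matrices and is an assumption rather than the only possibility:

```python
import numpy as np

def subspace_echo_references(P, energy_fraction=0.90):
    """Sketch of blocks 1630-1650: PCA-based dimension reduction of
    the PCM matrix P (one column per selected far device, one row per
    time sample). Returns a reduced set of echo reference streams.
    """
    # Block 1630: make each column zero mean.
    P0 = P - P.mean(axis=0, keepdims=True)
    # Block 1635: covariance matrix of the device channels.
    C = np.cov(P0, rowvar=False)
    # Block 1640: eigendecomposition, CV = VD.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]  # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Block 1645: retain the eigenvalues holding 90% of signal energy.
    cum = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cum, energy_fraction)) + 1
    V_T = eigvecs[:, :k]
    # Block 1650: form the reduced references; projecting the zero-mean
    # frames onto the retained eigenvectors is one common realization.
    return P0 @ V_T
```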
- block 1652 involves incrementing the number of the local audio device to be processed. For example, if local audio device 1 has just been processed, the audio device number is incremented to audio device 2. The process reverts to block 1610, wherein it is determined whether all local audio devices have been processed. After all local audio devices have been processed, the process ends (block 1615).
- Figure 17 is a flow diagram that outlines another example of a disclosed method.
- the blocks of method 1700, like those of other methods described herein, are not necessarily performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described. In some examples, two or more blocks may be performed concurrently.
- method 1700 is an audio processing method.
- the method 1700 may be performed by an apparatus or system, such as the apparatus 50 that is shown in Figure 1A and described above.
- the method 1700 may, for example, be performed by the control system 60 of Figure 1A.
- blocks of method 1700 may be performed by one or more devices within an audio environment, e.g., by an audio system controller (such as what is referred to herein as a smart home hub) or by another component of an audio system, such as a smart speaker, a television, a television control module, a laptop computer, a mobile device (such as a cellular telephone), etc.
- at least some blocks of the method 1700 may be performed by a device that implements a cloud-based service, such as a server.
- the audio environment may include one or more rooms of a home environment.
- the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, etc.
- block 1705 involves receiving, by a control system, location information for each of a plurality of audio devices in an audio environment.
- the location information may be included in the metadata 312 that is disclosed herein, which may include information corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, etc.
- block 1705 may involve receipt of the location information by a renderer, such as the renderer 201 described herein (see, for example, Figures 7 and 8).
- block 1710 involves generating, by the control system and based at least in part on the location information, rendering information for a plurality of audio devices in an audio environment.
- the rendering information may be, or may include, a matrix of loudspeaker activations.
- method 1700 may involve rendering the audio data, based at least in part on the rendering information, to produce rendered audio data.
- the control system may be an orchestrating device control system.
- method 1700 may involve providing at least a portion of the rendered audio data to each audio device of the plurality of audio devices in the audio environment.
- block 1715 involves determining, by the control system and based at least in part on the rendering information, a plurality of echo reference metrics.
- each echo reference metric of the plurality of echo reference metrics corresponds to audio data reproduced by one or more audio devices of the plurality of audio devices.
- the control system may be an orchestrating device control system.
- method 1700 may involve providing at least one echo reference metric to each audio device of the plurality of audio devices.
- method 1700 may involve receiving, by the control system, a content stream that includes audio data and corresponding metadata.
- determining the at least one echo reference metric may be based, at least in part, on loudspeaker metadata, metadata corresponding to received audio data and/or an upmixing matrix.
- block 1715 may be performed, at least in part, by the metadata-based metric computation module 705 of Figures 7 and 8.
- at least one echo reference metric may correspond to the echo reference characteristics 733 that are output by the metadata-based metric computation module 705.
- at least one echo reference metric may correspond to a level of a corresponding echo reference, a uniqueness of a corresponding echo reference, a temporal persistence of a corresponding echo reference or an audibility of a corresponding echo reference.
- method 1700 may involve making, by the control system and based at least in part on the echo reference metrics, an importance estimation for each echo reference of a plurality of echo references.
- the control system may be an audio device control system.
- the echo reference importance estimator 401 may make the importance estimation.
- making the importance estimation may involve determining an expected contribution of each echo reference to mitigation of echo by an echo management system of an audio device of the audio environment.
- the echo management system may include an acoustic echo canceller (AEC), an acoustic echo suppressor (AES), or both an AEC and an AES.
- the echo management system may be, or may include, an instance of the MC-EMS 203 disclosed herein.
- making the importance estimation may involve determining an importance metric for a corresponding echo reference.
- determining the importance metric may be based at least in part on one or more of a current listening objective or a current ambient noise estimate.
- Some such examples may involve selecting, by the control system and based at least in part on the importance estimation, one or more selected echo references.
- the echo references may be selected by an instance of the echo reference selector 402 disclosed herein.
- Some examples may involve providing, by the control system, the one or more selected echo references to the at least one echo management system.
- Some examples may involve making, by the control system, a cost determination.
- the cost estimation module 403 may be configured to make the cost determination.
- the cost determination may, for example, involve determining a cost for at least one echo reference of the plurality of echo references.
- selecting the one or more selected echo references may be based, at least in part, on the cost determination.
- the cost determination may be based on the network bandwidth required for transmitting the at least one echo reference, an encoding computational requirement for encoding the at least one echo reference, a decoding computational requirement for decoding the at least one echo reference and/or an echo management system computational requirement for use of the at least one echo reference by the at least one echo management system.
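- A minimal sketch of such a cost combination follows; the weighted-sum form, the units and the default weights are illustrative assumptions, not prescribed by this disclosure:

```python
def echo_reference_cost(bandwidth_kbps, encode_mips, decode_mips,
                        ems_mips, w=(1.0, 1.0, 1.0, 1.0)):
    """Weighted combination of the network and computational
    requirements listed above; units and weights are illustrative.
    """
    return (w[0] * bandwidth_kbps + w[1] * encode_mips
            + w[2] * decode_mips + w[3] * ems_mips)
```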
- Some implementations may involve determining, by the control system, a current echo management system performance level.
- the MC-EMS performance model 405 may be configured to determine the current echo management system performance level.
- the importance estimation may be based, at least in part, on the current echo management system performance level.
- Some examples may involve receiving, by the control system, scene change metadata.
- the importance estimation may be based, at least in part, on the scene change metadata.
- the scene change analyzer 755 may receive the scene change metadata and may generate one or more scene change messages 715. In some such examples, the importance estimation may be based, at least in part, on one or more scene change messages 715.
- method 1700 may involve generating, by the control system, at least one echo reference.
- at least one echo reference may be generated by the echo reference generator 710.
- the echo reference generator 710 may generate at least one echo reference based, at least in part, on one or more components of the metadata 312, such as a matrix of loudspeaker activations (e.g., the rendering matrix 722).
- method 1700 may involve generating, by the control system, at least one virtual echo reference.
- a virtual echo reference may, for example, correspond to two or more audio devices of the plurality of audio devices.
- method 1700 may involve generating (e.g., by the echo reference generator 710) one or more subspace-based non-local device echo references.
- the subspace-based non-local device echo references may include low-frequency non-local device echo references.
- Some such examples may involve determining, by the control system, a weighted summation of echo references over a range of low frequencies.
- Some such examples may involve providing the weighted summation to an echo management system.
- Some implementations may involve causing the echo management system to cancel or suppress echoes based, at least in part, on the one or more selected echo references.
- Figure 18 shows an example of a floor plan of an audio environment, which is a living space in this example.
- the types and numbers of elements shown in Figure 18 are merely provided by way of example. Other implementations may include more, fewer and/or different types and numbers of elements.
- the environment 1800 includes a living room 1810 at the upper left, a kitchen 1815 at the lower center, and a bedroom 1822 at the lower right. Boxes and circles distributed across the living space represent a set of loudspeakers 1805a-1805h, at least some of which may be smart speakers in some implementations, placed in locations convenient to the space, but not adhering to any standard prescribed layout (arbitrarily placed).
- the television 1830 may be configured to implement one or more disclosed embodiments, at least in part.
- the environment 1800 includes cameras 1811a-1811e, which are distributed throughout the environment.
- one or more smart audio devices in the environment 1800 also may include one or more cameras.
- the one or more smart audio devices may be single purpose audio devices or virtual assistants.
- one or more cameras of the optional sensor system 130 may reside in or on the television 1830, in a mobile phone or in a smart speaker, such as one or more of the loudspeakers 1805b, 1805d, 1805e or 1805h.
- Although cameras 1811a-1811e are not shown in every depiction of the audio environments presented in this disclosure, each of the audio environments may nonetheless include one or more cameras in some implementations.
- Some aspects of the disclosure may be understood from the following enumerated exemplary embodiments (EEEs):
- EEE1 An audio processing method comprising: obtaining, by a control system, a plurality of echo references, the plurality of echo references including at least one echo reference for each audio device of a plurality of audio devices in an audio environment, each echo reference corresponding to audio data being played back by one or more loudspeakers of one audio device of the plurality of audio devices; making, by the control system, an importance estimation for each echo reference of the plurality of echo references, wherein making the importance estimation involves determining an expected contribution of each echo reference to mitigation of echo by at least one echo management system of at least one audio device of the audio environment, the at least one echo management system comprising an acoustic echo canceller (AEC), an acoustic echo suppressor (AES), or both an AEC and an AES; selecting, by the control system and based at least in part on the importance estimation, one or more selected echo references; and providing, by the control system, the one or more selected echo references to the at least one echo management system.
- EEE2 The audio processing method of EEE1, further comprising causing the at least one echo management system to cancel or suppress echoes based, at least in part, on the one or more selected echo references.
- EEE3 The audio processing method of EEE1 or EEE2, wherein obtaining the plurality of echo references involves: receiving a content stream that includes audio data; and determining one or more echo references of the plurality of echo references based on the audio data.
- EEE4 The audio processing method of EEE3, wherein the control system comprises an audio device control system of an audio device of the plurality of audio devices in the audio environment, further comprising: rendering, by the audio device control system, the audio data for reproduction on the audio device to produce local speaker feed signals; and determining a local echo reference that corresponds with the local speaker feed signals.
- EEE5. The audio processing method of EEE4, wherein obtaining the plurality of echo references involves determining one or more non-local echo references based on the audio data, each of the non-local echo references corresponding to non-local speaker feed signals for playback on another audio device of the audio environment.
- EEE6 The audio processing method of EEE4, wherein obtaining the plurality of echo references involves receiving one or more non-local echo references, each of the non-local echo references corresponding to non-local speaker feed signals for playback on another audio device of the audio environment.
- EEE7 The audio processing method of EEE6, wherein receiving the one or more non-local echo references involves receiving the one or more non-local echo references from one or more other audio devices of the audio environment.
- EEE8 The audio processing method of EEE6, wherein receiving the one or more non-local echo references involves receiving each of the one or more non-local echo references from a single other device of the audio environment.
- EEE9 The audio processing method of any one of EEEs 1-8, further comprising a cost determination, the cost determination involving determining a cost for at least one echo reference of the plurality of echo references, wherein selecting the one or more selected echo references is based, at least in part, on the cost determination.
- EEE10 The audio processing method of EEE9, wherein the cost determination is based on network bandwidth required for transmitting the at least one echo reference, an encoding computational requirement for encoding the at least one echo reference, a decoding computational requirement for decoding the at least one echo reference, an echo management system computational requirement for use of the at least one echo reference by the echo management system, or combinations thereof.
- EEE11 The audio processing method of EEE9 or EEE10, wherein the cost determination is based on a replica of the at least one echo reference in a time domain or a frequency domain, on a downsampled version of the at least one echo reference, on a lossy compression of the at least one echo reference, on banded power information for the at least one echo reference, or combinations thereof.
- EEE12 The audio processing method of any one of EEEs 9-11, wherein the cost determination is based on a method of compressing a relatively more important echo reference less than a relatively less important echo reference.
- EEE13 The audio processing method of any one of EEEs 1-12, further comprising determining a current echo management system performance level, wherein selecting the one or more selected echo references is based, at least in part, on the current echo management system performance level.
- EEE14 The audio processing method of any one of EEEs 1-13, wherein making the importance estimation involves determining an importance metric for a corresponding echo reference.
- EEE15 The audio processing method of EEE14, wherein determining the importance metric involves determining a level of the corresponding echo reference, determining a uniqueness of the corresponding echo reference, determining a temporal persistence of the corresponding echo reference, determining an audibility of the corresponding echo reference, or combinations thereof.
- EEE16 The audio processing method of EEE14 or EEE15, wherein determining the importance metric is based at least in part on metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmixing matrix, a matrix of loudspeaker activations, or combinations thereof.
- EEE17 The audio processing method of any one of EEEs 14-16, wherein determining the importance metric is based at least in part on a current listening objective, a current ambient noise estimate, an estimate of a current performance of the at least one echo management system, or combinations thereof.
- EEE18 An apparatus configured to perform the method of any one of EEEs 1- 17.
- EEE19 A system configured to perform the method of any one of EEEs 1-17.
- EEE20 One or more non-transitory media having software stored thereon, the software including instructions for controlling one or more devices to perform the method of any one of EEEs 1-17.
- Some aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer readable medium (e.g., a disc) which stores code for implementing one or more examples of the disclosed methods or steps thereof.
- some disclosed systems can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of disclosed methods or steps thereof.
- Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.
- Some embodiments may be implemented as a configurable (e.g., programmable) digital signal processor (DSP) that is configured (e.g., programmed and otherwise configured) to perform required processing on audio signal(s), including performance of one or more examples of the disclosed methods.
- Embodiments of the disclosed systems may be implemented as a general purpose processor (e.g., a personal computer (PC) or other computer system or microprocessor, which may include an input device and a memory) which is programmed with software or firmware and/or otherwise configured to perform any of a variety of operations including one or more examples of the disclosed methods.
- Elements of some embodiments of the inventive system are implemented as a general purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system also includes other elements (e.g., one or more loudspeakers and/or one or more microphones).
- A general purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or a keyboard), a memory, and a display device.
- Another aspect of the present disclosure is a computer readable medium (for example, a disc or other tangible storage medium) which stores code for performing (e.g., code executable to perform) one or more examples of the disclosed methods or steps thereof.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Circuit For Audible Band Transducer (AREA)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/275,800 US20240296822A1 (en) | 2021-02-09 | 2022-02-07 | Echo reference generation and echo reference metric estimation according to rendering information |
| CN202280013949.8A CN116830560A (en) | 2021-02-09 | 2022-02-07 | Echo reference generation and echo reference index estimation based on rendering information |
| EP22705965.6A EP4292272A1 (en) | 2021-02-09 | 2022-02-07 | Echo reference generation and echo reference metric estimation according to rendering information |
Applications Claiming Priority (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202163147573P | 2021-02-09 | 2021-02-09 | |
| US63/147,573 | 2021-02-09 | | |
| US202163201939P | 2021-05-19 | 2021-05-19 | |
| US63/201,939 | 2021-05-19 | | |
| EP21177382.5 | 2021-06-02 | | |
| EP21177382 | 2021-06-02 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022173684A1 (en) | 2022-08-18 |
Family
ID=80447674
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2022/015529 Ceased WO2022173706A1 (en) | 2021-02-09 | 2022-02-07 | Echo reference prioritization and selection |
| PCT/US2022/015436 Ceased WO2022173684A1 (en) | 2021-02-09 | 2022-02-07 | Echo reference generation and echo reference metric estimation according to rendering information |
Family Applications Before (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2022/015529 Ceased WO2022173706A1 (en) | 2021-02-09 | 2022-02-07 | Echo reference prioritization and selection |
Country Status (3)
| Country | Link |
|---|---|
| US (2) | US20240296822A1 (en) |
| EP (2) | EP4292272A1 (en) |
| WO (2) | WO2022173706A1 (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12231602B2 (en) * | 2021-08-04 | 2025-02-18 | Nokia Technologies Oy | Apparatus, methods and computer programs for performing acoustic echo cancellation |
| US12361954B2 (en) * | 2023-06-01 | 2025-07-15 | Cisco Technology, Inc. | Ambience-adapted audio watermarking for teleconferencing |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2016048381A1 (en) * | 2014-09-26 | 2016-03-31 | Nunntawi Dynamics Llc | Audio system with configurable zones |
| US9659555B1 (en) * | 2016-02-09 | 2017-05-23 | Amazon Technologies, Inc. | Multichannel acoustic echo cancellation |
| WO2018226359A1 (en) * | 2017-06-06 | 2018-12-13 | Cypress Semiconductor Corporation | System and methods for audio pattern recognition |
| WO2021021707A1 (en) | 2019-07-30 | 2021-02-04 | Dolby Laboratories Licensing Corporation | Managing playback of multiple streams of audio over multiple speakers |
Family Cites Families (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2010258941A (en) | 2009-04-28 | 2010-11-11 | Sony Corp | Echo removing apparatus, echo removing method, and communication apparatus |
| US8855295B1 (en) | 2012-06-25 | 2014-10-07 | Rawles Llc | Acoustic echo cancellation using blind source separation |
| US9497544B2 (en) * | 2012-07-02 | 2016-11-15 | Qualcomm Incorporated | Systems and methods for surround sound echo reduction |
| US9633671B2 (en) | 2013-10-18 | 2017-04-25 | Apple Inc. | Voice quality enhancement techniques, speech recognition techniques, and related systems |
| US9762742B2 (en) | 2014-07-24 | 2017-09-12 | Conexant Systems, Llc | Robust acoustic echo cancellation for loosely paired devices based on semi-blind multichannel demixing |
| US9799330B2 (en) | 2014-08-28 | 2017-10-24 | Knowles Electronics, Llc | Multi-sourced noise suppression |
| US9769587B2 (en) | 2015-04-17 | 2017-09-19 | Qualcomm Incorporated | Calibration of acoustic echo cancelation for multi-channel sound in dynamic acoustic environments |
| US9589575B1 (en) | 2015-12-02 | 2017-03-07 | Amazon Technologies, Inc. | Asynchronous clock frequency domain acoustic echo canceller |
| US10325583B2 (en) | 2017-10-04 | 2019-06-18 | Guoguang Electric Company Limited | Multichannel sub-band audio-signal processing using beamforming and echo cancellation |
| US10192567B1 (en) | 2017-10-18 | 2019-01-29 | Motorola Mobility Llc | Echo cancellation and suppression in electronic device |
| US10382092B2 (en) | 2017-11-27 | 2019-08-13 | Verizon Patent And Licensing Inc. | Method and system for full duplex enhanced audio |
| US10566008B2 (en) | 2018-03-02 | 2020-02-18 | Cirrus Logic, Inc. | Method and apparatus for acoustic echo suppression |
- 2022-02-07 WO PCT/US2022/015529 patent/WO2022173706A1/en not_active Ceased
- 2022-02-07 WO PCT/US2022/015436 patent/WO2022173684A1/en not_active Ceased
- 2022-02-07 US US18/275,800 patent/US20240296822A1/en active Pending
- 2022-02-07 US US18/263,956 patent/US12380873B2/en active Active
- 2022-02-07 EP EP22705965.6A patent/EP4292272A1/en active Pending
- 2022-02-07 EP EP22704994.7A patent/EP4292271A1/en active Pending
Also Published As
| Publication number | Publication date |
|---|---|
| WO2022173706A1 (en) | 2022-08-18 |
| US20240296822A1 (en) | 2024-09-05 |
| EP4292272A1 (en) | 2023-12-20 |
| US20240304171A1 (en) | 2024-09-12 |
| US12380873B2 (en) | 2025-08-05 |
| EP4292271A1 (en) | 2023-12-20 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| EP2973552B1 (en) | | Spatial comfort noise |
| US20240267469A1 (en) | | Coordination of audio devices |
| KR102851669B1 (en) | | Compensating for environmental noise in content and environmental awareness |
| US20150248889A1 (en) | | Layered approach to spatial audio coding |
| EP2936485A1 (en) | | Object clustering for rendering object-based audio content based on perceptual criteria |
| US20240296822A1 (en) | | Echo reference generation and echo reference metric estimation according to rendering information |
| EP2779161B1 (en) | | Spectral and spatial modification of noise captured during teleconferencing |
| EP4430861A1 (en) | | Distributed audio device ducking |
| CN116830560A (en) | | Echo reference generation and echo reference index estimation based on rendering information |
| CN116783900A (en) | | Acoustic state estimator based on subband domain acoustic echo canceller |
| US12401945B2 (en) | | Subband domain acoustic echo canceller based acoustic state estimator |
| RU2823537C1 (en) | | Audio encoding device and method |
| KR20240152893A (en) | | Parametric spatial audio rendering |
| CN114175152A (en) | | System and method for enhancing degraded audio signals |
| HK1222470A1 (en) | | Hybrid waveform-coded and parametric-coded speech enhancement |
| HK1222470B (en) | | Hybrid waveform-coded and parametric-coded speech enhancement |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22705965; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 18275800; Country of ref document: US |
| | WWE | Wipo information: entry into national phase | Ref document number: 202280013949.8; Country of ref document: CN |
| | WWE | Wipo information: entry into national phase | Ref document number: 2022705965; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2022705965; Country of ref document: EP; Effective date: 20230911 |


