CN116830561A - Echo reference prioritization and selection

Publication number: CN116830561A
Application number: CN202280013990.5A
Original language: Chinese (zh)
Inventors: B. J. Southwell, C. G. Hines, D. Gunawan
Applicant/Assignee: Dolby Laboratories Licensing Corp
Priority claimed from: PCT/US2022/015529 (published as WO2022173706A1)
Legal status: Pending

Abstract

Some embodiments relate to obtaining a plurality of echo references, the plurality of echo references including at least one echo reference for each of a plurality of audio devices in an audio environment, each echo reference corresponding to audio data played back by one or more loudspeakers of one of the plurality of audio devices. Some examples involve making an importance estimate for each echo reference of a plurality of echo references. Making the importance estimate may involve determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment. Some embodiments relate to selecting one or more selected echo references based at least in part on an importance estimate and providing the one or more selected echo references to at least one echo management system.

Description

Echo reference prioritization and selection
Cross Reference to Related Applications
The present application claims priority from U.S. provisional application No. 63/147,573, filed February 9, 2021; U.S. provisional application No. 63/201,939, filed May 19, 2021; and European application No. 21177382.5, filed June 2, 2021, all of which are incorporated herein by reference in their entirety.
Technical Field
The present disclosure relates to devices, systems, and methods for implementing acoustic echo management.
Background
Audio devices with acoustic echo management systems have been widely deployed. The acoustic echo management system may include an acoustic echo canceller and/or an acoustic echo suppressor. While existing devices, systems, and methods for acoustic echo management provide benefits, improved devices, systems, and methods would still be desirable.
Symbols and terms
Throughout this disclosure, including in the claims, the terms "speaker", "loudspeaker", and "audio reproduction transducer" are used synonymously to denote any sound-emitting transducer (or set of transducers). A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), which may be driven by a single common speaker feed or by multiple speaker feeds. In some examples, the speaker feed(s) may undergo different processing in different circuit branches coupled to the different transducers.
Throughout this disclosure, including in the claims, the expression "perform an operation on a signal or data" (performing an operation "on" a signal or data) is used in a broad sense (e.g., filter, scale, transform, or apply gain) to denote performing an operation directly on a signal or data or on a processed version of a signal or data (e.g., a version of a signal that has undergone preliminary filtering or preprocessing prior to performing an operation thereon).
Throughout this disclosure, including in the claims, the expression "system" is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem implementing a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, where the subsystem generates M inputs and the other X-M inputs are received from external sources) may also be referred to as a decoder system.
Throughout this disclosure, including in the claims, the term "processor" is used in a broad sense to mean a system or device that is programmable or otherwise configurable (e.g., in software or firmware) to perform operations on data (e.g., audio or video or other image data). Examples of processors include field programmable gate arrays (or other configurable integrated circuits or chip sets), digital signal processors programmed and/or otherwise configured to perform pipelined processing of audio or other sound data, programmable general purpose processors or computers, and programmable microprocessor chips or chip sets.
Throughout this disclosure, including in the claims, the term "coupled" or "coupled" is used to mean a direct or indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection or through an indirect connection via other devices and connections.
As used herein, a "smart device" is an electronic device that may operate interactively and/or autonomously to some degree, typically configured to communicate with one or more other devices (or networks) via various wireless protocols such as bluetooth, zigbee, near field communication, wi-Fi, light fidelity (Li-Fi), 3G, 4G, 5G, and the like. Some well-known smart device types are smart phones, smart cars, smart thermostats, smart doorbell, smart locks, smart refrigerators, tablet phones and tablet computers, smart watches, smart bracelets, smart key chains, and smart audio devices. The term "smart device" may also refer to a device that exhibits some properties of pervasive computing such as artificial intelligence.
The expression "smart audio device" is used herein to denote a smart device that is a single-purpose audio device or a multi-purpose audio device (e.g., an audio device implementing at least some aspects of the virtual assistant functionality). A single-use audio device is a device that includes or is coupled to at least one microphone (and optionally also includes or is coupled to at least one speaker and/or at least one camera) and is designed largely or primarily to achieve a single use, such as a Television (TV). For example, while a TV may generally play (and be considered capable of playing) audio from program material, in most instances, modern TVs run some operating system on which applications (including television-watching applications) run locally. In this sense, single-use audio devices having speaker(s) and microphone(s) are typically configured to run local applications and/or services to directly use the speaker(s) and microphone(s). Some single-use audio devices may be configured to be combined together to enable playback of audio over a zone or user-configured area.
One common type of multi-purpose audio device is an audio device that implements at least some aspects of virtual assistant functionality, although other aspects of virtual assistant functionality may be implemented by one or more other devices, such as one or more servers with which the multi-purpose audio device is configured to communicate. Such multi-purpose audio devices may be referred to herein as "virtual assistants." A virtual assistant is a device (e.g., a smart speaker or a voice assistant integrated device) that includes or is coupled to at least one microphone (and optionally also includes or is coupled to at least one speaker and/or at least one camera). In some examples, a virtual assistant may provide the ability to use multiple devices (in addition to the virtual assistant itself) for applications that are in a sense cloud-enabled or otherwise not fully implemented in or on the virtual assistant itself. In other words, at least some aspects of virtual assistant functionality (e.g., speech recognition functionality) may be implemented (at least in part) by one or more servers or other devices with which the virtual assistant may communicate via a network, such as the Internet. Virtual assistants may sometimes work together, for example in a discrete and conditionally defined way. For example, two or more virtual assistants may work together in the sense that one of them (e.g., the one most confident that it has heard the wake word) responds to the wake word. In some implementations, the connected virtual assistants may form a constellation that may be managed by one main application, which may be (or may implement) a virtual assistant.
In this document, the "wake word" is used in a broad sense to mean any sound (e.g., a word spoken by a human or other sound), wherein the smart audio device is configured to wake up in response to detecting ("hearing") the sound (using at least one microphone included in or coupled to the smart audio device, or at least one other microphone). In this context, "wake-up" means a state in which the device enters a waiting (in other words, listening) sound command. In some examples, a so-called "wake word" herein may include more than one word, e.g., a phrase.
Herein, the expression "wake word detector" means a device (or means including software for configuring the device to continuously search for an alignment between real-time sound (e.g., speech) features and a training model). Typically, a wake word event is triggered whenever the wake word detector determines that the probability of detecting a wake word exceeds a predefined threshold. For example, the threshold may be a predetermined threshold that is adjusted to give a reasonable tradeoff between false acceptance rate and false rejection rate. After the wake word event, the device may enter a state (which may be referred to as an "awake" state or an "attention" state) in which the device listens for commands and passes the received commands to a larger, more computationally intensive recognizer.
As used herein, the terms "program stream" and "content stream" refer to a collection of one or more audio signals, and in some instances, a collection of video signals, at least portions of which are intended to be heard together. Examples include music selections, movie soundtracks, movies, television programs, audio portions of television programs, podcasts, live voice conversations, synthesized voice responses from intelligent assistants, and the like. In some examples, the content stream may include multiple versions of at least a portion of the audio signal, e.g., the same conversation in more than one language. In such instances, only one version of the audio data or portion thereof (e.g., a version corresponding to a single language) is intended to be reproduced at a time.
Disclosure of Invention
At least some aspects of the present disclosure may be implemented via one or more audio processing methods that manage echo in an audio system. The audio system includes a plurality of audio devices in an audio environment, each of which includes one or more microphones. In some examples, the method(s) may be implemented, at least in part, by a control system of a first device of the plurality of audio devices and/or via instructions (e.g., software) stored on one or more non-transitory media of the first device. The first device may include one or more microphones. Some such methods involve obtaining, by the control system of the first device, a plurality of echo references. The plurality of echo references may include at least one echo reference for each of the plurality of audio devices in the audio environment. Each echo reference may correspond to audio data played back by one or more loudspeakers of a corresponding audio device of the plurality of audio devices. The plurality of echo references includes at least one echo reference of the first device.
The method may involve making, by the control system, an importance estimate for each echo reference of the plurality of echo references. In some examples, making the importance estimate may involve determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment. The echo management system(s) may, for example, include an Acoustic Echo Canceller (AEC) and/or an Acoustic Echo Suppressor (AES).
The method may involve selecting, by the control system and based at least in part on the importance estimate, one or more echo references from the plurality of echo references. The selected echo references may be a subset, comprising one or more, of the full plurality of echo references. The method may involve providing, by the control system, the one or more selected echo references to the at least one echo management system. In some examples, the method may involve causing the at least one echo management system to cancel or suppress echo based, at least in part, on the one or more selected echo references.
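The flow just described can be summarized in a short sketch. The following Python is illustrative only: the names (EchoReference, estimate_importance, max_references) are assumptions made for this sketch and do not appear in the disclosure, and a real implementation would also weigh the cost and performance considerations discussed below.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class EchoReference:
        device_id: str        # the audio device whose playback this reference models
        samples: List[float]  # a replica, or a lower-fidelity representation

    def select_echo_references(
        references: List[EchoReference],
        estimate_importance: Callable[[EchoReference], float],
        max_references: int,
    ) -> List[EchoReference]:
        """Obtain -> estimate importance -> select -> provide to the EMS."""
        # Importance estimate: the expected contribution of each reference
        # to echo mitigation by the echo management system.
        scored = [(estimate_importance(ref), ref) for ref in references]
        # Select a subset, highest expected contribution first.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [ref for _, ref in scored[:max_references]]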
According to some examples, the audio devices of the audio system may be communicatively coupled via a wired or wireless communication network. The plurality of echo references (e.g., non-local echo references of audio devices other than the first device, and/or echo references of the first device) may be obtained via the wired or wireless communication network.
According to some examples, obtaining the plurality of echo references may involve receiving a content stream comprising audio data and determining one or more of the plurality of echo references based on the audio data.
In some embodiments, the control system may be or may include an audio device control system for audio devices in the audio environment. In some such implementations, the method may involve rendering, by the audio device control system, audio data for reproduction on the audio device, thereby producing a local speaker feed signal. In some such embodiments, the method may involve determining a local echo reference corresponding to the local speaker feed signal.
In some examples, obtaining the plurality of echo references may involve determining one or more non-local echo references based on the audio data. In some such examples, each non-local echo reference may correspond to a non-local speaker feed for playback on another audio device of the audio environment.
According to some examples, obtaining the plurality of echo references may involve receiving one or more non-local echo references. In some such examples, each non-local echo reference may correspond to a non-local speaker feed for playback on another audio device of the audio environment. In some examples, receiving one or more non-local echo references may involve receiving the one or more non-local echo references from one or more other audio devices of the audio environment. In some examples, receiving one or more non-local echo references may involve receiving each of the one or more non-local echo references from a single other device of the audio environment.
In some examples, the method may involve a cost determination. According to some examples, the cost determination may involve determining a cost of at least one echo reference of the plurality of echo references. In some examples, the selection of the one or more selected echo references may be based, at least in part, on the cost.
According to some examples, the cost determination may be based on the network bandwidth required to transmit the at least one echo reference, the encoding computation required to encode the at least one echo reference, the decoding computation required to decode the at least one echo reference, the computation required for the echo management system to use the at least one echo reference, or one or more combinations thereof.
In some examples, the cost determination may be based on a replica of the at least one echo reference in the time or frequency domain, a downsampled version of the at least one echo reference, a lossy compression of the at least one echo reference, banded power information for the at least one echo reference, or one or more combinations thereof. According to some examples, the cost determination may be based on a method that compresses relatively more important echo references less than relatively less important echo references.
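As a hedged illustration of such a cost determination, the sketch below combines the factors listed above into a single scalar; the weights and units are assumptions made for illustration, not values from the disclosure.

    def reference_cost(
        bits_per_second: float,  # network bandwidth needed to transmit the reference
        encode_mips: float,      # computation needed to encode it (if sent coded)
        decode_mips: float,      # computation needed to decode it at the receiver
        ems_mips: float,         # computation for the echo management system to use it
        w_net: float = 1.0,
        w_cpu: float = 0.5,
    ) -> float:
        """Combine per-reference costs into one comparable scalar (illustrative weights)."""
        return w_net * bits_per_second + w_cpu * (encode_mips + decode_mips + ems_mips)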
In some examples, the method may involve determining a current echo management system performance level. According to some examples, selecting the one or more selected echo references may be based at least in part on the current echo management system performance level.
According to some examples, making the importance estimate may involve determining an importance metric for the corresponding echo reference. In some such examples, determining the importance metric may involve determining a level of the corresponding echo reference, determining a uniqueness of the corresponding echo reference, determining a time duration of the corresponding echo reference, determining an audibility of the corresponding echo reference, or one or more combinations thereof.
In some examples, determining the importance metric may be based at least in part on data or metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmix matrix, a loudspeaker activation matrix, or one or more combinations thereof.
According to some examples, determining the importance metric may be based at least in part on a current listening objective, a current ambient noise estimate, an estimate of a current performance of the at least one echo management system, or one or more combinations thereof.
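One possible way to combine such characteristics into a single importance metric is sketched below; the use of signal level, a correlation-based uniqueness proxy, and a noise-floor-based audibility proxy are assumptions made for illustration and do not reproduce the disclosed method.

    import numpy as np

    def importance_metric(ref: np.ndarray, others: list,
                          noise_floor_db: float = -60.0) -> float:
        """Illustrative importance metric; all signals assumed equal-length 1-D arrays."""
        level_db = 10.0 * np.log10(np.mean(ref ** 2) + 1e-12)
        audibility = max(0.0, level_db - noise_floor_db)  # crude audibility proxy
        # Uniqueness proxy: low correlation with every other reference suggests
        # this reference carries echo energy that the others cannot predict.
        if others:
            corrs = [abs(np.corrcoef(ref, o)[0, 1]) for o in others]
            uniqueness = 1.0 - max(corrs)
        else:
            uniqueness = 1.0
        return audibility * uniqueness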
Some or all of the operations, functions, and/or methods described herein may be performed by one or more devices in accordance with instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as the memory devices described herein, including but not limited to Random Access Memory (RAM) devices, Read-Only Memory (ROM) devices, and the like. Thus, some innovative aspects of the subject matter described in this disclosure can be implemented via one or more non-transitory media having software stored thereon.
At least some aspects of the present disclosure may be implemented via an apparatus. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some embodiments, the apparatus is or includes an audio processing system having an interface system and a control system. The control system may include one or more general purpose single- or multi-chip processors, Digital Signal Processors (DSPs), Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or a combination thereof.
The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
Drawings
Like reference numbers and designations in the various drawings indicate like elements.
Fig. 1A is a block diagram illustrating an example of components of an apparatus capable of implementing various aspects of the present disclosure.
Fig. 1B illustrates an example of an audio environment.
Fig. 1C and 1D illustrate examples of how audio devices 110A-110C may receive playback channels.
Fig. 1E illustrates another example of an audio environment.
Fig. 2A presents a block diagram of an audio device capable of performing at least some of the disclosed embodiments.
Fig. 2B and 2C illustrate additional examples of audio devices in an audio environment.
Fig. 3A presents a block diagram illustrating components of an audio device according to one example.
Fig. 3B and 3C are graphs showing examples of expected echo management performance versus the number of echo references used for echo management.
Fig. 4 presents a block diagram illustrating components of an echo reference orchestrator according to one example.
Fig. 5A is a flow chart summarizing one example of a disclosed method.
Fig. 5B is a flow chart summarizing another example of the disclosed methods.
Fig. 6 is a flow chart summarizing one example of a disclosed method.
Fig. 7 shows an example of a plan view of an audio environment, which in this example is a living space.
Detailed Description
Fig. 1A is a block diagram illustrating an example of components of an apparatus capable of implementing various aspects of the present disclosure. As with the other figures provided herein, the types and numbers of elements shown in fig. 1A are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements. According to some examples, the apparatus 50 may be configured to perform at least some of the methods disclosed herein. In some implementations, the apparatus 50 may be or may include one or more components of an audio system. For example, in some embodiments, the apparatus 50 may be an audio device, such as a smart audio device. In other examples, apparatus 50 may be a mobile device (e.g., a cellular telephone), a laptop computer, a tablet computer device, a television, or other type of device.
According to some alternative embodiments, the apparatus 50 may be or may include a server. In some such examples, the apparatus 50 may be or may include an encoder. Thus, in some examples, the apparatus 50 may be a device configured for use within an audio environment, such as a home audio environment, while in other examples, the apparatus 50 may be a device configured for use in a "cloud", e.g., a server.
In this example, the apparatus 50 includes an interface system 55 and a control system 60. In some implementations, the interface system 55 can be configured to communicate with one or more other devices of the audio environment. In some examples, the audio environment may be a home audio environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, and so forth. In some implementations, the interface system 55 can be configured to exchange control information and associated data with audio devices of an audio environment. In some examples, the control information and associated data may relate to one or more software applications being executed by the apparatus 50.
In some implementations, the interface system 55 can be configured for receiving a content stream or for providing a content stream. The content stream may include audio data. The audio data may include, but is not limited to, audio signals. In some examples, the audio data may include spatial data such as channel data and/or spatial metadata. For example, the metadata may be provided by a device that may be referred to herein as an "encoder." In some examples, the content stream may include video data and audio data corresponding to the video data.
The interface system 55 may include one or more network interfaces and/or one or more external device interfaces (e.g., one or more Universal Serial Bus (USB) interfaces). According to some embodiments, interface system 55 may include one or more wireless interfaces. The interface system 55 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system, and/or a gesture sensor system. In some examples, interface system 55 may include one or more interfaces between control system 60 and a memory system (such as optional memory system 65 shown in fig. 1A). However, in some examples, control system 60 may include a memory system. In some implementations, the interface system 55 may be configured to receive input from one or more microphones in an environment.
For example, control system 60 may include a general purpose single or multi-chip processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.
In some embodiments, control system 60 may reside in more than one device. For example, in some implementations, a portion of the control system 60 may reside in a device within one of the environments depicted herein, and another portion of the control system 60 may reside in a device outside of the environment, such as a server, a mobile device (e.g., a smart phone or tablet computer), or the like. In other examples, a portion of control system 60 may reside in a device within one of the environments depicted herein, and another portion of control system 60 may reside in one or more other devices of the environment. For example, the functionality of the control system may be distributed across multiple smart audio devices of the environment, or may be shared by an orchestrating device (such as what may be referred to herein as a smart home hub) and one or more other devices of the environment. In other examples, a portion of control system 60 may reside in a device (e.g., a server) implementing a cloud-based service, and another portion of control system 60 may reside in another device (e.g., another server, a memory device, etc.) implementing the cloud-based service. In some examples, interface system 55 may also reside in more than one device.
In some embodiments, the control system 60 may be configured to at least partially perform the methods disclosed herein. According to some examples, the control system 60 may be configured to obtain a plurality of echo references. The plurality of echo references may include at least one echo reference for each of a plurality of audio devices in the audio environment. Each echo reference may, for example, correspond to audio data played back by one or more loudspeakers of one of the plurality of audio devices.
In some implementations, the control system 60 may be configured to make an importance estimate for each of a plurality of echo references. In some examples, making the importance estimate may involve determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment. The echo management system(s) may include an Acoustic Echo Canceller (AEC) and/or an Acoustic Echo Suppressor (AES).
According to some examples, control system 60 may be configured to select one or more selected echo references based at least in part on the importance estimate. In some examples, control system 60 may be configured to provide one or more selected echo references to at least one echo management system.
Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as the memory devices described herein, including but not limited to Random Access Memory (RAM) devices, read Only Memory (ROM) devices, and the like. One or more non-transitory media may reside, for example, in the optional memory system 65 and/or the control system 60 shown in fig. 1A. Accordingly, various innovative aspects of the subject matter described in this disclosure can be implemented in one or more non-transitory media having software stored thereon. For example, the software may include instructions for controlling at least one device to perform some or all of the methods disclosed herein. For example, the software may be executed by one or more components of a control system, such as control system 60 of FIG. 1A.
In some examples, the apparatus 50 may include an optional microphone system 70 shown in fig. 1A. The optional microphone system 70 may include one or more microphones. According to some examples, optional microphone system 70 may include a microphone array. In some examples, the microphone array may be configured to determine direction of arrival (DOA) and/or time of arrival (TOA) information, for example, according to instructions from the control system 60. In some examples, the microphone array may be configured for receive-side beamforming, e.g., according to instructions from the control system 60. In some implementations, one or more microphones may be part of or associated with another device (e.g., a speaker of a speaker system, a smart audio device, etc.). In some examples, the apparatus 50 may not include the microphone system 70. However, in some such embodiments, the apparatus 50 may still be configured to receive microphone data for one or more microphones in an audio environment via the interface system 55. In some such embodiments, a cloud-based implementation of the apparatus 50 may be configured to receive microphone data, or data corresponding to microphone data, from one or more microphones in an audio environment via the interface system 55.
According to some embodiments, the apparatus 50 may include an optional loudspeaker system 75 shown in fig. 1A. The optional loudspeaker system 75 may include one or more loudspeakers, which may also be referred to herein as "speakers" or more generally as "audio reproduction transducers". In some examples (e.g., cloud-based implementations), the apparatus 50 may not include the loudspeaker system 75.
In some embodiments, the apparatus 50 may include an optional sensor system 80 shown in fig. 1A. The optional sensor system 80 may include one or more touch sensors, gesture sensors, motion detectors, and the like. According to some embodiments, the optional sensor system 80 may include one or more cameras. In some implementations, the camera may be a standalone camera. In some examples, one or more cameras of the optional sensor system 80 may reside in a smart audio device, which may be a single-purpose audio device or a virtual assistant. In some such examples, one or more cameras of the optional sensor system 80 may reside in a television, mobile phone, or smart speaker. In some examples, the apparatus 50 may not include the sensor system 80. However, in some such embodiments, the apparatus 50 may still be configured to receive sensor data for one or more sensors in the audio environment via the interface system 55.
In some embodiments, the apparatus 50 may include an optional display system 85 shown in fig. 1A. The optional display system 85 may include one or more displays, such as one or more Light Emitting Diode (LED) displays. In some examples, optional display system 85 may include one or more Organic Light Emitting Diode (OLED) displays. In some examples, optional display system 85 may include one or more displays of a smart audio device. In other examples, optional display system 85 may include a television display, a laptop computer display, a mobile device display, or another type of display. In some examples where apparatus 50 includes display system 85, sensor system 80 may include a touch sensor system and/or a gesture sensor system proximate to one or more displays of display system 85. According to some such embodiments, control system 60 may be configured to control display system 85 to present one or more Graphical User Interfaces (GUIs).
According to some such examples, apparatus 50 may be or may include a smart audio device. In some such embodiments, the apparatus 50 may be or may include a wake word detector. For example, the apparatus 50 may be or may include a virtual assistant.
Stereo or mono playback media has traditionally been rendered into an audio environment (e.g., a living space, an automobile, or an office space) via a pair of speakers connected by physical cables to an audio player (e.g., a CD/DVD player, a television (TV), etc.). With the popularity of smart speakers, users typically have more than two audio devices capable of playing back audio in their home (or other audio environment), which may include, but are not limited to, smart speakers or other smart audio devices configured for wireless communication.
Smart speakers are typically configured to operate in accordance with voice commands. Thus, such smart speakers are typically configured to listen for a wake word, which will typically be followed by a voice command. Any continuous listening task (such as waiting for a wake word, or performing any type of "continuous calibration") should preferably continue to run while content playback (such as music playback, or the soundtracks of movies and television programs) and device interactions (e.g., a telephone conversation) are occurring. Audio devices that need to listen during content playback typically need to employ some form of echo management, such as echo cancellation and/or echo suppression, to remove the "echo" (the content played by the device) from the microphone signal.
Fig. 1B illustrates an example of an audio environment. As with the other figures provided herein, the types, numbers, and arrangements of elements shown in fig. 1B are provided as examples only. Other embodiments may include more, fewer, and/or different types, numbers, and/or arrangements of elements.
According to this example, audio environment 100 includes audio devices 110A, 110B, and 110C. In this example, each of the audio devices 110A-110C is an example of the apparatus 50 of FIG. 1A and includes an example of the microphone system 70 and the loudspeaker system 75, but these are not shown in FIG. 1B. According to some examples, each audio device 110A-110C may be a smart audio device, such as a smart speaker.
In this example, audio devices 110A-110C play back audio content while person 130 is speaking. The microphone of audio device 110B detects not only audio content played back by its own speaker, but also voice sounds 131 of person 130 and audio content played back by audio devices 110A and 110C.
In order to utilize as many speakers as possible at the same time, a typical approach is to have all of the audio devices in the audio environment play back the same content, using some timing mechanism to keep the playback media synchronized. This has the advantage of simplifying distribution, because all devices receive the same copy of the playback media, whether that media is downloaded to or streamed to each audio device, or broadcast or multicast by one device to all of the audio devices.
One major drawback of this approach is that no spatial effect is obtained. A spatial effect may be achieved by adding more playback channels (e.g., one for each speaker), for example by upmixing. In some examples, the spatial effect may be implemented via a flexible rendering process such as Center of Mass Amplitude Panning (CMAP), Flexible Virtualization (FV), or a combination of CMAP and FV. Relevant examples of CMAP, FV, and combinations thereof are described in International Patent Publication No. WO 2021/021707 A1 (e.g., pages 25-41), which is hereby incorporated by reference.
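For orientation only, the toy panner below illustrates the general idea of position-driven per-loudspeaker activation; it is a simple distance-based stand-in written for this sketch, not the CMAP or FV methods of WO 2021/021707 A1.

    import numpy as np

    def toy_flexible_render(source_xy: np.ndarray, speaker_xy: np.ndarray) -> np.ndarray:
        """Return one gain per loudspeaker for a source at source_xy.
        Gains fall off with distance and are normalized to constant total power."""
        dists = np.linalg.norm(speaker_xy - source_xy, axis=1) + 1e-6
        gains = 1.0 / dists
        return gains / np.linalg.norm(gains)

    # Example: of three speakers, the one nearest the source gets the largest gain.
    # toy_flexible_render(np.array([0.1, 0.0]),
    #                     np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]]))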
Fig. 1C and 1D illustrate additional examples of audio devices in an audio environment. According to these examples, audio environment 100 includes smart home hub 105 and audio devices 110A, 110B, and 110C. In these examples, smart home hub 105 and audio devices 110A-110C are examples of apparatus 50 of fig. 1A. According to these examples, each of the audio devices 110A-110C includes a corresponding one of the loudspeakers 121A, 121B, and 121C. According to some examples, each audio device 110A-110C may be a smart audio device, such as a smart speaker.
Fig. 1C and 1D illustrate examples of how the audio devices 110A-110C may receive playback channels. In fig. 1C, the encoded audio bitstream is multicast to all of the audio devices 110A-110C. In fig. 1D, each of the audio devices 110A-110C receives only the channels that it requires for playback. The choice of bitstream distribution may vary from implementation to implementation, and may be based on, for example, the available system bandwidth, the efficiency of the audio codec used, the capabilities of the audio devices 110A-110C, and/or other factors. The exact topology of the audio environment shown in fig. 1C and 1D is not important. These examples do, however, illustrate the fact that distributing audio channels to audio devices incurs some cost. The cost may be assessed in terms of the network bandwidth required, the computational cost added by encoding and decoding the audio channels, and so on.
Fig. 1E illustrates another example of an audio environment. According to this example, audio environment 100 includes audio devices 110A, 110B, 110C, and 110D. In this example, each of the audio devices 110A-110D is an example of the apparatus 50 of fig. 1A and includes at least one microphone (see microphones 120A, 120B, 120C, and 120D) and at least one loudspeaker (see loudspeakers 121A, 121B, 121C, and 121D). According to some examples, each audio device 110A-110D may be a smart audio device, such as a smart speaker.
In this example, audio devices 110A-110D render content 122A, 122B, 122C, and 122D via loudspeakers 121A-121D. Each of the microphones 120A-120D detects an "echo" corresponding to content 122A-122D played back by each of the audio devices 110A-110D. In this example, audio devices 110A-110D are configured to listen for commands or wake words in speech 131 from person 130 within audio environment 100.
Fig. 2A presents a block diagram of an audio device capable of performing at least some of the disclosed embodiments. As with the other figures provided herein, the types, numbers, and arrangements of elements shown in fig. 2A are provided as examples only. Other embodiments may include more, fewer, and/or different types, numbers, and/or arrangements of elements. In this example, audio device 110A is an example of audio device 110A of fig. 1E. Here, the audio device 110A includes a control system 60A, which is an example of the control system 60 of fig. 1A. According to this embodiment, the control system 60A is able to listen for the voice 131 of the person 130 in the presence of echoes corresponding to the content 122A, 122B, 122C, and 122D played back by each audio device in the audio environment 100.
According to this example, control system 60A implements renderer 201A, a multi-channel acoustic echo management system (MC-EMS) 203A, and a speech processing block 240A. The MC-EMS 203A may include an Acoustic Echo Canceller (AEC), an Acoustic Echo Suppressor (AES), or both an AEC and an AES, depending on the particular implementation. According to this example, the speech processing block 240A is configured to detect the user's wake words and commands. In some implementations, the speech processing block 240A may be configured to support a communication session, such as a telephone call.
In this embodiment, the renderer 201A is configured to provide the local echo reference 220A to the MC-EMS 203A. The local echo reference 220A corresponds to (and in this example is identical to) the speaker feed provided to the loudspeaker 121A for playback by the audio device 110A. According to this example, the renderer 201A is also configured to provide non-local echo references 221A (corresponding to the content 122B, 122C, and 122D played back by the other audio devices in the audio environment 100) to the MC-EMS 203A.
According to some examples, audio device 110A receives a combined bitstream that includes audio data for all of the audio devices 110A-110D of fig. 1E (e.g., as shown in fig. 1C). In some such examples, the renderer 201A may be configured to separate the local echo reference 220A from the non-local echo references 221A, to provide the local echo reference 220A (the local speaker feed) to the loudspeaker 121A, and to provide both the local echo reference 220A and the non-local echo references 221A to the MC-EMS 203A. In some alternative examples, audio device 110A may receive a bitstream intended for playback only on audio device 110A, e.g., as shown in fig. 1D. In some such examples, the smart home hub 105 (or one of the other audio devices 110B-110D) may provide the non-local echo references 221A to the audio device 110A, as indicated by the dashed arrow next to reference 221A in fig. 2A.
In some examples, the local echo reference 220A and/or the non-local echo references 221A may be full-fidelity replicas of the speaker feeds provided to the loudspeakers 121A-121D for playback. In some alternative examples, the local echo reference 220A and/or the non-local echo references 221A may be lower-fidelity representations of the speaker feed signals provided to the loudspeakers 121A-121D for playback. In some such examples, the non-local echo references 221A may be downsampled versions of the speaker feeds provided to the loudspeakers 121B-121D for playback. According to some examples, the non-local echo references 221A may be lossy compressions of the speaker feed signals provided to the loudspeakers 121B-121D for playback. In some examples, the non-local echo references 221A may be banded power information corresponding to the speaker feed signals provided to the loudspeakers 121B-121D for playback.
According to this embodiment, the MC-EMS 203A is configured to use the local echo reference 220A and the non-local echo references 221A to predict, and then cancel and/or suppress, the echo in the microphone signal 223A, thereby generating a residual signal 224A in which the speech-to-echo ratio (SER) may be improved relative to the microphone signal 223A. The residual signal 224A may enable the speech processing block 240A to detect the user's wake words and commands. In some implementations, the speech processing block 240A may be configured to support a communication session, such as a telephone call.
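As a point of reference for how a multi-channel canceller can consume several echo references at once, the sketch below implements a textbook multi-channel normalized-LMS (NLMS) echo canceller. It is a generic baseline under assumed filter length and step size, not the MC-EMS 203A itself.

    import numpy as np

    def mc_nlms(mic: np.ndarray, refs: np.ndarray, taps: int = 256,
                mu: float = 0.5, eps: float = 1e-6) -> np.ndarray:
        """mic: (N,) microphone samples; refs: (C, N) echo references.
        Returns the residual (microphone minus predicted echo)."""
        C, N = refs.shape
        w = np.zeros((C, taps))          # one adaptive FIR filter per reference
        residual = np.zeros(N)
        for n in range(taps, N):
            frames = refs[:, n - taps:n][:, ::-1]  # most-recent-first tap lines
            y = np.sum(w * frames)                 # predicted echo at sample n
            e = mic[n] - y                         # residual after cancellation
            w += (mu * e / (np.sum(frames ** 2) + eps)) * frames  # NLMS update
            residual[n] = e
        return residual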
Some aspects of the present disclosure relate to making an importance estimate for each of a plurality of echo references (e.g., for local echo reference 220A and non-local echo reference 221A). Making the importance estimate may involve determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment (e.g., echo mitigation by MC-EMS203A of audio device 110A). Various examples are provided below.
According to some examples, in the context of distributed and orchestrated devices, each audio device may obtain, for echo management purposes, not only its own echo reference but also echo references corresponding to content played back by one or more other audio devices in the audio environment. The impact of including a particular echo reference in a local echo management system or "EMS" (such as the MC-EMS 203A of audio device 110A) may vary depending on a number of parameters, such as the diversity of the audio content being played out, the network bandwidth required to transmit the echo reference, the encoding computation required to encode the echo reference (where an encoded echo reference is transmitted), the decoding computation required to decode the echo reference, the computation required for the echo management system to use the echo reference, the relative audibility of the audio devices, and so on.
For example, if each audio device is rendering the same content (in other words, if mono audio is being played back), providing an additional reference to the EMS has little (though non-zero) benefit. Furthermore, due to practical limitations (such as bandwidth-limited networks), it may not be desirable for all devices to share replicas of their local echo references. Thus, some embodiments may provide a Distributed and Orchestrated EMS (DOEMS), in which echo references are prioritized and transmitted (or not transmitted) accordingly. Some such examples may implement a tradeoff between the cost (e.g., the required network bandwidth and/or the required computational overhead) and the benefit (e.g., the expected echo mitigation improvement, which may be measured in terms of the speech-to-echo ratio (SER) and/or the echo return loss enhancement (ERLE)) of each additional echo reference.
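The two benefit metrics named above can be computed as follows under their usual textbook definitions (the disclosure itself does not give formulas): SER compares speech power to residual echo power, and ERLE compares echo power at the microphone before and after echo management.

    import numpy as np

    def power_db(x: np.ndarray) -> float:
        return 10.0 * np.log10(np.mean(x ** 2) + 1e-12)

    def ser_db(speech: np.ndarray, residual_echo: np.ndarray) -> float:
        """Speech-to-echo ratio, in dB."""
        return power_db(speech) - power_db(residual_echo)

    def erle_db(mic_echo: np.ndarray, residual_echo: np.ndarray) -> float:
        """Echo return loss enhancement, in dB."""
        return power_db(mic_echo) - power_db(residual_echo)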
Fig. 2B and 2C illustrate additional examples of audio devices in an audio environment. According to these examples, audio environment 100 includes smart home hub 105 and audio devices 110A, 110B, and 110C. In these examples, smart home hub 105 and audio devices 110A-110C are examples of apparatus 50 of fig. 1A. According to these examples, each of the audio devices 110A-110C includes a corresponding one of the microphones 120A, 120B, and 120C and a corresponding one of the loudspeakers 121A, 121B, and 121C. According to some examples, each audio device 110A-110C may be a smart audio device, such as a smart speaker.
In fig. 2B, the smart home hub 105 sends the same encoded audio bitstream to all of the audio devices 110A-110C. In fig. 2C, the smart home hub 105 transmits only the audio channels required for playback by each of the audio devices 110A-110C. In both examples, audio channel 0 is intended for playback on audio device 110A, audio channel 1 is intended for playback on audio device 110B, and audio channel 2 is intended for playback on audio device 110C.
Fig. 2B and 2C illustrate examples of sharing echo reference data over a local network. In these examples, audio device 110A sends echo reference 220A', an echo reference corresponding to its own loudspeaker playback, to audio devices 110B and 110C over the local network. In these examples, the echo reference 220A' is different from the channel 0 audio found in the bitstream. In some examples, the echo reference 220A' may differ from the channel 0 audio because playback post-processing is implemented on the audio device 110A. In the example shown in fig. 2C, the combined bitstream is not provided to all of the audio devices 110A-110C, and so another device (such as the audio device 110A or the smart home hub 105) provides the echo reference 220A'. In the scenario depicted in fig. 2B, the combined bitstream is provided to all of the audio devices 110A-110C; even so, in some such instances it may still be desirable to transmit the echo reference 220A'.
In other examples, the echo reference 220A 'may be different from channel 0 audio because the echo reference 220A' may not be a full fidelity replica of the audio data played back on the audio device 110A. In some such examples, the echo reference 220A 'may correspond to audio data played back on the audio device 110A, but may require relatively less data than a complete replica, and thus may consume relatively less local network bandwidth when the echo reference 220A' is transmitted.
According to some such examples, the audio device 110A may be configured to generate a downsampled version of the local echo reference 220A described above with reference to fig. 2A. In some such examples, the echo reference 220A' may be or may include that downsampled version.
In some examples, the audio device 110A may be configured to lossy compress the local echo reference 220A. In such an instance, the echo reference 220A' may be the result of the control system 60A applying a lossy compression algorithm to the local echo reference 220A.
According to some examples, audio device 110A may be configured to provide audio devices 110B and 110C with banded power information corresponding to the local echo reference 220A. In some such examples, instead of transmitting a full-fidelity replica of the audio data played back on audio device 110A, control system 60A may be configured to determine the power level in each of a plurality of frequency bands of the audio data played back on audio device 110A, and to transmit the corresponding banded power information to audio devices 110B and 110C. In some such examples, the echo reference 220A' may be or may include the banded power information.
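A minimal sketch of producing such banded power information from a speaker feed is given below; the frame size and the linear band edges are assumptions made for illustration.

    import numpy as np

    def banded_power(feed: np.ndarray, frame: int = 512, n_bands: int = 8) -> np.ndarray:
        """Return per-frame, per-band power (n_frames x n_bands) for transmission
        in place of a full-fidelity replica of the speaker feed."""
        n_frames = len(feed) // frame
        edges = np.linspace(0, frame // 2 + 1, n_bands + 1, dtype=int)  # linear bands
        out = np.zeros((n_frames, n_bands))
        for i in range(n_frames):
            spec = np.abs(np.fft.rfft(feed[i * frame:(i + 1) * frame])) ** 2
            for b in range(n_bands):
                out[i, b] = np.mean(spec[edges[b]:edges[b + 1]])
        return out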
Fig. 3A presents a block diagram illustrating components of an audio device according to one example. As with the other figures provided herein, the types, numbers, and arrangements of elements shown in fig. 3A are provided as examples only. Other embodiments may include more, fewer, and/or different types, numbers, and/or arrangements of elements. For example, some implementations may be configured to send and/or receive "raw" echo references (complete, full-fidelity replicas of the audio being rendered on an audio device) or low-fidelity versions or representations of that audio (such as downsampled versions, versions produced by lossy compression, or banded power information corresponding to the audio being rendered on the audio device), but not both the raw version and the low-fidelity version.
In this example, the audio device 110A is an example of the audio device 110A of fig. 1E and includes a control system 60A, which is an example of the control system 60 of fig. 1A. According to this example, control system 60A is configured to implement a renderer 201A, a multi-channel acoustic echo management system (MC-EMS) 203A, a speech processing block 240A, an echo reference orchestrator 302A, a decoder 303A, and a noise estimator 304A. The reader may assume that the MC-EMS203A and the speech processing block 240A function as described above with reference to FIG. 2A, unless otherwise indicated by the following description of FIG. 3A. In this example, network interface 301A is an example of interface system 55 described above with reference to FIG. 1A.
In this example, the elements of fig. 3A are as follows:
110A: an audio device;
120A: a representative microphone. In some implementations, the audio device 110A may have more than one microphone;
121A: representative microphones. In some implementations, the audio device 110A may have more than one loudspeaker;
201A: a renderer that generates the reference for local playback and echo references that model the audio played back by other audio devices in the audio environment;
203A: a multi-channel acoustic echo management system (MC-EMS) that may include an Acoustic Echo Canceller (AEC) and/or an Acoustic Echo Suppressor (AES);
220A: local echo references for playback and cancellation;
221A: locally generated copies of echo references being played by one or more non-local audio devices (one or more other audio devices in an audio environment);
223A: a plurality of microphone signals;
224A: a plurality of residual signals (the microphone signals after the MC-EMS 203A has cancelled and/or suppressed the predicted echo);
240A: a voice processing block configured for wake word detection, voice command detection and/or providing telephony communications;
301A: a network interface configured for communication between audio devices, which may also be configured for communication via the internet and/or via one or more cellular networks;
302A: an echo reference orchestrator configured to rank the echo references and select an appropriate set of one or more echo references;
303A: an audio decoder block;
304A: a noise estimator block;
310A: one or more decoded echo references received by audio device 110A from one or more other devices in the audio environment;
311A: a request, sent over the local network, for the transmission of echo references from one or more other devices (such as a smart home hub or one or more of the audio devices 110B-110D);
312A: metadata, which may be or may include metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmix matrix, and/or a loudspeaker activation matrix;
313A: echo references selected by the echo reference orchestrator 302A;
314A: echo references received by device 110A from one or more other devices;
315A: echo references sent from device 110A to other devices;
316A: raw echo references received by device 110A from one or more other devices of the audio environment;
317A: low-fidelity (e.g., coded) versions of echo references received by device 110A from one or more other devices of the audio environment;
318A: an audio environmental noise estimate;
350A: one or more indicators of the current performance of the MC-EMS 203A, which may be or may include adaptive filter coefficient data or other AEC statistics, speech-to-echo ratio (SER) data, and the like.
Echo reference orchestrator 302A may function in various ways, depending on the particular implementation. Many examples are disclosed herein. In some examples, the echo reference orchestrator 302A may be configured to make an importance estimate for each of a plurality of echo references (e.g., for the local echo reference 220A and the non-local echo references 221A). Making the importance estimate may involve determining the expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment (e.g., echo mitigation by the MC-EMS 203A of audio device 110A).
Some examples of making the importance estimate may involve determining an importance metric. In some such examples, the importance metric may be based at least in part on one or more characteristics of each echo reference, such as level, uniqueness, time duration, audibility, or one or more combinations thereof. In some examples, the importance metric may be based at least in part on metadata (e.g., metadata 312A), such as metadata corresponding to the audio device layout, loudspeaker metadata, metadata corresponding to the received audio data, an upmix matrix, a loudspeaker activation matrix, or one or more combinations thereof. In some examples, the importance metric may be based at least in part on a current listening objective, a current ambient noise estimate, an estimate of a current performance of the at least one echo management system, or one or more combinations thereof.
According to some examples, echo reference orchestrator 302A may be configured to select a set of one or more echo references based at least in part on a cost determination. In some examples, echo reference orchestrator 302A may be configured to make the cost determination, while in other examples another block of the control system 60A may be configured to make it. In some examples, the cost determination may involve determining a cost of at least one of the plurality of echo references or, in some cases, of each of the plurality of echo references. In some examples, the cost determination may be based on the network bandwidth required to transmit the echo reference, the encoding computation required to encode the at least one echo reference, the decoding computation required to decode the at least one echo reference, the downsampling cost of making a downsampled version of the echo reference, the computation required for the echo management system to use the at least one echo reference, or one or more combinations thereof.
According to some examples, the cost determination may be based on a replica of the at least one echo reference in the time or frequency domain, a downsampled version of the at least one echo reference, a lossy compression of the at least one echo reference, banded power information for the at least one echo reference, or one or more combinations thereof. In some examples, the cost determination may be based on a method that compresses relatively more important echo references less than relatively less important echo references. In some implementations, the echo reference orchestrator 302A (or another block of the control system 60A) may be configured to determine a current echo management system performance level (e.g., based at least in part on the indicator(s) 350A). In some such examples, selecting the one or more selected echo references may be based at least in part on the current echo management system performance level.
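One way such performance-driven selection might behave is sketched below: the number of selected references is nudged up while measured performance (e.g., derived from indicator(s) 350A) is below a target, and nudged down when it is comfortably above. The ERLE target and the 3 dB hysteresis are assumptions made for illustration.

    def adjust_selection(current_erle_db: float, target_erle_db: float,
                         n_selected: int, n_available: int) -> int:
        """Return the new number of echo references to select."""
        if current_erle_db < target_erle_db - 3.0 and n_selected < n_available:
            return n_selected + 1  # underperforming: add the next-most-important reference
        if current_erle_db > target_erle_db + 3.0 and n_selected > 1:
            return n_selected - 1  # comfortably above target: shed network/compute cost
        return n_selected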
Depending on the distributed audio device system, its configuration, the type of audio session (e.g., communication or listening to music), and/or the nature of the rendered content, the rate at which the importance of each echo reference is estimated and the rate at which the echo reference set is selected may differ. Furthermore, the rate at which importance is estimated need not equal the rate at which the echo reference selection process makes decisions. If the two are not synchronized, in some examples the importance calculations will be the more frequent of the two. In some examples, the echo reference selection may be a discrete process, in which a binary use/do-not-use decision is made for each echo reference.
Figs. 3B and 3C are graphs showing examples of the relationship between expected echo management performance and the number of echo references used for echo management. In fig. 3B, it can be seen that as additional references are added, the expected echo management performance improves. However, in this example, there are only a few discrete points at which the system can operate. In some examples, the points shown in fig. 3B may correspond to processing a complete, full-fidelity replica of each echo reference. For example, point 301 may correspond to an instance of processing a local echo reference (e.g., local reference 220A of fig. 2A or 3A), and point 310 may correspond to an instance of receiving a complete replica of a first non-local echo reference (e.g., a full-fidelity version of one of the received echo references 314A of fig. 3A, which may have been selected as the most important non-local echo reference) and processing both the local echo reference and the complete replica of the first non-local echo reference.
FIG. 3C illustrates one example of operation between any two of the discrete operating points shown in FIG. 3B. The lines connecting the points in fig. 3C may, for example, correspond to a range of echo reference fidelities, including lower-fidelity versions or representations of each echo reference. For example, points 303, 305, and 307 may correspond to copies or representations of the first non-local echo reference at increasing fidelity levels, where point 303 corresponds to the lowest-fidelity representation and point 307 corresponds to the highest-fidelity representation other than the full-fidelity replica. In some examples, point 303 may correspond to segmented power information of the first non-local echo reference. According to some examples, points 305 and 307 may correspond to a relatively more lossy compression and a relatively less lossy compression of the first non-local echo reference, respectively.
The fidelity of a copy or representation of an echo reference generally increases with the number of bits required for each such copy or representation. Thus, the fidelity of the copy or representation of the echo reference provides an indication of the tradeoff between network cost (due to the number of bits required for transmission) and expected echo management performance (since performance should increase as fidelity increases). Note that the straight lines connecting the points in fig. 3C represent only one of many different possible trajectories, in part because the incremental change from one echo reference to the next depends on which echo reference is selected next, and in part because the relationship between expected echo management performance and fidelity may not be linear.
Fig. 4 presents a block diagram illustrating components of an echo reference orchestrator according to one example. As with the other figures provided herein, the types, numbers, and arrangements of elements shown in fig. 4 are provided by way of example only. Other embodiments may include more, fewer, and/or different types, numbers, and/or arrangements of elements. For example, some implementations may be configured to send and/or receive either "original" echo references (which may be full-fidelity copies of audio reproduced on an audio device) or low-fidelity versions or representations of audio reproduced on an audio device (such as downsampled versions, versions produced by lossy compression, or segmented power information corresponding to the audio), but not both the original and low-fidelity versions.
In this example, the echo reference orchestrator 302A is an example of the echo reference orchestrator 302A of fig. 3A and is implemented by an example of the control system 60a of fig. 3A. According to this example, the elements of fig. 4 are as follows:
220A: local echo references for playback and cancellation;
221A: a locally generated copy of a non-local echo reference being played by another audio device of the audio environment;
302A: an echo reference composer configured to rank and select a set of one or more echo references;
310A: one or more decoded echo references received by audio device 110A from one or more other devices in the audio environment;
311A: a request, transmitted over a local network, for echo references from one or more other devices of the audio environment;
312A: metadata, which may be or may include metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmix matrix, and/or a loudspeaker activation matrix;
313A: in this example, a set of one or more echo references selected by the echo reference orchestrator 302A and sent to the MC-EMS 203A;
316A: raw (full-fidelity) echo references received by device 110A from one or more other devices of the audio environment;
317A: a low-fidelity (e.g., coded) version of an echo reference received by device 110A from one or more other devices of the audio environment;
318A: an audio environmental noise estimate;
350A: one or more indicators indicative of the current performance of the MC-EMS 203A, which may be or may include adaptive filter coefficient data or other AEC statistics, speech-to-echo ratio (SER) data, and the like;
401A: an echo reference importance estimator configured to estimate an expected importance of each echo reference and, in this example, generate a corresponding importance metric 420A;
402: an echo reference selector configured, in this example, to select the echo reference set 313A based at least in part on the current listening objective (as shown at 421A), the cost of each echo reference (as shown at 422A), the current state/performance of the EMS (as shown at 350A), and the estimated importance of each candidate echo reference (as shown by importance metric 420A);
403A: a cost estimation module configured to determine cost(s) (e.g., computational and/or network cost) to include the echo references in echo reference set 313A;
404A: an optional module for determining or estimating the current listening objective of the audio device 110A;
405A: a module configured to implement one or more MC-EMS performance models, which in some examples may generate data such as that shown in fig. 3B or fig. 3C;
420A: importance metrics 420A generated by the echo reference importance estimator 401A;
421A: information indicating the current listening objective;
422A: information indicating cost(s) to include the echo references in echo reference set 313A; and
423A: information generated by MC-EMS performance model 405A, which in some examples may be or include data such as that shown in FIG. 3B or FIG. 3C.
The echo reference importance estimator 401A may function in various ways, depending on the particular implementation. Various examples are provided in the present disclosure. In some examples, the echo reference importance estimator 401A may be configured to make an importance estimate for each echo reference of the plurality of echo references (e.g., for the local echo reference 220A and the non-local echo reference 221A). Making the importance estimate may involve determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment (e.g., echo mitigation by MC-EMS 203A of audio device 110A).
In this example, making the importance estimate involves determining an importance metric 420A. The importance metric 420A may be based at least in part on one or more characteristics of each echo reference, such as level, uniqueness, temporal persistence, audibility, or one or more combinations thereof. In some examples, the importance metric may be based at least in part on metadata (e.g., metadata 312A), which may include metadata corresponding to the audio device layout, loudspeaker metadata (e.g., sound pressure level (SPL) ratings, frequency ranges, whether the loudspeaker is an upward-firing loudspeaker, etc.), metadata corresponding to the received audio data (e.g., location metadata, metadata indicating a human voice or other speech, etc.), an upmix matrix, a loudspeaker activation matrix, or one or more combinations thereof. In some examples, the echo reference importance estimator 401A may provide the importance metric 420A to the MC-EMS performance model 405A, as indicated by the dashed arrow 420A.
According to this example, the importance metric 420A is based at least in part on the current listening objective, as indicated by information 421A. As described in more detail below, the current listening objective may significantly change the manner in which factors such as level, uniqueness, temporal persistence, and audibility are evaluated. For example, the importance analysis during a telephone call may be quite different from the importance analysis while waiting for a wake word.
In this example, the importance metric 420A is based at least in part on the current ambient noise estimate 318A, the indicator(s) 350A indicative of the current performance of the MC-EMS 203A, the information 423A generated by the MC-EMS performance model 405A, or one or more combinations thereof. In some implementations, the echo reference importance estimator 401A may determine that, if the room noise level is relatively high (as indicated by the current ambient noise estimate 318A), adding an echo reference is unlikely to significantly improve echo mitigation. As described above, the information 423A may correspond to the type of information described above with reference to figs. 3B and 3C, which may provide a direct correlation between the use of echo references and the expected performance increase of the MC-EMS 203A. As described in more detail below, the performance of the EMS may be based in part on the robustness of the EMS when disturbed by noise in the audio environment.
According to this embodiment, the echo reference selector 402 selects a set of one or more echo references based at least in part on: one or more indicators 350A indicating the current performance of the MC-EMS 203A, the importance metric 420A, the current listening objective 421A, information 422A indicating the cost(s) of including echo references in echo reference set 313A, and information 423A generated by the MC-EMS performance model 405A. Some detailed examples of how the echo reference selector 402 may select echo references are provided below.
In this example, the cost estimation module 403A is configured to determine the computational and/or network cost of including an echo reference in echo reference set 313A. The computational cost may, for example, include the additional computational cost incurred by the MC-EMS 203A in using a particular echo reference. This computational cost may in turn depend on the number of bits required to represent the echo reference. In some examples, the computational cost may include the computational cost of a lossy echo reference encoding process and/or the computational cost of a corresponding echo reference decoding process. Determining the network cost may involve determining the amount of data required to transmit a complete replica of the echo reference, or a copy or representation of the echo reference, across a local data network (e.g., a local wireless data network).
In some examples, the echo reference selection block 402A may generate and transmit a request 311A for another device in the audio environment to send one or more echo references to it over a network. (Element 314A of fig. 3A indicates one or more echo references received by audio device 110A, which in some instances may have been sent in response to request 311A.) In some examples, the request 311A may specify the fidelity of the requested echo reference, e.g., whether an "original" (full-fidelity) copy of the echo reference should be sent, whether an encoded version should be sent and, if so, whether a relatively more or relatively less lossy compression algorithm should be applied, whether segmented power information corresponding to the echo reference should be sent, etc.
One may note that a request for an encoded echo reference not only introduces network costs due to the sending of the request and the reference, but also increases the computational cost of the responding device(s) (e.g., smart home hub 105 or one or more of audio devices 110B-110D), which must encode the reference, as well as the computational cost to audio device 110A of decoding the received reference. However, the encoding cost may be a one-time cost. Thus, sending a request for an encoded reference over a network from one audio device to another changes the potential performance/cost tradeoffs performed in the other devices (e.g., in audio devices 110C and 110D).
In some implementations, one or more blocks of the echo reference orchestrator 302A may be performed by an orchestration device (e.g., the smart home hub 105 or one of the audio devices 110A-110D). According to some such embodiments, at least some functions of the echo reference importance estimator 401A and/or the echo reference selection block 402A may be performed by an orchestration device. Some such implementations may be capable of determining a cost/benefit tradeoff for the overall system in view of performance enhancements for all instances of the MC-EMS in an audio environment, overall computing requirements for all instances of the MC-EMS, overall requirements for the local network, and/or overall computing requirements for all encoders and decoders.
Examples of various indices and components
Importance metric
Briefly, an importance metric (which may be referred to herein as "importance" or "I") may be a metric of the expected improvement in EMS performance due to the inclusion of a particular echo reference. In some embodiments, the importance may depend on the current state of the EMS, in particular on the set of echo references already in use and the fidelity levels at which they are being received. The importance may be obtained on different time scales, depending on the particular implementation. At one extreme, importance may be evaluated on a frame-by-frame basis (e.g., an importance value for each frame). In other examples, the importance may be a constant value for the duration of a content segment, or for as long as a particular configuration of the audio devices is in use. The configuration of the audio devices may correspond to audio device locations and/or audio device orientations.
Thus, the importance metrics may be calculated on various time scales depending on the particular implementation, for example:
analyzing the current audio content in real time, e.g., in response to events in the audio environment (e.g., an incoming call), etc.;
on a longer time scale, e.g., track by track, where a track corresponds to a content segment such as a song or other piece of musical content lasting, e.g., on the order of a few minutes; or
Only once, for example, when the audio system is initially configured or reconfigured.
The decision about which echo references to select for echo management purposes may be made on a time scale similar to (or slower than) the time scale on which the importance metric is evaluated. For example, a device or system may estimate importance every 30 seconds and make decisions about changing the selected echo references every few minutes.
According to some examples, the control system may be configured to determine an importance matrix, which may include all importance information for the current audio device system. In some such examples, the importance matrix may have dimension N×M, including an entry for each audio device and an entry for each potential echo reference channel. In some such examples, N represents the number of audio devices and M represents the number of potential echo references. This type of importance matrix is not always square, as some audio devices may play back more than one channel.
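By way of a non-normative illustration, an importance matrix of this kind might be represented as follows in Python. The dimensions, helper function, and ranking step are hypothetical; a real implementation would populate the entries from the LUPA estimates described below.

```python
import numpy as np

# Hypothetical sketch: an N x M importance matrix for N audio devices and
# M potential echo reference channels. Entry [n, m] holds the estimated
# importance of echo reference m to the echo management system (EMS)
# running on device n.

N_DEVICES = 4        # audio devices in the environment (illustrative)
M_REFERENCES = 6     # potential echo references; some devices play >1 channel

importance = np.zeros((N_DEVICES, M_REFERENCES))

def update_importance(importance, device_idx, reference_idx, metric):
    """Store a freshly computed importance metric for one (device, reference) pair."""
    importance[device_idx, reference_idx] = metric
    return importance

# Each device's EMS could then rank candidate references by its own row:
ranked = np.argsort(-importance[0])  # reference indices, most important first
```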
In some implementations, the importance metric I may be based on one or more of the following:
L: the level of the echo reference;
U: the uniqueness of the echo reference;
P: the temporal persistence of the echo reference; and/or
A: the audibility of the device rendering the echo reference.
As used herein, the acronym "LUPA" generally refers to echo reference characteristics from which a measure of importance may be determined, including but not limited to one or more of L, U, P, and/or A.
L or "horizontal" aspects
This aspect describes the level or loudness of the echo reference. Other conditions being equal, the louder the playback signal, the greater its impact on EMS performance. As used herein, the term "level" refers to the level within the digital representation of the audio signal, and not necessarily to the actual sound pressure level of the audio signal after reproduction via a loudspeaker. In some examples, the loudness of a single channel of the echo reference may be based on a root mean square (RMS) metric or an LKFS (K-weighted loudness relative to full scale) metric. Such a metric is easily calculated in real time from the echo reference, or may be present as metadata in the bitstream. According to some embodiments, L may be determined from a volume setting, such as an audio system volume setting or a volume setting within a media application.
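As one illustration, the level of an echo reference frame might be computed as a plain RMS value in the digital domain. The following Python sketch makes this concrete; an LKFS measurement would additionally require K-weighting, which is omitted here, and all names and values are illustrative.

```python
import numpy as np

def reference_level_rms(frame: np.ndarray) -> float:
    """RMS level of one frame of an echo reference, in the digital domain."""
    return float(np.sqrt(np.mean(frame ** 2)))

def reference_level_db(frame: np.ndarray, eps: float = 1e-12) -> float:
    """RMS level expressed in dB relative to digital full scale."""
    return 20.0 * np.log10(reference_level_rms(frame) + eps)

# Example: a quiet frame versus a loud frame (480 samples, e.g., 10 ms at 48 kHz)
rng = np.random.default_rng(0)
quiet = 0.01 * rng.standard_normal(480)
loud = 0.5 * rng.standard_normal(480)
print(reference_level_db(quiet), reference_level_db(loud))
```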
U or "uniqueness" aspects
The uniqueness aspect aims to capture the amount of new information a particular echo reference provides about the overall audio presentation. From a statistical perspective, multi-channel audio presentations often contain redundancy across channels. For example, such redundancy may occur because musical instruments and other sound sources are duplicated in the left and right channels of a room, or because a signal is panned and thus duplicated across multiple active loudspeakers simultaneously. Although this scenario leaves the EMS with a poorly determined (non-unique) estimation problem, in which the echo filter may attribute observations to multiple echo paths, some benefit and higher performance may still be observed in practice.
U may be calculated or estimated in various ways. In some examples, U may be based at least in part on correlation coefficients between echo references. In one such example, U may be estimated as follows:
wherein the subscript "r" corresponds to the particular echo reference being evaluated, N represents the total number of audio devices in the audio environment, n represents a single audio device, M represents the total number of potential echo references in the audio environment, and m represents a single echo reference.
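Because the equation itself does not survive legibly in this text, the following Python sketch shows one plausible correlation-based uniqueness estimate of the general kind described here: the uniqueness of reference r falls as its correlation with any other reference rises. The specific formula (one minus the largest absolute correlation coefficient) is an assumption, not necessarily the equation of this example.

```python
import numpy as np

def uniqueness(references: np.ndarray, r: int) -> float:
    """
    Correlation-based uniqueness of echo reference r (an assumed formula).

    references: array of shape (M, num_samples), one row per echo reference.
    Returns a value near 1 when reference r is uncorrelated with all other
    references, and near 0 when it duplicates another reference.
    """
    M = references.shape[0]
    others = [m for m in range(M) if m != r]
    # Normalized cross-correlation coefficient against every other reference
    corrs = [abs(np.corrcoef(references[r], references[m])[0, 1]) for m in others]
    return 1.0 - max(corrs) if corrs else 1.0
```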
Alternatively or additionally, in some examples, U may be based at least in part on decomposing the audio signal to find redundancy. Some such examples may involve instantaneous frequency estimation, fundamental frequency (F0) estimation, spectral inversion, and/or non-negative matrix factorization (NMF).
According to some examples, U may be based at least in part on data for matrix decoding. Matrix decoding is an audio technique in which a small number of discrete audio channels (e.g., 2) are decoded into a larger number of channels (e.g., 4 or 5) upon playback. The channels are typically arranged for transmission or recording by an encoder, and decoded by a decoder for playback. Matrix decoding allows multichannel audio (e.g., surround sound) to be encoded into a stereo signal, for playback as stereo on a stereo device and as surround sound on a surround sound device. In one such example, if a Dolby 5.1 system is receiving a stereo audio data stream, a static upmix matrix may be applied to the stereo audio data to provide correctly rendered audio for each loudspeaker in the Dolby 5.1 system. According to some examples, U may be based at least in part on the coefficients of an upmix or downmix matrix that distributes audio to each loudspeaker of the audio environment (e.g., to each of the audio devices 110A-110D).
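As a hedged illustration of how upmix coefficients can serve both rendering and uniqueness estimation, the sketch below applies a static 2-to-5-channel upmix matrix. The coefficients are invented for illustration and are not a Dolby specification.

```python
import numpy as np

# Illustrative static 2-to-5-channel upmix. Rows are output loudspeaker
# feeds [L, R, C, Ls, Rs]; columns are the stereo inputs [left, right].
UPMIX = np.array([
    [1.0, 0.0],    # left
    [0.0, 1.0],    # right
    [0.5, 0.5],    # center: equal blend of left and right
    [0.7, -0.7],   # left surround: difference signal
    [-0.7, 0.7],   # right surround: difference signal
])

def upmix_stereo(stereo: np.ndarray) -> np.ndarray:
    """stereo: shape (2, num_samples) -> returns (5, num_samples) speaker feeds."""
    return UPMIX @ stereo

# The same coefficients can feed the uniqueness estimate: a reference whose
# row of the upmix matrix is (close to) a linear combination of other rows
# carries little unique information.
```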
In some examples, U may be based at least in part on a standard loudspeaker layout specification (e.g., Dolby 5.1, Dolby 7.1, etc.) used in the audio environment. Some such examples may exploit the ways in which media content is traditionally mixed and rendered for such specified loudspeaker layouts. For example, in Dolby 5.1 or Dolby 7.1 systems, artists typically place vocals in the center channel rather than in the surround channels. As described above, audio corresponding to musical instruments and other sound sources is typically reproduced in the channels on the left and right sides of a room. In some instances, vocals, dialogue, instruments, etc. may be identified via metadata received with the corresponding audio data.
P or "persistence" aspects
The persistence metric aims to capture the fact that different types of playback media may have widely varying temporal persistence, with different types of content having different amounts of silence and loudspeaker activation. A spectrally dense, continuous content stream, such as the audio output of music or a video game console, may have a high level of temporal persistence, while a podcast may have a lower level of temporal persistence. The temporal persistence of infrequent system notifications will be very low. Depending on the specific listening task at hand, echo references corresponding to media with lower persistence may be less important to the EMS. For example, occasional system notifications are less likely to collide with wake words or spoken requests, so managing their echo is relatively less important.
The following are examples of metrics that may be used to measure or estimate persistence (a minimal sketch follows the list):
the percentage of time in a recent history window that the playback signal is above a specific digital loudness threshold;
metadata tags or media classification indications indicating that the content corresponds to music, broadcast content, podcasts or system sounds; and/or
the percentage of time during a recent history window that the playback signal is within the typical frequency range of human voice (e.g., 100 Hz to 3 kHz).
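A minimal sketch of the first metric in the list above follows, assuming per-frame levels in dB are already available; the threshold and window are illustrative choices, not values from this disclosure.

```python
import numpy as np

def persistence(frames_db: np.ndarray, threshold_db: float = -50.0) -> float:
    """
    Fraction of recent frames in which the playback signal exceeds a digital
    loudness threshold. frames_db: per-frame levels (dB) over a history window.
    """
    return float(np.mean(frames_db > threshold_db))

# A dense music stream might score near 1.0; sparse system notifications near 0.
```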
According to some examples, the audio content type may affect the estimation of L, U, and/or P. For example, knowing that the audio content is stereo music may allow ranking all echo references using only the channel assignments described above. Alternatively, if the control system does not analyze the audio content but relies on channel assignments, knowing that the audio content is Dolby Atmos may alter the default L, U, and/or P assumptions.
A or "audibility" aspect
The audibility metric addresses the fact that audio devices have different playback characteristics, and that the distances between audio devices may differ in any given audio environment. The following are examples of metrics that may be used to measure or estimate the audibility of an audio device (a minimal sketch follows the list):
direct measurement of audibility of an audio device;
reference to a data structure including characteristics of one or more loudspeakers of the audio device, such as rated SPL, frequency response, and directivity (e.g., whether the loudspeaker is omnidirectional, forward-firing, upward-firing, etc.);
based on an estimate of distance from the audio device; and/or
Any combination of the above.
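As an illustration of combining the second and third items above, the following sketch derives a coarse audibility estimate from a rated SPL and a distance estimate, assuming a free-field inverse-square model (6 dB per doubling of distance). Real rooms, directivity, and frequency response would complicate this considerably; the function and values are hypothetical.

```python
import math

def audibility_estimate(rated_spl_db: float, distance_m: float) -> float:
    """
    Coarse audibility estimate for a loudspeaker at a given distance,
    using a free-field inverse-square model.
    rated_spl_db: rated SPL at 1 m, taken from loudspeaker metadata.
    """
    if distance_m <= 0:
        return rated_spl_db
    return rated_spl_db - 20.0 * math.log10(distance_m)

# Example: an 85 dB (at 1 m) loudspeaker heard from 4 m away
print(audibility_estimate(85.0, 4.0))  # ~73 dB
```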
Other factors may be evaluated for estimating importance and, in some instances, for determining an importance metric.
Listening objective
The listening objective may define the context and the desired performance characteristics of the EMS. In some examples, the listening objective may modify the parameters and/or scope of the LUPA evaluation. The following discussion considers three potential contexts in which the listening objective changes. These contexts illustrate how probability and criticality can affect LUPA.
1. Episodic (e.g., waiting to detect a wake word)
There is no immediate urgency when waiting for an utterance: the probability of the user speaking a wake word is generally considered to be the same for all future time intervals. Furthermore, the wake word detector may be the most robust element of a voice assistant, so the effect of echo leakage is less critical.
2. Command
The likelihood that a person speaks a command immediately after speaking a wake word is very high. Therefore, the probability of collision with echo is high in the near future. Furthermore, because the command recognition module may be relatively less robust than the wake word detector, the criticality of echo leakage may often be high.
3. Communication (e.g., a voice call)
During a voice call, the participants (the person or persons in the audio environment and the remote person or persons) are certain to talk with one another. In other words, the probability of echo colliding with the user's voice is essentially 1. However, since the person or persons at the far end are human and can cope well with background noise, the criticality is low: they are relatively unlikely to be troubled by echo leakage.
The manner in which LUPA is evaluated may vary across these different listening objective contexts, as in the following examples.
1. Episodic
There may be no temporal differentiation, because the probability of a wake word being spoken is considered the same for all future time intervals. Thus, the time frame over which the control system evaluates LUPA may be quite long, in order to obtain better estimates of these parameters. In some such examples, the time interval over which the control system evaluates LUPA may be set to look relatively far into the future (e.g., within a time frame of a few minutes).
2. Command
A command is likely to be spoken in the time interval immediately after the wake word. Thus, after detecting the wake word, in some embodiments LUPA may be evaluated on a much shorter time scale (e.g., on the order of a few seconds) than in the episodic context. In some examples, during this time interval, a reference that is temporally sparse but has content playing within the next few seconds after wake word detection will be considered more important, because of the high likelihood of collisions.
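The following Python sketch summarizes how a control system might map these listening objectives to LUPA evaluation windows. The enum names and the specific window durations, particularly for the communication context, which the text above does not quantify, are assumptions.

```python
from enum import Enum, auto

class ListeningObjective(Enum):
    EPISODIC = auto()  # idly waiting for a wake word
    COMMAND = auto()   # wake word just detected; a command is imminent
    CALL = auto()      # full-duplex communication

def lupa_window_seconds(objective: ListeningObjective) -> float:
    """Time scale over which to evaluate LUPA, per the contexts above."""
    if objective is ListeningObjective.COMMAND:
        return 3.0     # look only a few seconds ahead: collisions are likely now
    if objective is ListeningObjective.CALL:
        return 10.0    # assumed value: sustained double-talk risk during a call
    return 120.0       # episodic: a long window yields a more stable estimate
```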
Fig. 5A is a flow chart summarizing one example of a disclosed method. As with other methods described herein, the blocks of method 500 need not be performed in the order indicated. In some examples, one or more blocks may be performed concurrently. Moreover, such methods may include more or fewer blocks than shown and/or described. For example, some implementations may not include block 501.
In this example, method 500 is an echo reference selection method. The blocks of method 500 may be performed, for example, by a control system (such as control system 60a of fig. 2A or 3A). In some examples, blocks of method 500 may be performed by an echo reference selector module (such as echo reference selector 402A described above with reference to fig. 4).
The reference selection method of fig. 5A is an example of what may be referred to herein as a "greedy" echo reference selection method, which evaluates cost and expected performance improvement only at the current operating point of the MC-EMS (in other words, given the set of echo references, including the selected echo references, that the MC-EMS is currently using), and evaluates the result of adding each additional echo reference, e.g., in descending order of importance. Accordingly, this example involves a process of determining whether to add a new echo reference. In some implementations, the echo references evaluated in method 500 may already have been ranked according to estimated importance (e.g., by echo reference importance estimator 401A). More complex techniques (such as tree search methods) may find solutions that are closer to optimal in terms of cost and performance. Alternative examples may involve other search and/or optimization routines, including brute-force methods. Some alternative implementations may involve determining whether to discard a previously selected echo reference.
In this example, block 501 involves determining whether the current performance level of the EMS is greater than or equal to a desired performance level. If so, the process terminates (block 510). However, if it is determined that the current performance level is below the desired performance level, in this example the process continues to block 502. According to this example, the determination of block 501 is based at least in part on one or more metrics indicative of the current performance of the EMS, such as adaptive filter coefficient data or other AEC statistics, speech-to-echo ratio (SER) data, and the like. In some examples where the determination of block 501 is made by the echo reference orchestrator 302A, this determination may be based at least in part on one or more metrics 350A from the MC-EMS 203A. As noted above, some embodiments may not include block 501.
According to this example, block 502 involves ranking the remaining unselected echo references by importance and estimating the potential EMS performance boost obtained by including the most important echo reference not yet used by the EMS. In some examples where the process of block 502 is performed by the echo reference orchestrator 302A, the process may be based at least in part on information 423A generated by the MC-EMS performance model 405A, which in some examples may be or include data such as that shown in fig. 3B or fig. 3C. In some implementations, the ranking and prediction processes described above may be performed at an earlier stage of the method 500, for example when evaluating a previous echo reference. In some examples, the ranking and prediction processes may be performed before method 500 is performed. In some embodiments where the ranking and prediction processes have been previously performed, block 502 may simply involve selecting the highest-ranked unselected echo reference determined by such a previous process.
In this example, block 503 involves comparing the performance and cost of adding the echo reference selected in block 502. In some examples where the process of block 503 is performed by echo reference orchestrator 302A, block 503 may be based at least in part on information 422A from cost estimation module 403A indicating the cost(s) to include the echo reference in echo reference set 313A.
Because performance and cost may be variables with different ranges and/or domains, comparing them directly may be challenging. Thus, in some embodiments, the evaluation of block 503 may be facilitated by mapping the performance and cost variables to similar scales (such as a range between predefined minimum and maximum values).
In some embodiments, the estimated cost of adding an echo reference may simply be set to zero if adding the echo reference does not cause a predetermined network bandwidth and/or computational cost budget to be exceeded. In some such examples, the estimated cost of adding an echo reference may be set to infinity if adding the echo reference would cause the predetermined network bandwidth and/or computational cost budget to be exceeded. This approach has the benefit of simplicity and efficiency: the control system can simply add the maximum number of echo references allowed by the predetermined network bandwidth and/or computational cost budget.
According to some examples, if the estimated performance improvement corresponding to adding an echo reference is not above a predetermined threshold (e.g., 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, etc.), the estimated performance improvement may be set to zero. Such an approach may prevent network bandwidth and/or computational overhead from being consumed by echo references that add only a negligible performance boost. Some detailed alternative examples of cost determination are described below.
In this example, block 504 involves determining whether a new echo reference is to be added given the performance/cost evaluation of block 503. In some examples, blocks 503 and 504 may be combined into a single block. According to this example, block 504 involves determining whether the cost of adding the evaluated echo reference will be less than the EMS performance boost estimated to be caused by adding the echo reference. In this example, if the estimated cost is not less than the estimated performance boost, then the process continues to block 511 and the method 500 terminates. However, in this embodiment, if the estimated cost is less than the estimated performance improvement, then the process continues to block 505.
According to this example, block 505 involves adding the new echo reference to the selected echo reference set. In some examples, block 505 may involve notifying the renderer (e.g., renderer 201A) to output the relevant echo reference. According to some examples, block 505 may involve sending an echo reference over a local network, or sending a request 311A for another device to send the echo reference over the local network.
The echo references evaluated in method 500 may be local echo references or non-local echo references, which may be determined locally (e.g., by a local renderer as described above) or received over a local network. Thus, the cost estimation of some echo references may involve evaluating both computational and network costs.
According to some examples, to evaluate the next echo reference after block 505, the control system may simply reset the selected and unselected echo references and return to an earlier block of fig. 5A, such as block 501, 502, or 503. However, more complex methods may also involve re-evaluating the references already selected, e.g., ranking all of the selected references and deciding whether to discard the echo reference with the lowest estimated importance.
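A condensed sketch of the greedy loop of method 500 follows, assuming that performance gains and costs have already been mapped to a common scale as discussed with reference to block 503. The function names and stopping logic details are illustrative, not a definitive implementation.

```python
def greedy_select(candidates, ems_performance, desired_performance,
                  estimate_gain, estimate_cost):
    """
    Greedy echo reference selection in the spirit of method 500.

    candidates: references sorted by estimated importance, descending.
    estimate_gain(ref): expected EMS performance lift from adding ref.
    estimate_cost(ref): cost of adding ref, mapped to the same scale.
    """
    selected = []
    for ref in candidates:
        if ems_performance >= desired_performance:
            break                      # block 501: performance is good enough
        gain = estimate_gain(ref)      # block 502: lift from most important ref
        cost = estimate_cost(ref)      # block 503: cost on a comparable scale
        if cost >= gain:               # block 504: not worth adding; terminate
            break
        selected.append(ref)           # block 505: adopt the reference
        ems_performance += gain
    return selected
```

A method-550-style extension would, instead of breaking at block 504, try progressively lower-fidelity versions of the same reference before moving on.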
Alternative echo reference forms
The echo references may be transmitted in a number of forms or variants (or used locally within a device, such as the device that generated the original echo reference), which may alter the cost/benefit ratio of a particular echo reference. For example, transforming the echo reference into a segmented power form (in other words, determining the power in each of a plurality of frequency bands and transmitting segmented power information about the power in each band) can reduce the cost of transmitting the echo reference over the local network. However, the potential improvement achievable by an EMS using a low-fidelity variant of an echo reference is typically also lower. Any particular variant of an echo reference that is made available may be treated as a potential selection candidate. (A sketch of the segmented power form follows the list below.)
In some embodiments, an echo reference may take one of the forms listed below (with the first four arranged in descending order of estimated performance):
a full-fidelity (original, exact) echo reference, which incurs full computational cost and full network cost (if transmitted over the network);
a downsampled echo reference, whose computational and network costs are scaled down according to the downsampling factor, but which incurs the computational cost of the downsampling process;
an encoded echo reference generated via a lossy encoding process, whose network cost can be reduced according to the compression ratio of the encoding scheme, but which incurs encoding and decoding computational costs;
segmented power information corresponding to the echo reference, whose network cost can be significantly reduced because the number of bands can be much lower than the number of subbands of the full-fidelity echo reference, and whose computational cost can be significantly reduced because the cost of implementing segmented AES is much lower than the cost of implementing subband AEC; or
any other form that reduces fidelity in exchange for a reduction in cost, whether computational cost, network cost, or other costs such as memory.
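The segmented power form in the list above can be illustrated with a short sketch that collapses one frame of a full-fidelity reference to a handful of band powers. The FFT size and band count are arbitrary choices for illustration.

```python
import numpy as np

def banded_power(reference: np.ndarray, n_fft: int = 512, n_bands: int = 16):
    """
    Reduce an echo reference to segmented (banded) power information:
    per-frame power in a small number of frequency bands. This is far
    cheaper to transmit and to consume (segmented AES) than the full
    subband representation used by AEC.
    """
    spectrum = np.abs(np.fft.rfft(reference[:n_fft])) ** 2
    bands = np.array_split(spectrum, n_bands)
    return np.array([band.sum() for band in bands])

# One frame of a full-fidelity reference (512 samples) collapses from
# 257 complex bins to 16 real band powers.
rng = np.random.default_rng(1)
print(banded_power(rng.standard_normal(512)).shape)  # (16,)
```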
Fig. 5B is a flow chart summarizing another example of the disclosed methods. As with other methods described herein, the blocks of method 550 need not be performed in the order indicated. In some examples, one or more blocks may be performed concurrently. Moreover, such methods may include more or fewer blocks than shown and/or described.
The blocks of method 550 may be performed, for example, by a control system (such as control system 60a of fig. 2A or 3A). In some examples, blocks of method 550 may be performed by an echo reference selector module (such as echo reference selector 402A described above with reference to fig. 4).
The method 550 takes into account the fact that an echo reference need not be transmitted or used in full-fidelity form, but may instead be transmitted or used in one of the alternative partial-fidelity forms described above. Thus, in method 550, the evaluation of performance and cost does not reduce to a binary decision as to whether or not to use the full-fidelity form of the echo reference. Instead, method 550 involves determining whether to include one or more lower-fidelity versions of the echo reference, which may yield a smaller EMS performance improvement, but at a lower cost. Methods such as method 550 give the echo management system additional flexibility in the set of potential echo references it may use.
In this example, method 550 is an extension of the echo reference selection method 500 described above with reference to fig. 5A. Accordingly, blocks 501 (if included), 502, 503, 504, and 505 may be performed as described above with reference to fig. 5A, unless otherwise indicated below. Method 550 adds to method 500 a potential iteration loop including blocks 506 and 507. According to this example, if it is determined (here, in block 504) that the estimated cost of adding one version of the echo reference is not less than the estimated EMS performance boost, then block 506 determines whether another version of the echo reference exists. In some examples, the full-fidelity version of the echo reference may be evaluated before the lower-fidelity versions (if any are available). According to this embodiment, if it is determined in block 506 that another version of the echo reference is available, then in block 507 another version (e.g., the highest-fidelity version other than the full-fidelity version) is selected and evaluated in block 503.
Thus, method 550 involves evaluating lower-fidelity versions of an echo reference, if any are available. Such lower-fidelity versions may include downsampled versions of the echo reference, encoded versions generated via a lossy encoding process, and/or segmented power information corresponding to the echo reference.
Cost model
The "cost" of an echo reference refers to the resources required for echo management using the reference, whether AEC or AES is used. Some disclosed embodiments may involve estimating one or more of the following types of costs:
computational cost, which may be determined with reference to the limited amount of processing power available on one or more devices in the audio environment. The computational cost may refer to one or more of the following:
the cost required to perform echo management on a particular listening device using the reference, whether the reference is used in AEC or AES (note that AEC runs on bins or subbands, which are complex-valued, and requires many more CPU operations than AES, which runs on bands: the number of bands used by AES is smaller, and band powers are real rather than complex);
the cost required to encode or decode the echo reference when a coded version of the reference is used;
the cost required to segment the signal (in other words, transform the signal from a simple linear frequency domain representation to a segmented frequency domain representation); and/or
The cost required to generate the echo reference (e.g., by the renderer).
Network cost refers to the use of a limited amount of network resources, such as bandwidth available in a local network (e.g., a local wireless network in an audio environment) for sharing echo references between devices.
The total cost of a particular set of echo references may be determined as the sum of the costs of each echo reference in the set. Some disclosed examples involve combining network costs and computational costs. According to some examples, the total cost C_total may be determined as follows:

C_total = max( (1/R_comp) Σ_{m=1}^{M} C_m^comp , (1/R_network) Σ_{m=1}^{M} C_m^network )

In the above equation, R_comp represents the total amount of computational resources available for echo management, R_network represents the total amount of network resources available for echo management, C_m^comp represents the computational cost associated with using the mth reference, and C_m^network represents the network cost associated with using the mth reference (where a total of M references are used in the EMS). One may note that this definition implies

0 ≤ C_total ≤ 1,

and that C_total includes only the cost component that is closest to becoming limited by the available resources of the system.
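Under the stated assumption that the combined cost takes this max-of-normalized-sums form, the computation is straightforward; the following minimal Python sketch (all names and budget values hypothetical) illustrates it:

```python
def total_cost(comp_costs, net_costs, r_comp, r_network):
    """
    Combined cost of a candidate echo reference set, following the
    max-of-normalized-sums form above: only the resource closest to its
    limit determines the total.
    """
    comp_term = sum(comp_costs) / r_comp     # fraction of compute budget used
    net_term = sum(net_costs) / r_network    # fraction of network budget used
    return max(comp_term, net_term)

# Example: three references; compute is the binding resource here.
print(total_cost([0.2, 0.3, 0.1], [0.05, 0.2, 0.1], r_comp=1.0, r_network=1.0))
# ~0.6
```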
Performance
The "capabilities" of an Echo Management System (EMS) may refer to the following:
the amount of echo removed from the microphone feed, which may be measured as echo return loss enhancement (ERLE), expressed in decibels as the ratio of the echo power to the power of the residual signal. The metric may be normalized, for example according to an application-based metric such as the minimum ERLE required for an automatic speech recognition (ASR) processor to perform a wake word detection task, i.e., detecting a particular keyword spoken in the presence of echo;
robustness of EMS when disturbed by room noise sources, non-linearities of the local audio system, double talk etc.;
robustness of EMS when using echo references below full fidelity;
the ability of the EMS to track system changes, including the ability of the EMS to initially converge; and/or
The EMS's ability to track changes in rendered audio scenes. For example, this may refer to a shift of the echo reference covariance matrix and robustness of the EMS to non-stationary non-uniqueness issues.
Some examples may involve determining a single performance index P. Some such examples estimate robustness using ERLE together with adaptive filter coefficient data or other AEC statistics obtained from the EMS. According to some such examples, the performance robustness index P_Rob may be determined using the "microphone probability" extracted from the AEC, for example as follows:

P_Rob = 1 − M_prob

In the above equation, 0 ≤ P_Rob ≤ 1 and 0 ≤ M_prob ≤ 1, and M_prob represents the microphone probability, which is the proportion of the subband adaptive filters in the AEC that produce poor echo predictions, providing little or no echo cancellation in their respective subbands.
The performance of a wake word (WW) detector depends largely on the speech-to-echo ratio (SER), which can be increased by the ERLE of the EMS. When the SER is too low, the WW detector is more likely both to trigger falsely (false alarms) and to miss keywords uttered by the user (missed detections), because echo corrupts the microphone signal and reduces the accuracy of the system. The SER of the residual signal (e.g., residual signal 224A of fig. 2A) consumed by the ASR processor (e.g., speech processing block 240A of fig. 2A) is increased by the EMS in proportion to the ERLE of the EMS, thereby improving the performance of the WW detector.
Thus, some disclosed examples involve mapping a desired WW performance level to a nominal SER level which, in combination with knowledge of the typical playback levels of the devices in the system, allows the control system to map that desired WW performance level directly to a nominal ERLE. In some examples, the method may be extended to map the WW performance of the system to ERLE at various SER levels. In some such embodiments, input data spanning a range of SER values may be used to generate receiver operating characteristic (ROC) curves for a particular WW detector. Some examples involve selecting a particular false alarm rate (FAR) of interest and, at that particular FAR, treating the accuracy of the WW detector as a function of SER as the application-based metric. In some such examples,

Acc(SER_res) = ROC(SER_res, FAR_I)

In the above equation, Acc(SER_res) represents the accuracy of the WW detector as a function of SER_res, the SER of the residual signal output by the EMS. ROC() represents a set of ROC curves over multiple SER values, and FAR_I represents the false alarm rate of interest, which may be, for example, 3 per 24 hours or 1 per 10 hours. The accuracy Acc(SER_res) may be expressed as a percentage, or normalized such that it lies in the range of 0 to 1, which may be expressed as follows:
0 ≤ Acc(SER_res) ≤ 1

With knowledge of the playback capabilities of the audio devices in the audio environment, a typical SER value in a microphone signal (e.g., microphone signal 223A of fig. 2A) may be determined using, for example, the LUPA components for the actual echo level in combination with a typical speech level in the target audio environment, e.g., as follows:

SER_mic = Speech_pwr / Echo_pwr

In the above equation, Speech_pwr and Echo_pwr represent the expected baseline speech power level and echo power level, respectively, of the target audio environment. The EMS can improve SER_mic, in proportion to the ERLE, to the residual SER_res, for example as follows:

SER_res^dB = SER_mic^dB + ERLE^dB

In the above equation, the superscript dB indicates that the variable is expressed in decibels in this example. For completeness, some embodiments may define the ERLE of the EMS as follows:

ERLE^dB = 10·log10(Echo_pwr / Residual_echo_pwr)
using the foregoing equations, some embodiments may define EMS performance metrics based on WW applications as follows:
Wherein, the liquid crystal display device comprises a liquid crystal display device,representing the SER in the target environment. In some examples, a->May be a static default number, while in other examples,/-in>May be estimated as a function of, for example, one or more LUPA components. Some embodiments may involve defining the net performance index P as a vector containing each element, for example as follows:
P = [P_ww, P_Rob]
in some examples, one or more additional performance components may be added by increasing the size of the net performance vector. In some alternative examples, one or more additional performance components may be combined into a single scalar indicator by weighting them, for example, as follows:
P = (1 − K)·P_ww + K·P_Rob
in the above equation, K represents a weighting factor selected by the system designer, which is used to determine the degree of contribution of each component to net performance. Some alternative examples may use another approach, such as simply averaging the individual performance metrics. However, it may be advantageous to combine the individual performance metrics into a single scalar metric.
Cost and performance trade-off
When comparing the estimated cost of echo references with the estimated EMS performance enhancement, a method is needed to compare these two parameters, which are not generally in the same domain. One such method involves evaluating the cost estimate and the performance estimate separately and adopting the solution that has the lowest cost while meeting a predefined minimum performance criterion P_min. The predefined EMS performance criterion may be based, for example, on the requirements of a particular downstream application (e.g., providing phone calls, music playback, waiting for a WW, etc.).
For example, in embodiments where the application is WW detection, performance may be related to the WW performance index P_ww. In some such examples, there may be some minimum level of WW detector accuracy that is deemed sufficient (e.g., an 80%, 85%, 90%, or 95% level of WW detector accuracy, etc.), which will have a corresponding ERLE^dB according to the previous section. In some such examples, an EMS performance model (e.g., MC-EMS performance model 405A of fig. 4) may be used to estimate the ERLE of the EMS. Thus, if the goal is only to find the least costly solution (e.g., in terms of total cost C_total), such an embodiment does not require a direct cost and performance tradeoff.
As an alternative to meeting some minimum performance criterion, some embodiments may involve using both a performance metric P and a cost metric C. Some such examples may involve using a tradeoff parameter λ (e.g., a Lagrange multiplier) and expressing the cost/performance evaluation process as an optimization problem seeking to maximize some quantity, such as the variable F in the following expression:
F = P − λ·C_total
In the above equation, a relatively large value of F corresponds to a relatively large difference between the performance index P and the product of λ and the total cost C_total. The tradeoff parameter λ may be selected (e.g., by a system designer) to trade off cost against performance directly. An optimization algorithm may then be used to find a solution for the echo reference set used by the EMS, where the set of candidate echo references (which may include all available echo reference fidelity levels) determines the search space.
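As a sketch of the optimization just described, the following brute-force search evaluates F = P − λ·C_total over every subset of candidate references. The toy performance and cost models are placeholders, and a real system would more likely use the greedy or tree-search approaches mentioned earlier than exhaustive enumeration.

```python
from itertools import combinations

def best_reference_set(candidates, performance_model, cost_model, lam=0.5):
    """
    Brute-force search for the echo reference set maximizing
    F = P - lambda * C_total, over all subsets of the candidates
    (each candidate may be one fidelity variant of a reference).
    """
    best_set, best_f = (), float("-inf")
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            f = performance_model(subset) - lam * cost_model(subset)
            if f > best_f:
                best_set, best_f = subset, f
    return best_set, best_f

# Toy models: performance saturates with set size, cost is additive.
perf = lambda s: 1.0 - 0.5 ** len(s)
cost = lambda s: 0.2 * len(s)
print(best_reference_set(["ref_a", "ref_b", "ref_c"], perf, cost, lam=0.5))
# ~(('ref_a', 'ref_b', 'ref_c'), 0.575)
```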
FIG. 6 is a flow chart summarizing one example of a disclosed method. As with other methods described herein, the blocks of method 600 need not be performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described. In some examples, two or more blocks may be performed simultaneously. In this example, method 600 is an audio processing method.
The method 600 may be performed by an apparatus or system, such as the apparatus 50 shown in fig. 1A and described above. In some examples, blocks of method 600 may be performed by one or more devices within an audio environment, e.g., by an audio system controller (e.g., a device referred to herein as a smart home hub) or by another component of an audio system, such as a smart speaker, a television control module, a laptop computer, a mobile device (e.g., a cellular telephone), etc. In some implementations, the audio environment may include one or more rooms of a home environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, and so on. However, in alternative embodiments, at least some blocks of method 600 may be performed by a device (e.g., a server) implementing a cloud-based service.
In this embodiment, block 605 involves obtaining a plurality of echo references by a control system. In this example, the plurality of echo references includes at least one echo reference for each of a plurality of audio devices in the audio environment. Here, each echo reference corresponds to audio data played back by one or more loudspeakers of one of the plurality of audio devices.
In this example, block 610 involves making, by the control system, an importance estimate for each of a plurality of echo references. According to this example, making the importance estimate involves determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment. In this example, the at least one echo management system includes an Acoustic Echo Canceller (AEC) and/or an Acoustic Echo Suppressor (AES).
In this embodiment, block 615 involves selecting, by the control system and based at least in part on the importance estimate, one or more selected echo references. In this example, block 620 involves providing, by the control system, one or more selected echo references to at least one echo management system. In some implementations, the method 600 may involve causing at least one echo management system to cancel or suppress echo based at least in part on one or more selected echo references.
In some examples, obtaining the plurality of echo references may involve receiving a content stream including audio data and determining one or more of the plurality of echo references based on the audio data. Some examples are described above with reference to renderer 201A of fig. 2A.
In some implementations, the control system may include an audio device control system of an audio device of a plurality of audio devices in an audio environment. In some such examples, the method may involve rendering, by the audio device control system, audio data for reproduction on the audio device, thereby producing a local speaker feed signal. In some such examples, the method may involve determining a local echo reference corresponding to a local speaker feed signal.
In some examples, obtaining the plurality of echo references may involve determining one or more non-local echo references based on the audio data. For example, each non-local echo reference may correspond to a non-local speaker feed for playback on another audio device of the audio environment.
According to some examples, obtaining the plurality of echo references may involve receiving one or more non-local echo references. For example, each non-local echo reference may correspond to a non-local speaker feed for playback on another audio device of the audio environment. In some examples, receiving the one or more non-local echo references may involve receiving them from one or more other audio devices of the audio environment. In some examples, receiving the one or more non-local echo references may involve receiving each of them from a single other device of the audio environment.
In some examples, the method may involve cost determination. According to some such examples, the cost determination may involve determining a cost of at least one of the plurality of echo references. In some such examples, selecting one or more selected echo references may be based at least in part on a cost determination. According to some such examples, the cost determination may be based at least in part on network bandwidth required for transmitting the at least one echo reference, coding calculation requirements for coding the at least one echo reference, decoding calculation requirements for decoding the at least one echo reference, echo management system calculation requirements for using the at least one echo reference by the echo management system, or one or more combinations thereof. In some examples, the cost determination may be based at least in part on a full fidelity replica of the at least one echo reference in the time or frequency domain, a downsampled version of the at least one echo reference, lossy compression of the at least one echo reference, segmented power information of the at least one echo reference, or one or more combinations thereof. In some examples, the cost determination may be based at least in part on a method of less compressing a relatively more important echo reference than a relatively less important echo reference.
According to some examples, the method may involve determining a current echo management system performance level. In some such examples, selecting one or more selected echo references may be based at least in part on a current echo management system performance level.
In some examples, making the importance estimate may involve determining an importance metric for the corresponding echo reference. In some examples, determining the importance metric may involve determining a level of the corresponding echo reference, determining a uniqueness of the corresponding echo reference, determining a temporal persistence of the corresponding echo reference, determining an audibility of the corresponding echo reference, or one or more combinations thereof. According to some examples, determining the importance metric may be based at least in part on metadata corresponding to the audio device layout, loudspeaker metadata, metadata corresponding to the received audio data, an upmix matrix, a loudspeaker activation matrix, or one or more combinations thereof. In some examples, determining the importance metric may be based at least in part on a current listening objective, a current ambient noise estimate, an estimate of the current performance of the at least one echo management system, or one or more combinations thereof.
Fig. 7 shows an example of a plan view of an audio environment, which in this example is a living space. As with the other figures provided herein, the types and numbers of elements shown in fig. 7 are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements.
According to this example, the environment 700 includes a living room 710 at the upper left, a kitchen 715 at the lower center, and a bedroom 722 at the lower right. The boxes and circles distributed across the living space represent a set of loudspeakers 705a-705h, at least some of which may be smart loudspeakers in some embodiments, placed in locations convenient to the space but not following any standard prescribed layout (arbitrarily placed). In some examples, the television 730 may be configured to at least partially implement one or more of the disclosed embodiments. In this example, the environment 700 includes cameras 711a-711e distributed throughout the environment. In some implementations, one or more smart audio devices in the environment 700 may also include one or more cameras. The one or more smart audio devices may be single-purpose audio devices or virtual assistants. In some such examples, one or more cameras of the optional sensor system 130 may reside in or on the television 730, in a mobile phone, or in a smart speaker (e.g., one or more of the loudspeakers 705b, 705d, 705e, or 705h). Although cameras 711a-711e are not shown in every depiction of the audio environment presented in this disclosure, in some implementations each audio environment may nevertheless include one or more cameras.
Aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer-readable medium (e.g., disk) storing code for implementing one or more examples of the disclosed methods or steps thereof. For example, some disclosed systems may be or include a programmable general purpose processor, digital signal processor, or microprocessor programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including embodiments of the disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, memory, and a processing subsystem programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.
Some embodiments may be implemented as a configurable (e.g., programmable) Digital Signal Processor (DSP) that is configured (e.g., programmed and/or otherwise configured) to perform the required processing on the audio signal(s), including execution of one or more examples of the disclosed methods. Alternatively, embodiments of the disclosed systems (or elements thereof) may be implemented as a general-purpose processor (e.g., a Personal Computer (PC) or other computer system or microprocessor, which may include an input device and memory) programmed and/or otherwise configured with software or firmware to perform any of a variety of operations, including one or more examples of the disclosed methods. Alternatively, elements of some embodiments of the inventive system are implemented as a general-purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system also includes other elements (e.g., one or more microphones and/or one or more loudspeakers). A general-purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or keyboard), a memory, and a display device.
Another aspect of the disclosure is a computer-readable medium (e.g., a disk or other tangible storage medium) storing code for performing one or more examples of the disclosed methods or steps thereof (e.g., an encoder executable to perform one or more examples of the disclosed methods or steps thereof).
While specific embodiments of, and applications for, the present disclosure have been described herein, it will be apparent to those of ordinary skill in the art that many more modifications than mentioned herein are possible without departing from the scope of the disclosure described and claimed herein. It is to be understood that while certain forms of the disclosure have been illustrated and described, the disclosure is not to be limited to the specific embodiments described and illustrated or to the specific methods described.
Aspects of the invention may be understood from the enumerated example embodiments (EEEs) below; an end-to-end sketch of the method of EEE 1 follows the list:
1. an audio processing method, comprising:
obtaining, by a control system, a plurality of echo references, the plurality of echo references comprising at least one echo reference for each of a plurality of audio devices in an audio environment, each echo reference corresponding to audio data played back by one or more loudspeakers of one of the plurality of audio devices;
making, by the control system, an importance estimate for each echo reference of the plurality of echo references, wherein making the importance estimate involves determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment, the at least one echo management system comprising an Acoustic Echo Canceller (AEC), an Acoustic Echo Suppressor (AES), or both AEC and AES;
selecting, by the control system and based at least in part on the importance estimates, one or more selected echo references; and
the one or more selected echo references are provided by the control system to the at least one echo management system.
2. The audio processing method of EEE 1, further comprising causing at least one echo management system to cancel or suppress echo based at least in part on the one or more selected echo references.
3. The audio processing method of EEE 1 or EEE 2, wherein obtaining the plurality of echo references involves:
receiving a content stream comprising audio data; and
one or more echo references of the plurality of echo references are determined based on the audio data.
4. The audio processing method of EEE 3, wherein the control system comprises an audio device control system of an audio device of the plurality of audio devices in the audio environment, the audio processing method further comprising:
rendering, by the audio device control system, the audio data for reproduction on the audio device to generate a local speaker feed signal; and
a local echo reference corresponding to the local speaker feed signal is determined.
5. The audio processing method of EEE 4, wherein obtaining the plurality of echo references involves determining one or more non-local echo references based on the audio data, each of the non-local echo references corresponding to a non-local speaker feed for playback on another audio device of the audio environment.
6. The audio processing method of EEE 4, wherein obtaining the plurality of echo references involves receiving one or more non-local echo references, each of the non-local echo references corresponding to a non-local speaker feed for playback on another audio device of the audio environment.
7. The audio processing method of EEE 6, wherein receiving the one or more non-local echo references involves receiving the one or more non-local echo references from one or more other audio devices of the audio environment.
8. The audio processing method of EEE 6, wherein receiving the one or more non-local echo references involves receiving each of the one or more non-local echo references from a single other device of the audio environment.
9. The audio processing method of any of EEEs 1-8, further comprising a cost determination involving determining a cost of at least one of the plurality of echo references, wherein selecting the one or more selected echo references is based at least in part on the cost determination.
10. The audio processing method of EEE 9, wherein the cost determination is based on network bandwidth required for transmitting the at least one echo reference, coding calculation requirements for coding the at least one echo reference, decoding calculation requirements for decoding the at least one echo reference, echo management system calculation requirements for using the at least one echo reference by the echo management system, or a combination thereof.
11. The audio processing method of EEE 9 or EEE 10, wherein the cost determination is based on a replica of the at least one echo reference in the time or frequency domain, a downsampled version of the at least one echo reference, lossy compression of the at least one echo reference, segmented power information of the at least one echo reference, or a combination thereof.
12. The audio processing method of any of EEEs 9-11, wherein the cost determination is based on a method that compresses relatively more important echo references less than relatively less important echo references.
13. The audio processing method of any of EEEs 1-12, further comprising determining a current echo management system performance level, wherein selecting the one or more selected echo references is based at least in part on the current echo management system performance level.
14. The audio processing method of any of EEEs 1-13, wherein making the importance estimate involves determining an importance metric for a corresponding echo reference.
15. The audio processing method of EEE 14, wherein determining the importance metric involves determining a level of the corresponding echo reference, determining a uniqueness of the corresponding echo reference, determining a temporal persistence of the corresponding echo reference, determining an audibility of the corresponding echo reference, or a combination thereof.
16. The audio processing method of EEE 14 or EEE 15, wherein determining the importance metric is based at least in part on metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmix matrix, a loudspeaker activation matrix, or a combination thereof.
17. The audio processing method of any of EEEs 14-16, wherein determining the importance metric is based at least in part on a current listening objective, a current ambient noise estimate, an estimate of a current performance of the at least one echo management system, or a combination thereof.
18. An apparatus configured to perform the method of any of EEEs 1-17.
19. A system configured to perform the method of any of EEEs 1-17.
20. One or more non-transitory media having software stored thereon, the software comprising instructions for controlling one or more devices to perform the method of any of EEEs 1-17.
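By way of illustration only, and not as a definitive implementation of EEE 1, the following sketch wires together the example functions from the preceding sketches. The control system attributes (local_device_id, peer_links, render) and the echo management system interface (current_performance(), set_references()) are assumptions, not elements of this disclosure.

```python
def manage_echo(control_system, audio_data, ems):
    # Obtain one echo reference per audio device (local plus non-local).
    refs = obtain_echo_references(audio_data,
                                  control_system.local_device_id,
                                  control_system.peer_links,
                                  control_system.render)
    # Estimate importance of each reference against all of the others.
    importance = {r.device_id: importance_metric(
                      r.samples,
                      [o.samples for o in refs if o is not r])
                  for r in refs}
    # Cost each reference under a representation chosen by its importance.
    cost = {r.device_id: echo_reference_cost(
                choose_representation(importance[r.device_id]),
                ems_compute=0.1)
            for r in refs}
    # Select references and provide them to the AEC/AES for echo mitigation.
    selected = select_echo_references([r.device_id for r in refs],
                                      importance, cost,
                                      ems_performance=ems.current_performance())
    ems.set_references([r for r in refs if r.device_id in selected])
```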

Claims (20)

1. An audio processing method for managing echoes of a first audio device of a plurality of audio devices of an audio system, wherein each audio device of the plurality of audio devices comprises one or more loudspeakers, wherein the first audio device further comprises a control system, wherein the control system comprises an echo management system comprising an Acoustic Echo Canceller (AEC), an Acoustic Echo Suppressor (AES), or both AEC and AES, the method comprising:
obtaining, by the control system of the first audio device, a plurality of echo references including at least one echo reference for each of the plurality of audio devices, each echo reference corresponding to audio data played back by the one or more loudspeakers of the corresponding audio device;
making, by the control system, an importance estimate for each echo reference of the plurality of echo references, wherein making the importance estimate involves determining an expected contribution of each echo reference to echo mitigation by the echo management system of the first audio device;
selecting, by the control system and based at least in part on the importance estimate, one or more echo references from the plurality of echo references;
providing, by the control system, the one or more selected echo references to the echo management system; and
an echo is suppressed or canceled by the echo management system of the first audio device based at least in part on the one or more selected echo references.
2. The audio processing method of claim 1, wherein obtaining the plurality of echo references involves:
receiving a content stream comprising audio data; and
one or more echo references of the plurality of echo references are determined based on the audio data.
3. The audio processing method of claim 2, further comprising:
rendering, by the control system, the audio data for reproduction on the first audio device to produce a local speaker feed signal; and
a local echo reference corresponding to the local speaker feed signal is determined.
4. The audio processing method of claim 3, wherein obtaining the plurality of echo references involves determining one or more non-local echo references based on the audio data, each of the non-local echo references corresponding to a non-local speaker feed for playback on another audio device of the audio environment.
5. The audio processing method of claim 3, wherein obtaining the plurality of echo references involves receiving one or more non-local echo references, each of the non-local echo references corresponding to a non-local speaker feed for playback on another audio device of the audio environment.
6. The audio processing method of claim 5, wherein receiving the one or more non-local echo references involves receiving the one or more non-local echo references from one or more other audio devices of the audio environment.
7. The audio processing method of claim 5, wherein receiving the one or more non-local echo references involves receiving each of the one or more non-local echo references from a single other device of the audio environment.
8. The audio processing method of any of claims 1 to 7, further comprising a cost determination involving determining a cost of at least one of the plurality of echo references, wherein selecting the one or more selected echo references is based at least in part on the cost determination.
9. The audio processing method of claim 8, wherein the cost determination is based on network bandwidth required for transmitting the at least one echo reference, coding calculation requirements for coding the at least one echo reference, decoding calculation requirements for decoding the at least one echo reference, echo management system calculation requirements for using the at least one echo reference by the echo management system, or a combination thereof.
10. The audio processing method of claim 8 or claim 9, wherein the cost determination is based on a replica of the at least one echo reference in the time or frequency domain, a downsampled version of the at least one echo reference, lossy compression of the at least one echo reference, segmented power information of the at least one echo reference, a method that compresses relatively more important echo references less than relatively less important echo references, or a combination thereof.
11. The audio processing method of any of claims 8 to 10, wherein the cost determination is based on a method that compresses relatively more important echo references less than relatively less important echo references.
12. The audio processing method of any of claims 1 to 11, further comprising determining a current echo management system performance level, wherein selecting the one or more selected echo references is based at least in part on the current echo management system performance level.
13. The audio processing method of any of claims 1 to 12, wherein making the importance estimate involves determining an importance metric for a corresponding echo reference.
14. The audio processing method of claim 13, wherein determining the importance metric is based at least in part on a level of the corresponding echo reference, a uniqueness of the corresponding echo reference, a temporal persistence of the corresponding echo reference, an audibility of the corresponding echo reference, or a combination thereof.
15. The audio processing method of claim 13 or claim 14, wherein determining the importance metric is based at least in part on metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmix matrix, a loudspeaker activation matrix, or a combination thereof.
16. The audio processing method of any of claims 13 to 15, wherein determining the importance metric is based at least in part on a current listening objective, a current ambient noise estimate, an estimate of a current performance of the echo management system, or a combination thereof.
17. The audio processing method of any of the preceding claims, wherein the audio devices of the audio system are communicatively coupled via a wired or wireless communication network, and wherein the plurality of echo references are obtained via the wired or wireless communication network.
18. An apparatus configured to perform the method of any one of claims 1 to 17.
19. A system configured to perform the method of any one of claims 1 to 17.
20. One or more non-transitory media having software stored thereon, the software comprising instructions for controlling one or more devices to perform the method of any of claims 1 to 17.
CN202280013990.5A 2021-02-09 2022-02-07 Echo reference prioritization and selection Pending CN116830561A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US63/147,573 2021-02-09
US202163201939P 2021-05-19 2021-05-19
US63/201,939 2021-05-19
EP21177382.5 2021-06-02
PCT/US2022/015529 WO2022173706A1 (en) 2021-02-09 2022-02-07 Echo reference prioritization and selection

Publications (1)

Publication Number Publication Date
CN116830561A (en) 2023-09-29

Family

ID=88114965

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202280013949.8A Pending CN116830560A (en) 2021-02-09 2022-02-07 Echo reference generation and echo reference index estimation based on rendering information
CN202280013990.5A Pending CN116830561A (en) 2021-02-09 2022-02-07 Echo reference prioritization and selection

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202280013949.8A Pending CN116830560A (en) 2021-02-09 2022-02-07 Echo reference generation and echo reference index estimation based on rendering information

Country Status (1)

Country Link
CN (2) CN116830560A (en)

Also Published As

Publication number Publication date
CN116830560A (en) 2023-09-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination