CN116830599A - Pervasive acoustic mapping - Google Patents


Info

Publication number
CN116830599A
Authority
CN
China
Prior art keywords
audio
signal
playback
audio device
calibration
Prior art date
Legal status
Pending
Application number
CN202180089790.3A
Other languages
Chinese (zh)
Inventor
M. R. P. Thomas
B. J. Southwell
A. Bruni
O. M. Townsend
D. Arteaga
D. Scaini
C. G. Hines
A. J. Seefeldt
D. Gunawan
C. P. Brown
Current Assignee
Dolby International AB
Dolby Laboratories Licensing Corp
Original Assignee
Dolby International AB
Dolby Laboratories Licensing Corp
Priority date
Filing date
Publication date
Application filed by Dolby International AB, Dolby Laboratories Licensing Corp
Priority claimed from PCT/IB2021/000788 (WO2022118072A1)
Publication of CN116830599A


Abstract

Some methods may involve receiving a first content stream comprising a first audio signal, rendering the first audio signal to produce a first audio playback signal, generating a first calibration signal, generating a first modified audio playback signal by inserting the first calibration signal into the first audio playback signal, and causing a loudspeaker system to play back the first modified audio playback signal to generate a first audio device playback sound. The method(s) may involve receiving a microphone signal corresponding to at least a first audio device playback sound and a second through nth audio device playback sound corresponding to second through nth modified audio playback signals (including second through nth calibration signals) played back by the second through nth audio devices, extracting the second through nth calibration signals from the microphone signal, and estimating at least one acoustic scene metric based at least in part on the second through nth calibration signals.

Description

Pervasive acoustic mapping
Cross Reference to Related Applications
The present application claims the benefit of priority of Spanish patent application No. P202031212, filed December 3, 2020; U.S. provisional patent application No. 63/120,963, filed December 3, 2020; U.S. provisional patent application No. 63/120,887, filed December 3, 2020; U.S. provisional patent application No. 63/121,007, filed December 3, 2020; U.S. provisional patent application No. 63/121,085, filed December 3, 2020; U.S. provisional patent application No. 63/155,369, filed March 2, 2021; U.S. provisional patent application No. 63/201,561, filed May 4, 2021; Spanish patent application No. P202130458, filed May 20, 2021; U.S. provisional patent application No. 63/203,403, filed July 21, 2021; U.S. provisional patent application No. 63/224,778, filed July 22, 2021; Spanish patent application No. P202130724, filed July 26, 2021; U.S. provisional patent application No. 63/260,528, filed August 24, 2021; U.S. provisional patent application No. 63/260,529, filed August 24, 2021; U.S. provisional patent application No. 63/260,953, filed September 7, 2021; U.S. provisional patent application No. 63/260,954, filed September 7, 2021; and U.S. provisional patent application No. 63/261,769, filed September 28, 2021; each of which is hereby incorporated by reference.
Technical Field
The present disclosure relates to audio processing systems and methods.
Background
Audio devices and systems are widely deployed. Although systems and methods for estimating acoustic scene metrics (e.g., audio device audibility) exist, improved systems and methods are desirable.
Notation and Nomenclature
Throughout this disclosure, including in the claims, the terms "speaker", "loudspeaker" and "audio reproduction transducer" are used synonymously to denote any sound producing transducer (or set of transducers). A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), which may be driven by a single, common speaker feed or by multiple speaker feeds. In some examples, the speaker feed(s) may undergo different processing in different circuitry branches coupled to different transducers.
Throughout this disclosure, including in the claims, the expression "performing an operation on" a signal or data (e.g., filtering, scaling, transforming, or applying a gain to a signal or data) is used in a broad sense to mean performing an operation directly on a signal or data or on a processed version of a signal or data (e.g., a version of a signal that has been preliminarily filtered or preprocessed prior to performing an operation thereon).
Throughout this disclosure, including in the claims, the expression "system" is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem implementing a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, where the subsystem generates M inputs and the other X-M inputs are received from external sources) may also be referred to as a decoder system.
Throughout this disclosure, including in the claims, the term "processor" is used in a broad sense to mean that a system or device is programmable or otherwise configurable (e.g., using software or firmware) to perform operations on data (e.g., audio, or video or other image data). Examples of processors include field programmable gate arrays (or other configurable integrated circuits or chip sets), digital signal processors programmed and/or otherwise configured to perform pipelined processing of audio or other sound data, programmable general purpose processors or computers, and programmable microprocessor chips or chip sets.
Throughout this disclosure, including in the claims, the term "couples" or "coupled" is used to mean a direct or indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection, or through an indirect connection via other devices and connections.
As used herein, a "smart device" is an electronic device that is generally configured to communicate with one or more other devices (or networks) via various wireless protocols such as Bluetooth, Zigbee, near-field communication, Wi-Fi, light fidelity (Li-Fi), 3G, 4G, 5G, and the like, and that may operate to some degree interactively and/or autonomously. Several notable smart device types are smart phones, smart cars, smart thermostats, smart doorbells, smart locks, smart refrigerators, tablet phones and tablets, smart watches, smart bracelets, smart key chains, and smart audio devices. The term "smart device" may also refer to a device that exhibits some of the characteristics of pervasive computing, such as artificial intelligence.
Herein, we use the expression "smart audio device" to denote a smart device that is either a single-use audio device or a multi-purpose audio device (e.g., an audio device implementing at least some aspects of virtual assistant functionality). A single-use audio device is a device, such as a television (TV), that includes or is coupled to at least one microphone (and optionally also at least one speaker and/or at least one camera), and that is designed largely or primarily for a single purpose. For example, while a TV can typically play (and is considered capable of playing) audio from program material, in most cases a modern TV runs some operating system on which applications, including television-watching applications, run locally. In this sense, a single-use audio device having speaker(s) and microphone(s) is typically configured to run a local application and/or service to use the speaker(s) and microphone(s) directly. Some single-use audio devices may be configured to be grouped together to enable audio playback over a zone or a user-configured area.
One common type of multi-purpose audio device is an audio device that implements at least some aspects of virtual assistant functionality, although other aspects of virtual assistant functionality may be implemented by one or more other devices, such as one or more servers with which the multi-purpose audio device is configured to communicate. Such multi-purpose audio devices may be referred to herein as "virtual assistants". A virtual assistant is a device (e.g., a smart speaker or voice assistant integrated device) that includes or is coupled to at least one microphone (and optionally also includes or is coupled to at least one speaker and/or at least one camera). In some examples, the virtual assistant may provide the ability to use multiple devices (other than the virtual assistant itself) for applications that are in a sense cloud-enabled or that are otherwise not implemented entirely in or on the virtual assistant itself. In other words, at least some aspects of the virtual assistant functionality, such as voice recognition, may be implemented (at least in part) by one or more servers or other devices with which the virtual assistant may communicate via a network such as the internet. Virtual assistants can sometimes work together, for example, in a discrete and conditionally defined manner. For example, two or more virtual assistants may work together in the sense that one of them (e.g., the one that is most confident that it has heard the wake word) responds to the wake word. In some implementations, the connected virtual assistants may form a constellation that may be managed by a host application, which may be (or may implement) a virtual assistant.
In this document, the term "wake word" is used in a broad sense to refer to any sound (e.g., a word spoken by a person, or some other sound) that a smart audio device is configured to detect ("hear") and to wake up in response to, using at least one microphone contained in or coupled to the smart audio device, or at least one other microphone. In this context, to "wake up" means that the device enters a state in which it waits for (in other words, listens for) a voice command. In some cases, what may be referred to herein as a "wake word" may include more than one word, e.g., a phrase.
Herein, the expression "wake word detector" denotes a device (or software including instructions for configuring the device) configured to continuously search for alignment between real-time sound (e.g., speech) features and a trained model. Typically, a wake word event is triggered whenever the wake word detector determines that the probability of having detected a wake word exceeds a predefined threshold. For example, the threshold may be a predetermined threshold that is adjusted to give a reasonable tradeoff between the false acceptance rate and the false rejection rate. After a wake word event, the device may enter a state (which may be referred to as an "awake" state or a state of "attentiveness") in which it listens for a command and passes a received command on to a larger, more computationally intensive recognizer.
As used herein, the terms "program stream" and "content stream" refer to a collection of one or more audio signals, and in some cases, video signals, at least a portion of which are intended to be heard together. Examples include music selections, movie soundtracks, movies, television programs, audio portions of television programs, podcasts, real-time voice calls, synthesized voice responses from intelligent assistants, and the like. In some cases, the content stream may include multiple versions of at least a portion of the audio signal, e.g., the same dialog in more than one language. In this case, only one version of the audio data or a portion thereof (e.g., a version corresponding to one language) is intended to be reproduced at a time.
Disclosure of Invention
At least some aspects of the present disclosure may be implemented via one or more audio processing methods. In some cases, the method(s) may be implemented at least in part by a control system and/or via instructions (e.g., software) stored on one or more non-transitory media. Some methods may involve causing, by a control system, a first audio device of an audio environment to generate a first calibration signal, and causing, by the control system, the first calibration signal to be inserted into a first audio playback signal corresponding to a first content stream to generate a first modified audio playback signal for the first audio device. Some such methods may involve causing, by the control system, the first audio device to play back the first modified audio playback signal to generate a first audio device playback sound.
Some such methods may involve causing, by the control system, a second audio device of the audio environment to generate a second calibration signal; causing, by the control system, a second calibration signal to be inserted into the second content stream to generate a second modified audio playback signal for the second audio device; and causing, by the control system, the second audio device to play back the second modified audio playback signal to generate a second audio device playback sound.
Some such methods may involve causing, by a control system, at least one microphone of an audio environment to detect at least a first audio device playback sound and a second audio device playback sound and to generate microphone signals corresponding to the at least first audio device playback sound and the second audio device playback sound. Some such methods may involve causing, by the control system, extraction of the first calibration signal and the second calibration signal from the microphone signal. Some such methods may involve estimating, by the control system, at least one acoustic scene metric based at least in part on the first calibration signal and the second calibration signal.
In some embodiments, the control system may be an orchestration device control system.
In some examples, the first calibration signal may correspond to a first sub-audio component of the first audio device playback sound and the second calibration signal may correspond to a second sub-audio component of the second audio device playback sound. According to some examples, the first calibration signal may be or may comprise a first direct sequence spread spectrum (DSSS) signal and the second calibration signal may be or may comprise a second DSSS signal.
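By way of illustration only, the following sketch shows one way a DSSS-style calibration signal could be generated and mixed into a rendered audio playback signal. It is a minimal example rather than an implementation of the disclosed embodiments; the sample rate, chip rate, carrier frequency, code length, and mixing level are arbitrary assumptions.

```python
# Minimal illustrative sketch (assumed parameters, not the disclosed implementation):
# spread a pseudorandom +/-1 code over a sinusoidal carrier and mix the result
# into a rendered playback signal at a low level.
import numpy as np

fs = 48000                 # sample rate in Hz (assumed)
chip_rate = 375            # chips per second (assumed); 48000/375 = 128 samples per chip
carrier_freq = 500.0       # carrier frequency in Hz (assumed)
level = 0.01               # linear gain of the calibration component (assumed)

rng = np.random.default_rng(seed=1)        # the seed stands in for a device-specific code
code = rng.choice([-1.0, 1.0], size=127)   # pseudorandom spreading code ("chips")

chips = np.repeat(code, fs // chip_rate)   # hold each chip for one chip period
t = np.arange(chips.size) / fs
carrier = np.sin(2.0 * np.pi * carrier_freq * t)
calibration_signal = level * chips * carrier

def inject(playback, calibration):
    """Insert the calibration signal into the audio playback signal by mixing."""
    n = min(playback.size, calibration.size)
    modified = playback.copy()
    modified[:n] += calibration[:n]
    return modified

playback_signal = 0.1 * np.sin(2.0 * np.pi * 220.0 * np.arange(fs) / fs)  # stand-in content
modified_playback_signal = inject(playback_signal, calibration_signal)
```

In a real system the code, carrier and level would presumably be chosen (e.g., by an orchestration device) so that the content component masks the calibration component, as discussed below.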
Some methods may involve causing, by the control system, the first gap to be inserted into a first frequency range of the first audio playback signal or the first modified audio playback signal during a first time interval of the first content stream. The first gap may be or may include an attenuation of the first audio playback signal in the first frequency range. In some such examples, the first modified audio playback signal and the first audio device playback sound may include a first gap.
Some methods may involve causing, by the control system, the first gap to be inserted into a first frequency range of the second audio playback signal or the second modified audio playback signal during a first time interval. In some such examples, the second modified audio playback signal and the second audio device playback sound may include a first gap.
Some methods may involve extracting, by a control system, audio data from microphone signals in at least a first frequency range to produce extracted audio data. Some such methods may involve estimating, by the control system, at least one acoustic scene metric based at least in part on the extracted audio data.
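The following sketch illustrates, under assumed parameter values, one way a "gap" could be inserted (by attenuating one frequency range of the playback signal during one time interval) and how audio data in that frequency range might then be extracted from a microphone signal, e.g., for noise estimation. It is not taken from the disclosed embodiments; the band edges, time interval, and filter design are assumptions.

```python
# Illustrative sketch only (assumed band, interval, and filters):
# attenuate one frequency band of the playback signal during a time interval
# ("gap"), then estimate the noise power in that band from the microphone
# signal captured during the gap.
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 48000
band_hz = (1000.0, 1500.0)     # first frequency range (assumed)
gap_interval_s = (1.0, 1.4)    # first time interval (assumed)

sos_stop = butter(4, band_hz, btype="bandstop", fs=fs, output="sos")
sos_pass = butter(4, band_hz, btype="bandpass", fs=fs, output="sos")

def insert_gap(playback):
    """Attenuate band_hz during gap_interval_s, leaving the rest of the signal unchanged."""
    i0, i1 = (int(t * fs) for t in gap_interval_s)
    modified = playback.copy()
    modified[i0:i1] = sosfiltfilt(sos_stop, playback[i0:i1])
    return modified

def band_noise_power(mic):
    """Estimate the noise power in band_hz from the microphone signal during the gap."""
    i0, i1 = (int(t * fs) for t in gap_interval_s)
    residual = sosfiltfilt(sos_pass, mic[i0:i1])
    return float(np.mean(residual ** 2))
```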
Some methods may involve controlling gap insertion and calibration signal generation such that the calibration signal corresponds to neither the gap time interval nor the gap frequency range. Some methods may involve controlling gap insertion and calibration signal generation based at least in part on time since noise was estimated in at least one frequency band. Some methods may involve controlling gap insertion and calibration signal generation based at least in part on a signal-to-noise ratio of a calibration signal of at least one audio device in at least one frequency band.
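A hypothetical decision rule along these lines is sketched below; the thresholds are assumptions introduced solely for illustration.

```python
# Hypothetical rule (thresholds are assumptions): schedule a gap in a frequency
# band when the noise estimate for that band is stale, or when the calibration
# signal's signal-to-noise ratio in that band is too low for reliable extraction.
MAX_NOISE_AGE_S = 60.0   # assumed limit on the age of a band's noise estimate
MIN_CAL_SNR_DB = 6.0     # assumed minimum calibration-signal SNR

def needs_gap(seconds_since_noise_estimate, calibration_snr_db):
    stale_noise_estimate = seconds_since_noise_estimate > MAX_NOISE_AGE_S
    weak_calibration_signal = calibration_snr_db < MIN_CAL_SNR_DB
    return stale_noise_estimate or weak_calibration_signal
```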
Some methods may involve causing the target audio device to play back an unmodified audio playback signal of the target device content stream to generate target audio device playback sound. Some such methods may involve estimating, by the control system, target audio device audibility and/or target audio device location based at least in part on the extracted audio data. In some such examples, the unmodified audio playback signal does not include the first gap. According to some such examples, the microphone signal may also correspond to the target audio device playback sound. In some cases, the unmodified audio playback signal does not include gaps inserted into any frequency range.
In some examples, the at least one acoustic scene metric includes time of flight, time of arrival, direction of arrival, range, audio device audibility, audio device impulse response, angle between audio devices, audio device location, audio environment noise, signal-to-noise ratio, or a combination thereof. According to some implementations, causing the at least one acoustic scene metric to be estimated may involve estimating the at least one acoustic scene metric. In some implementations, causing the at least one acoustic scene metric to be estimated may involve causing another device to estimate the at least one acoustic scene metric. Some examples may involve controlling one or more aspects of audio device playback based at least in part on the at least one acoustic scene metric.
According to some implementations, the first content stream component of the first audio device playback sound may cause perceptual masking of the first calibration signal component of the first audio device playback sound. In some such implementations, the second content stream component of the second audio device playback sound may cause perceptual masking of the second calibration signal component of the second audio device playback sound.
Some examples may involve causing, by the control system, third through nth calibration signals to be generated by third through nth audio devices of the audio environment. Some such examples may involve causing, by the control system, third through nth calibration signals to be inserted into the third through nth content streams to generate third through nth modified audio playback signals for the third through nth audio devices. Some such examples may involve causing, by the control system, the third through nth audio devices to play back corresponding instances of the third through nth modified audio playback signals to generate third through nth instances of audio device playback sounds.
Some such examples may involve causing, by the control system, at least one microphone of each of the first through nth audio devices to detect first through nth instances of audio device playback sound and to generate microphone signals corresponding to the first through nth instances of audio device playback sound. In some cases, the first through nth instances of the audio device playback sound may include the first audio device playback sound, the second audio device playback sound, and the third through nth instances of the audio device playback sound. Some such examples may involve extracting, by the control system, first through nth calibration signals from the microphone signal. In some implementations, at least one acoustic scene metric may be estimated based at least in part on the first through nth calibration signals.
Some examples may involve determining one or more calibration signal parameters for a plurality of audio devices in an audio environment. In some cases, one or more calibration signal parameters may be available to generate the calibration signal. Some examples may involve providing one or more calibration signal parameters to each of a plurality of audio devices. In some such embodiments, determining the one or more calibration signal parameters may involve scheduling a time slot for playback of the modified audio playback signal for each of the plurality of audio devices. In some examples, the first time slot of the first audio device may be different from the second time slot of the second audio device.
In some examples, determining the one or more calibration signal parameters may involve determining a frequency band for playback of the modified audio playback signal for each of the plurality of audio devices. In some such examples, the first frequency band of the first audio device may be different from the second frequency band of the second audio device.
According to some examples, determining the one or more calibration signal parameters may involve determining a DSSS spreading code for each of a plurality of audio devices. In some cases, the first spreading code of the first audio device may be different from the second spreading code of the second audio device. Some examples may involve determining at least one spreading code length based at least in part on audibility of a corresponding audio device.
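Purely as an illustration of the kind of parameter assignment described in the preceding paragraphs, the sketch below gives each audio device a time slot, a frequency band, a spreading-code seed, and a code length that grows as the device's audibility decreases. The slot duration, band plan, code lengths, and audibility threshold are assumptions, not values taken from this disclosure.

```python
# Illustrative parameter assignment by an orchestration device (all values assumed):
# each device receives a time slot, a frequency band, a distinct spreading-code
# seed, and a code length that increases as the device's audibility decreases
# (a longer code gives more processing gain when despreading).
SLOT_DURATION_S = 2.0
BANDS_HZ = [(500, 1000), (1000, 2000), (2000, 4000), (4000, 8000)]

def assign_calibration_parameters(audibility_db_by_device):
    """audibility_db_by_device: dict mapping device id -> estimated audibility in dB."""
    parameters = {}
    for i, (device, audibility_db) in enumerate(sorted(audibility_db_by_device.items())):
        code_length = 127 if audibility_db >= -20.0 else 1023
        parameters[device] = {
            "time_slot_s": (i * SLOT_DURATION_S, (i + 1) * SLOT_DURATION_S),
            "band_hz": BANDS_HZ[i % len(BANDS_HZ)],
            "code_seed": i + 1,          # stands in for a device-specific spreading code
            "code_length": code_length,
        }
    return parameters

print(assign_calibration_parameters({"100A": -10.0, "100B": -25.0, "100C": -15.0}))
```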
In some examples, determining the one or more calibration signal parameters may involve applying an acoustic model that is based at least in part on the mutual audibility of each of a plurality of audio devices in the audio environment.
Some methods may involve determining that a calibration signal parameter of an audio device is at a maximum level of robustness. Some such methods may involve determining that a calibration signal from an audio device cannot be successfully extracted from a microphone signal. Some such methods may involve causing all other audio devices to mute at least a portion of their corresponding audio device playback sounds. In some examples, the portion may be or may include a calibration signal component.
Some implementations may involve having each of a plurality of audio devices in an audio environment play back a modified audio playback signal simultaneously.
According to some examples, at least a portion of the first audio playback signal, at least a portion of the second audio playback signal, or at least a portion of each of the first audio playback signal and the second audio playback signal corresponds to silence.
Some or all of the operations, functions, and/or methods described herein may be performed by one or more devices in accordance with instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to Random Access Memory (RAM) devices, read Only Memory (ROM) devices, and the like. Thus, some innovative aspects of the subject matter described in this disclosure can be implemented via one or more non-transitory media having software stored thereon.
At least some aspects of the present disclosure may be implemented via an apparatus or system. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some embodiments, the apparatus is or includes an audio processing system having an interface system and a control system. The control system may include one or more general purpose single or multi-chip processors, digital Signal Processors (DSPs), application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or a combination thereof.
The details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
Drawings
Like reference numbers and designations in the various drawings indicate like elements.
Fig. 1A illustrates an example of an audio environment.
Fig. 1B is a block diagram illustrating an example of components of an apparatus capable of implementing aspects of the present disclosure.
Fig. 2 is a block diagram illustrating an example of an audio device element according to some disclosed embodiments.
Fig. 3 is a block diagram illustrating an example of an audio device element according to another disclosed embodiment.
Fig. 4 is a block diagram illustrating an example of an audio device element according to another disclosed embodiment.
Fig. 5 is a diagram showing an example of the content stream component of audio device playback sound and the level of Direct Sequence Spread Spectrum (DSSS) signal component of audio device playback sound over a frequency range.
Fig. 6 is a graph showing an example of the power of two calibration signals having different bandwidths but at the same center frequency.
FIG. 7 illustrates elements of an orchestration module according to one example.
Fig. 8 shows another example audio environment.
Fig. 9 shows an example of acoustic calibration signals generated by the audio devices 100B and 100C of fig. 8.
Fig. 10 is a diagram providing an example of a Time Division Multiple Access (TDMA) method.
Fig. 11 is a diagram showing an example of a Frequency Division Multiple Access (FDMA) method.
Fig. 12 is a diagram showing another example of the arrangement method.
Fig. 13 is a diagram showing another example of the arrangement method.
Fig. 14 illustrates elements of an audio environment according to another example.
Fig. 15 is a flowchart outlining another example of the disclosed audio device orchestration method.
Fig. 16 illustrates another example audio environment.
Fig. 17 is a block diagram illustrating an example of a calibration signal demodulator element, a baseband processor element, and a calibration signal generator element, according to some disclosed embodiments.
Fig. 18 shows elements of a calibration signal demodulator according to another example.
Fig. 19 is a block diagram illustrating an example of baseband processor elements according to some disclosed embodiments.
Fig. 20 shows an example of a delay waveform.
Fig. 21 shows another example audio environment.
Fig. 22A is an example of modifying a spectrogram of an audio playback signal.
Fig. 22B is a diagram showing an example of a gap in the frequency domain.
Fig. 22C is a diagram showing an example of a gap in the time domain.
Fig. 22D illustrates an example of modifying an audio playback signal, including orchestrated gaps for multiple audio devices of an audio environment.
Fig. 23A is a diagram showing an example of a filter response for creating a gap and a filter response for measuring a frequency interval of a microphone signal used during a measurement session.
Fig. 23B, 23C, 23D, and 23E are diagrams showing examples of gap allocation strategies.
Fig. 24 shows another example audio environment.
Fig. 25A is a flowchart outlining one example of a method that may be performed by an apparatus such as that shown in fig. 1B.
FIG. 25B is a block diagram of elements of one example of an embodiment configured to implement a region classifier.
FIG. 26 presents a block diagram of one example of a system for orchestrated gap insertion.
Figs. 27A and 27B illustrate a system block diagram showing examples of elements of an orchestration device and elements of an orchestrated audio device according to some disclosed embodiments.
Fig. 28 is a flowchart outlining another example of the disclosed audio device orchestration method.
Fig. 29 is a flowchart outlining another example of the disclosed audio device orchestration method.
Fig. 30 shows an example of time-frequency allocation of the calibration signal, the gap for noise estimation, and the gap for hearing a single audio device.
Fig. 31 depicts an audio environment, which in this example is a living space.
Fig. 32, 33 and 34 are block diagrams representing three types of disclosed embodiments.
Fig. 35 shows an example of a heat map.
Fig. 36 is a block diagram showing an example of another embodiment.
Fig. 37 is a flowchart outlining one example of another method that may be performed by an apparatus or system such as that disclosed herein.
Fig. 38 is a block diagram showing an example of a system according to another embodiment.
FIG. 39 is a flowchart outlining one example of another method that may be performed by an apparatus or system such as that disclosed herein.
Fig. 40 shows a plan view example of another audio environment, which in this case is a living space.
Fig. 41 shows an example of a geometric relationship between four audio devices in an environment.
Fig. 42 shows an audio transmitter located within the audio environment of fig. 41.
Fig. 43 illustrates an audio receiver located within the audio environment of fig. 41.
FIG. 44 is a flowchart outlining another example of a method that may be performed by a control system of an apparatus such as that shown in FIG. 1B.
Fig. 45 is a flowchart outlining an example of a method for automatically estimating device location and orientation based on direction of arrival (DOA) data.
Fig. 46 is a flowchart outlining an example of a method for automatically estimating device location and orientation based on DOA data and time of arrival (TOA) data.
FIG. 47 is a flowchart outlining another example of a method for automatically estimating device location and orientation based on DOA data and TOA data.
Fig. 48A illustrates another example of an audio environment.
Fig. 48B shows an example of determining listener angle orientation data.
Fig. 48C shows an additional example of determining listener angular orientation data.
Fig. 48D shows one example of determining an appropriate rotation of the audio device coordinates according to the method described with reference to fig. 48C.
Fig. 49 is a flowchart outlining another example of a positioning method.
Fig. 50 is a flowchart outlining another example of a positioning method.
Fig. 51 depicts a plan view of another listening environment, which in this example is a living space.
Fig. 52 is a diagram indicating points of speaker activation in an example embodiment.
Fig. 53 is a diagram of tri-linear interpolation between points indicating speaker activation according to one example.
FIG. 54 is a block diagram of a minimum version of another embodiment.
Fig. 55 depicts another (more capable) embodiment with additional features.
FIG. 56 is a flowchart outlining another example of the disclosed method.
Detailed Description
To achieve attractive spatial playback of media and entertainment content, the physical layout and relative capabilities of the available speakers should be evaluated and taken into account. Also, in order to provide high-quality voice-driven interactions (with virtual assistants and with remote talkers), the user needs to be heard, and also needs to hear the conversation reproduced via the loudspeakers. It is expected that as more cooperating devices are added to the audio environment, the combined utility to the user will increase, because a device will more often be within convenient voice range. A greater number of speakers also allows for better immersion, owing to the additional spatial options available for media presentation.
Adequate coordination and cooperation between devices may allow these opportunities and experiences to be realized. Acoustic information about each audio device is a key component of this coordination and collaboration. Such acoustic information may include audibility of each loudspeaker from different locations in the audio environment, as well as the amount of noise in the audio environment.
Some previous methods of mapping and calibrating constellations of smart audio devices require a special calibration procedure in which known stimuli are played from the audio devices (typically one audio device at a time) while one or more microphones record. While this process can be made attractive to a particular user population through inventive sound design, the process needs to be re-executed repeatedly as devices are added, removed, or even simply repositioned, which prevents widespread adoption. Imposing such a process on the user can interfere with the normal operation of the devices and can frustrate some users.
A more basic but also popular approach is manual user intervention via a software application ("app") and/or a guided process by which a user indicates the physical locations of the audio devices in an audio environment. This approach presents a further obstacle to user adoption and may provide the system with less information than a dedicated calibration process.
Calibration and mapping algorithms typically require some basic acoustic information for each audio device in an audio environment. Many such methods have been proposed, using a range of different basic acoustic measurements and measured acoustic properties. Examples of acoustic properties derived from microphone signals (also referred to herein as "acoustic scene metrics") for such algorithms include:
- an estimate of the physical distance between devices (acoustic ranging);
- an estimate of the angle between devices (direction of arrival (DoA)) (a toy sketch of one such estimate follows this list);
- an estimate of the impulse response between devices (e.g., obtained via swept sine wave stimuli or other measurement signals); and
- a background noise estimate.
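As a toy illustration of the second item above, the sketch below estimates the direction of arrival of another device's playback sound from the time difference of arrival at two microphones of a receiving device. The microphone spacing, sample rate, and simulated delay are assumptions, and this is not asserted to be the method of the disclosed embodiments.

```python
# Toy direction-of-arrival (DoA) sketch (assumed geometry, not from this disclosure):
# estimate the angle of an arriving signal from the lag of the cross-correlation
# peak between two microphones spaced d metres apart.
import numpy as np

fs = 48000      # sample rate in Hz (assumed)
c = 343.0       # speed of sound in m/s
d = 0.10        # microphone spacing in metres (assumed)

def estimate_doa_degrees(mic_left, mic_right):
    """Return the DoA in degrees relative to broadside of the two-microphone pair."""
    correlation = np.correlate(mic_left, mic_right, mode="full")
    lag_samples = int(np.argmax(correlation)) - (len(mic_right) - 1)
    tdoa_s = lag_samples / fs
    sin_theta = np.clip(c * tdoa_s / d, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))

# Example: simulate a signal that reaches the left microphone 3 samples before the right.
rng = np.random.default_rng(0)
signal = rng.standard_normal(4096)
left, right = signal, np.roll(signal, 3)
print(estimate_doa_degrees(left, right))   # roughly -12 degrees under these assumptions
```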
However, existing calibration and mapping algorithms are generally not implemented to respond to changes in the acoustic scene of the audio environment, such as movement of people within the audio environment, repositioning of audio devices within the audio environment, and the like.
Orchestration systems such as the intelligent audio devices disclosed herein may provide a user with flexibility to place the devices anywhere in a listening environment (also referred to herein as an audio environment). In some implementations, the audio device is configured for self-organization and auto-calibration.
Calibration can be conceptually divided into two or more layers. One such layer relates to what may be referred to herein as a "geometric map". Geometric mapping may involve discovering the physical locations and orientations of the intelligent audio devices and of one or more persons in the audio environment. In some examples, geometric mapping may involve discovering the physical locations of noise sources and/or of traditional audio devices such as televisions ("TVs") and sound bars. Geometric mapping is important for a number of reasons. For example, it is important to provide accurate geometric mapping information to a flexible renderer in order to properly render sound scenes. In contrast, conventional systems employing canonical loudspeaker layouts (such as 5.1) are designed based on the assumptions that the loudspeakers will be placed in predetermined positions and that the listener will sit at a "sweet spot," facing the center loudspeaker and/or halfway between the left and right front loudspeakers.
The second conceptual layer of calibration involves the processing of audio data (e.g., audio calibration and equalization) to account for manufacturing variations in loudspeakers, room placement, acoustic effects, and the like. In conventional cases, particularly with sound bars and audio/video receivers (AVRs), the user may optionally apply manual gain and EQ curves, or place a dedicated reference microphone at the listening position for automatic calibration. However, it is well known that the proportion of the population willing to go to such lengths is very small. Thus, an orchestration system of intelligent audio devices requires a method to automate this audio processing (especially calibration and EQ) without the use of reference microphones at the listener's position; this may be referred to herein as "audibility mapping". The geometric map and the audibility map constitute the two main components of what will be referred to herein as an "acoustic map".
The present disclosure describes a variety of techniques that may be used in various combinations to provide automated acoustic mapping. Acoustic mapping may be pervasive and persistent. Such acoustic mapping may sometimes be referred to as "continuous" in the sense that it may continue after the initial setup process and may be responsive to changing conditions in the audio environment, such as changing noise sources and/or levels, loudspeaker repositioning, deployment of additional loudspeakers, repositioning and/or reorientation of one or more listeners, etc.
Some disclosed methods involve generating a calibration signal that is injected (e.g., mixed) into audio content being rendered by an audio device in an audio environment. In some such examples, the calibration signal may be or may include an acoustic Direct Sequence Spread Spectrum (DSSS) signal.
In other examples, the calibration signal may be or may include other types of acoustic calibration signals, such as swept sinusoidal acoustic signals, white noise, a "colored noise" such as pink noise (a frequency spectrum whose intensity decreases at a rate of three decibels per octave), acoustic signals corresponding to music, and the like. Such an approach may enable an audio device to generate observations after receiving calibration signals transmitted by other audio devices in an audio environment. In some implementations, each participating audio device in the audio environment may be configured to generate an acoustic calibration signal, inject the acoustic calibration signal into the rendered loudspeaker feed signal to produce a modified audio playback signal, and cause the loudspeaker system to play back the modified audio playback signal to generate a first audio device playback sound. In some implementations, each participating audio device in the audio environment may be configured to do the foregoing while also detecting audio device playback sounds from other orchestrated audio devices in the audio environment and processing the audio device playback sounds to extract the acoustic calibration signal. Thus, although detailed examples using acoustic DSSS signals are provided herein, these should be considered as specific examples in the broader class of acoustic calibration signals.
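As an aside, the sketch below generates an approximation of the pink noise mentioned above (noise whose intensity falls at roughly three decibels per octave) by shaping white noise with a 1/sqrt(f) spectral envelope. It is merely one illustration of an alternative calibration stimulus and is not part of the disclosed embodiments.

```python
# Illustrative pink-noise generator (one possible alternative calibration stimulus):
# shape white noise with a 1/sqrt(f) spectral envelope so that the power falls
# by approximately 3 dB per octave.
import numpy as np

def pink_noise(n_samples, fs=48000, seed=0):
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n_samples)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    freqs[0] = freqs[1]                    # avoid dividing by zero at DC
    spectrum = spectrum / np.sqrt(freqs)   # power proportional to 1/f: about -3 dB per octave
    pink = np.fft.irfft(spectrum, n=n_samples)
    return pink / np.max(np.abs(pink))     # normalize to full scale

noise = pink_noise(48000)
```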
DSSS signals have previously been deployed in telecommunications environments. When DSSS signals are used in a telecommunications environment, they are used to spread the transmitted data to a wider frequency range before the data is sent over a channel to a receiver. In contrast, most or all of the disclosed embodiments do not involve using DSSS signals to modify or transmit data. Rather, such disclosed embodiments relate to transmitting DSSS signals between audio devices in an audio environment. What happens to the transmitted DSSS signal between transmission and reception is itself the transmitted information. This is an important distinction between how DSSS signals are used in a telecommunications environment and how DSSS signals are used in the disclosed embodiments.
Furthermore, the disclosed embodiments relate to transmitting and receiving acoustic DSSS signals, rather than electromagnetic DSSS signals. In many disclosed embodiments, the acoustic DSSS signal is inserted into a content stream that has been rendered for playback such that the acoustic DSSS signal is included in the audio that is played back. According to some such embodiments, the acoustic DSSS signal is inaudible to humans, so that humans in the audio environment will not perceive the acoustic DSSS signal, but will only detect the audio content being played back.
Another distinction between the use of acoustic DSSS signals disclosed herein and how DSSS signals are used in a telecommunications environment relates to what may be referred to herein as the "near/far problem." In some cases, the acoustic DSSS signals disclosed herein may be transmitted and received by a number of audio devices in an audio environment. The acoustic DSSS signals may potentially overlap in time and frequency. Some disclosed embodiments rely on how the DSSS spreading codes are generated in order to separate the acoustic DSSS signals. In some cases, the audio devices may be so close to each other that the signal levels may affect acoustic DSSS signal separation, and thus it may be difficult to separate the signals. This is a manifestation of the near/far problem, for which some solutions are disclosed herein.
Some methods may involve receiving a first content stream comprising a first audio signal, rendering the first audio signal to produce a first audio playback signal, generating a first calibration signal, generating a first modified audio playback signal by inserting the first calibration signal into the first audio playback signal, and causing a loudspeaker system to play back the first modified audio playback signal to generate a first audio device playback sound. The method(s) may involve receiving a microphone signal corresponding to at least a first audio device playback sound and a second through nth audio device playback sound corresponding to a second through nth modified audio playback signal (including second through nth calibration signals) played back by the second through nth audio devices, extracting the second through nth calibration signals from the microphone signal, and estimating at least one acoustic scene metric based at least in part on the second through nth calibration signals.
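One plausible way to extract a known calibration signal from a microphone signal (it is not asserted to be the method of the disclosed embodiments) is matched filtering against a replica of that signal: the lag of the correlation peak yields a delay estimate and the normalized peak height gives a crude audibility indicator. The sketch below, with an assumed sample rate and a synthetic replica, illustrates the idea.

```python
# Illustrative extraction by cross-correlation with a known replica (assumptions only):
# the lag of the correlation peak gives a delay estimate; the peak-to-median ratio
# is a crude indicator of how audible the calibration signal is at this microphone.
import numpy as np

def extract_calibration(mic, replica, fs=48000):
    """Return (delay_seconds, peak_to_noise_ratio) for one calibration signal replica."""
    correlation = np.correlate(mic, replica, mode="valid")
    peak_index = int(np.argmax(np.abs(correlation)))
    peak = np.abs(correlation[peak_index])
    noise_floor = np.median(np.abs(correlation)) + 1e-12
    return peak_index / fs, float(peak / noise_floor)

# Synthetic usage: the replica arrives 960 samples (20 ms) into the microphone capture.
rng = np.random.default_rng(2)
replica = rng.choice([-1.0, 1.0], size=4096)
mic = np.concatenate([np.zeros(960), replica]) + 0.1 * rng.standard_normal(960 + 4096)
print(extract_calibration(mic, replica))   # expect a delay of about 0.02 seconds
```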
The acoustic scene metric(s) may be or may include audio device audibility, audio device impulse response, angle between audio devices, audio device location, and/or audio environmental noise. Some disclosed methods may involve controlling one or more aspects of audio device playback based at least in part on the acoustic scene metric(s).
Some disclosed methods may involve orchestrating a plurality of audio devices to perform a method involving calibrating signals. Some such methods may involve causing, by a control system, a first audio device of an audio environment to generate a first calibration signal, causing, by the control system, the first calibration signal to be inserted into a first audio playback signal corresponding to a first content stream to generate a first modified audio playback signal for the first audio device, and causing, by the control system, the first audio device to play back the first modified audio playback signal to generate a first audio device playback sound.
Some such methods may involve causing, by the control system, a second audio device of the audio environment to generate a second calibration signal, causing, by the control system, the second calibration signal to be inserted into the second content stream to generate a second modified audio playback signal for the second audio device, and causing, by the control system, the second audio device to play back the second modified audio playback signal to generate a second audio device playback sound.
Some such embodiments may involve causing, by the control system, at least one microphone of the audio environment to detect at least a first audio device playback sound and a second audio device playback sound and to generate microphone signals corresponding to the at least first audio device playback sound and the second audio device playback sound. Some such methods may involve extracting, by the control system, at least a first calibration signal and a second calibration signal from the microphone signal, and estimating, by the control system, at least one acoustic scene metric based at least in part on the first calibration signal and the second calibration signal.
Fig. 1A illustrates an example of an audio environment. As with the other figures provided herein, the types and amounts of elements shown in FIG. 1A are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements.
According to this example, the audio environment 130 is a living space of a home. In the example shown in fig. 1A, audio devices 100A, 100B, 100C, and 100D are located within audio environment 130. In this example, each of the audio devices 100A-100D includes a corresponding one of the loudspeaker systems 110A, 110B, 110C, and 110D. According to this example, the loudspeaker system 110B of the audio device 100B comprises at least a left loudspeaker 110B1 and a right loudspeaker 110B2. In this case, the audio devices 100A-100D include loudspeakers of various sizes and with various capabilities. At the time shown in FIG. 1A, the audio devices 100A-100D are producing corresponding instances of audio device playback sounds 120A, 120B1, 120B2, 120C, and 120D.
In this example, each of the audio devices 100A-100D includes a corresponding one of the microphone systems 111A, 111B, 111C, and 111D. Each of microphone systems 111A-111D includes one or more microphones. In some examples, the audio environment 130 may include at least one audio device without a loudspeaker system or at least one audio device without a microphone system.
In some cases, at least one acoustic event may be occurring in the audio environment 130. For example, one such acoustic event may be caused by a person speaking, in some cases he may be speaking a voice command. In other cases, the acoustic event may be caused, at least in part, by a variable element such as a door or window of the audio environment 130. For example, when the door is open, sound from outside the audio environment 130 may be more clearly perceived inside the audio environment 130. Furthermore, changing the angle of the gate may change some echo paths within the audio environment 130.
Fig. 1B is a block diagram illustrating an example of components of an apparatus capable of implementing various aspects of the present disclosure. As with the other figures provided herein, the types and amounts of elements shown in FIG. 1B are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements. According to some examples, the apparatus 150 may be configured to perform at least some of the methods disclosed herein. In some implementations, the apparatus 150 may be or may include one or more components of an audio system. For example, in some implementations, the apparatus 150 may be an audio device, such as a smart audio device. In other examples, the apparatus 150 may be a mobile device (such as a cellular telephone), a laptop computer, a tablet device, a television, or other type of device.
In the example shown in fig. 1A, audio devices 100A-100D are examples of apparatus 150. According to some examples, the audio environment 130 of fig. 1A may include an orchestration device, such as a device that may be referred to herein as a smart home hub. A smart home hub (or other orchestration device) may be an example of the apparatus 150. In some implementations, one or more of the audio devices 100A-100D may be capable of functioning as an orchestration device.
According to some alternative embodiments, the apparatus 150 may be or may include a server. In some such examples, the apparatus 150 may be or may include an encoder. Thus, in some cases, the apparatus 150 may be a device configured for use in an audio environment, such as a home audio environment, while in other cases, the apparatus 150 may be a device configured for use in a "cloud", e.g., a server.
In this example, the apparatus 150 includes an interface system 155 and a control system 160. In some implementations, the interface system 155 can include a wired or wireless interface configured to communicate with one or more other devices of the audio environment. In some examples, the audio environment may be a home audio environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automotive environment, a train environment, a street or sidewalk environment, a park environment, and so forth. In some implementations, the interface system 155 can be configured to exchange control information and associated data with an audio device of an audio environment. In some examples, the control information and associated data may relate to one or more software applications being executed by the apparatus 150.
In some implementations, the interface system 155 can be configured to receive or provide a content stream. The content stream may include audio data. The audio data may include, but is not limited to, audio signals. In some cases, the audio data may include spatial data, such as channel data and/or spatial metadata. For example, metadata may have been provided by what is referred to herein as an "encoder". In some examples, the content stream may include video data and audio data corresponding to the video data.
The interface system 155 may include one or more network interfaces and/or one or more external device interfaces, such as one or more Universal Serial Bus (USB) interfaces. According to some embodiments, the interface system 155 may include one or more wireless interfaces configured for communication, for example, via Wi-Fi or Bluetooth™.
In some examples, interface system 155 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system, and/or a gesture sensor system. In some examples, interface system 155 may include one or more interfaces between control system 160 and a memory system, such as optional memory system 165 shown in fig. 1B. However, in some cases, control system 160 may include a memory system. In some implementations, the interface system 155 can be configured to receive input from one or more microphones in an environment.
In some embodiments, the control system 160 may be configured to at least partially perform the methods disclosed herein. Control system 160 may include, for example, a general purpose single or multi-chip processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.
In some implementations, the control system 160 may reside in more than one device. For example, in some implementations, a portion of control system 160 may reside in a device within one of the environments described herein, while another portion of control system 160 may reside in a device outside of the environment, such as a server, mobile device (e.g., smart phone or tablet computer), etc. In other examples, a portion of control system 160 may reside in a device within one of the environments described herein, while another portion of control system 160 may reside in one or more other devices of the environments. For example, control system functionality may be distributed across multiple intelligent audio devices of an environment, or may be shared by orchestration devices of the environment (such as may be referred to herein as a smart home hub) and one or more other devices. In other examples, a portion of control system 160 may reside in a device (such as a server) implementing a cloud-based service, while another portion of control system 160 may reside in another device (such as another server, a memory device, etc.) implementing a cloud-based service. In some examples, the interface system 155 may also reside in more than one device.
Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to Random Access Memory (RAM) devices, read Only Memory (ROM) devices, and the like. One or more non-transitory media may reside, for example, in the optional memory system 165 and/or the control system 160 shown in fig. 1B. Thus, various innovative aspects of the subject matter described in this disclosure can be implemented in one or more non-transitory media having software stored thereon. For example, the software may include instructions for controlling at least one device to perform some or all of the methods disclosed herein. For example, the software may be executed by one or more components of a control system, such as control system 160 of FIG. 1B.
In some examples, the apparatus 150 may include the optional microphone system 111 shown in fig. 1B. The optional microphone system 111 may include one or more microphones. According to some examples, the optional microphone system 111 may include an array of microphones. In some cases, the array of microphones may be configured for receive-side beamforming, e.g., according to instructions from the control system 160. In some examples, the array of microphones may be configured to determine direction of arrival (DoA) and/or time of arrival (ToA) information, e.g., according to instructions from the control system 160. Alternatively or additionally, the control system 160 may be configured to determine direction of arrival (DoA) and/or time of arrival (ToA) information, e.g., from microphone signals received from the microphone system 111.
In some implementations, one or more of the microphones may be part of or associated with another device (such as a speaker of a speaker system, a smart audio device, etc.). In some examples, the apparatus 150 may not include the microphone system 111. However, in some such implementations, the apparatus 150 may still be configured to receive microphone data for one or more microphones in an audio environment via the interface system 160. In some such implementations, a cloud-based implementation of the apparatus 150 may be configured to receive microphone data or data corresponding to microphone data from one or more microphones in an audio environment via the interface system 160.
According to some embodiments, the apparatus 150 may include the optional loudspeaker system 110 shown in fig. 1B. The optional loudspeaker system 110 may include one or more loudspeakers, which may also be referred to herein as "speakers," or more generally as "audio reproduction transducers." In some examples (e.g., cloud-based implementations), the apparatus 150 may not include the loudspeaker system 110.
In some embodiments, the apparatus 150 may include an optional sensor system 180 shown in fig. 1B. The optional sensor system 180 may include one or more touch sensors, gesture sensors, motion detectors, and the like. According to some embodiments, the optional sensor system 180 may include one or more cameras. In some implementations, the camera may be a standalone camera. In some examples, one or more cameras of optional sensor system 180 may reside in a smart audio device, which may be a single-use audio device or a virtual assistant. In some such examples, one or more cameras of optional sensor system 180 may reside in a television, mobile phone, or smart speaker. In some examples, the apparatus 150 may not include the sensor system 180. However, in some such embodiments, the apparatus 150 may still be configured to receive sensor data for one or more sensors in the audio environment via the interface system 160.
In some implementations, the apparatus 150 may include an optional display system 185 shown in fig. 1B. The optional display system 185 may include one or more displays, such as one or more Light Emitting Diode (LED) displays. In some cases, the optional display system 185 may include one or more Organic Light Emitting Diode (OLED) displays. In some examples, optional display system 185 may include one or more displays of a smart audio device. In other examples, the optional display system 185 may include a television display, a laptop display, a mobile device display, or other type of display. In some examples where the apparatus 150 includes the display system 185, the sensor system 180 may include a touch sensor system and/or a gesture sensor system proximate to one or more displays of the display system 185. According to some such embodiments, the control system 160 may be configured to control the display system 185 to present one or more Graphical User Interfaces (GUIs).
According to some such examples, apparatus 150 may be or may include a smart audio device. In some such embodiments, the apparatus 150 may be or may include a wake word detector. For example, the apparatus 150 may be or include a virtual assistant.
Fig. 2 is a block diagram illustrating an example of an audio device element according to some disclosed embodiments. As with the other figures provided herein, the types and amounts of elements shown in fig. 2 are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements. In this example, the audio device 100A of fig. 2 is an example of the apparatus 150 described above with reference to fig. 1B. In this example, the audio device 100A is one of a plurality of audio devices in an audio environment, and may be an example of the audio device 100A shown in fig. 1A in some cases. In this example, the audio environment includes at least two other orchestrated audio devices, audio device 100B and audio device 100C.
According to such an implementation, the audio device 100A includes the following elements:
-110A: an example of a loudspeaker system 110 of fig. 1B, which includes one or more loudspeakers;
-111A: an example of the microphone system 111 of fig. 1B includes one or more microphones;
-120A, B, C: audio device playback sounds corresponding to the rendered content being played back by audio devices 100A-100C in the same acoustic space;
-201A: an audio playback signal output by rendering module 210A;
-202A: a modified audio playback signal output by the calibration signal injector 211A;
-203A: a calibration signal output by the calibration signal generator 212A;
-204A: copies of the calibration signals corresponding to the calibration signals generated by the other audio devices of the audio environment (in this example, at least audio devices 100B and 100C). In some examples, the calibration signal replica 204A may be received from an external source, such as an orchestration device (which may be another audio device of the audio environment, another local device such as a smart home hub, etc.), e.g., via a wireless communication protocol such as Wi-Fi or Bluetooth™;
-205A: calibration information related to and/or used by one or more audio devices in an audio environment. The calibration information 205A may include parameters used by the control system 160 of the audio device 100A to generate a calibration signal, modulate the calibration signal, demodulate the calibration signal, and so forth. In some examples, the calibration information 205A may include one or more DSSS spreading code parameters and one or more DSSS carrier parameters.
The DSSS spreading code parameters may include, for example, DSSS spreading code length information, chip rate information (or chip period information), and the like. One chip period is the time it takes for one chip (bit) of the spreading code to play back. The inverse of the chip period is the chip rate.
The bits in the DSSS spreading code may be referred to as "chips" to indicate that they carry no data (whereas bits ordinarily do). In some cases, the DSSS spreading code parameters may include a pseudorandom number sequence. In some examples, the calibration information 205A may indicate which audio devices are generating acoustic calibration signals. In some examples, the calibration information 205A may be received (e.g., via wireless communication) from an external source such as an orchestration device;
-206A: microphone signals received by microphone(s) 111A;
-208A: a demodulated coherent baseband signal;
-210A: a rendering module configured to render audio signals of a content stream, such as audio data of music, movies, and TV programs, etc., to generate audio playback signals;
-211A: a calibration signal injector configured to insert the calibration signal 230A modulated by the calibration signal modulator 220A into the audio playback signal generated by the rendering module 210A to generate a modified audio playback signal. The insertion process may be, for example, a mixing process in which the modulated calibration signal 230A is mixed with the audio playback signal generated by the rendering module 210A to generate the modified audio playback signal (a minimal mixing sketch is shown after this list).
-212A: a calibration signal generator configured to generate a calibration signal 203A and provide the calibration signal 203A to the calibration signal modulator 220A and the calibration signal demodulator 214A. In some examples, calibration signal generator 212A may include a DSSS spreading code generator and a DSSS carrier generator. In this example, calibration signal generator 212A provides calibration signal replica 204A to calibration signal demodulator 214A;
-214A: an optional calibration signal demodulator configured to demodulate microphone signal 206A received by microphone(s) 111A. In this example, the calibration signal demodulator 214A outputs a demodulated coherent baseband signal 208A. Demodulation of the microphone signal 206A may be performed, for example, using standard correlation techniques, including an integrate-and-dump matched-filter correlator bank.
Some detailed examples are provided below. To improve the performance of these demodulation techniques, in some embodiments, the microphone signal 206A may be filtered prior to demodulation to remove unwanted content/phenomena. According to some embodiments, the demodulated coherent baseband signal 208A may be filtered before being provided to the baseband processor 218A. The signal-to-noise ratio (SNR) typically increases with increasing integration time (with increasing length of the spreading code used). Not all types of calibration signals (e.g., white noise and acoustic signals corresponding to music) need to be modulated before being mixed with the rendered audio data for playback. Thus, some embodiments may not include a calibration signal demodulator;
-218A: a baseband processor configured for baseband processing of the demodulated coherent baseband signal 208A. In some examples, baseband processor 218A may be configured to implement techniques such as non-coherent averaging to improve SNR by reducing the variance of the square waveform to produce a delayed waveform. Some detailed examples are provided below. In this example, the baseband processor 218A is configured to output one or more estimated acoustic scene metrics 225A;
-220A: an optional calibration signal modulator configured to modulate the calibration signal 203A generated by the calibration signal generator to produce the calibration signal 230A. As described elsewhere herein, not all types of calibration signals need to be modulated before being mixed with rendered audio data for playback. Thus, some embodiments may not include a calibration signal modulator;
-225A: one or more observations derived from the calibration signal(s), which are also referred to herein as acoustic scene metrics. The acoustic scene metric(s) 225A may include or may be data corresponding to time of flight, time of arrival, range, audio device audibility, audio device impulse response, angle between audio devices, audio device location, audio environmental noise and/or signal-to-noise ratio;
-233A: an acoustic scene metric processing module configured to receive and apply acoustic scene metrics 225A. In this example, the acoustic scene metric processing module 233A is configured to generate the information 235A (and/or command) based at least in part on the at least one acoustic scene metric 225A and/or the at least one audio device characteristic. Depending on the particular implementation, the audio device characteristic(s) may correspond to the audio device 100A or another audio device of the audio environment. The audio device characteristic(s) may be stored, for example, in a memory of the control system 160, or may be accessible by the control system 160; and
-235A: information (and/or commands) for controlling one or more aspects of audio processing and/or playback by an audio device. For example, information 235A may include information for controlling a rendering process, an audio environment mapping process (such as an audio device auto-localization process), an audio device calibration process, a noise suppression process, and/or an echo attenuation process.
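To make the role of the calibration signal injector 211A concrete, the following is a minimal sketch of the kind of additive mixing described for element 211A. It is illustrative only; the function name, the -30 dB gain, and the use of floating-point sample buffers are assumptions, not details taken from this disclosure.

```python
import numpy as np

def inject_calibration(playback: np.ndarray, calibration: np.ndarray,
                       gain_db: float = -30.0) -> np.ndarray:
    """Mix a modulated calibration signal into a rendered audio playback
    signal to form a modified audio playback signal (cf. element 211A)."""
    gain = 10.0 ** (gain_db / 20.0)            # placeholder calibration level
    n = min(len(playback), len(calibration))
    modified = playback.copy()
    modified[:n] += gain * calibration[:n]     # simple additive mixing
    return np.clip(modified, -1.0, 1.0)        # guard against clipping
```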
Acoustic scene metric example
As described above, in some implementations, the baseband processor 218A (or another module of the control system 160) may be configured to determine one or more acoustic scene metrics 225A. The following are some examples of acoustic scene metrics 225A.
Distance measurement
The calibration signal received by the audio device from the other device contains information about the distance between the two devices in the form of the time of flight (ToF) of the signal. According to some examples, the control system may be configured to extract delay information from the demodulated calibration signal and convert the delay information into pseudorange measurements, e.g., as follows:
ρ=τc
in the above equation, τ represents delay information (also referred to herein as ToF), ρ represents a pseudorange measurement, and c represents the speed of sound. We refer to "pseudoranges" because the distance itself is not measured directly; instead, the distance between devices is estimated from a timing estimate. In a distributed asynchronous system of audio devices, each audio device runs on its own clock, so there is a bias in the raw delay measurements. Given a sufficient set of delay measurements, these biases can sometimes be estimated and resolved. Detailed examples of extracting delay information, generating and using pseudorange measurements, and determining and resolving clock bias are provided below.
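As a simple illustration of the equation above, the sketch below converts an estimated delay into a pseudorange. The function and constant names are hypothetical, and the clock-bias term is shown only to indicate why the result is a pseudorange rather than a true range.

```python
SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature

def pseudorange(delay_s: float, clock_bias_s: float = 0.0) -> float:
    """rho = tau * c; with asynchronous clocks the measured delay includes
    an unknown bias, so the result is a pseudorange, not a true range."""
    return (delay_s + clock_bias_s) * SPEED_OF_SOUND

print(pseudorange(0.010))  # a 10 ms delay corresponds to roughly 3.43 m
```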
DoA
In a manner similar to ranging, using the plurality of microphones available on a listening device, the control system may be configured to estimate a direction of arrival (DoA) by processing the demodulated acoustic calibration signals. In some such embodiments, the resulting DoA information may be used as input to a DoA-based audio device auto-localization method.
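One common way to obtain DoA information from a microphone pair is to estimate the time-difference of arrival (TDOA) of the demodulated calibration signal and convert it to an angle under a far-field assumption. The sketch below is illustrative only and does not reproduce any particular method of this disclosure; the microphone spacing and function names are assumptions.

```python
import numpy as np

def tdoa_by_correlation(x1: np.ndarray, x2: np.ndarray, fs: float) -> float:
    """Estimate the TDOA (seconds) between two microphone captures of the
    same demodulated calibration signal from the cross-correlation peak."""
    corr = np.correlate(x1, x2, mode="full")
    lag = int(np.argmax(corr)) - (len(x2) - 1)
    return lag / fs

def doa_from_tdoa(tdoa_s: float, mic_spacing_m: float,
                  speed_of_sound: float = 343.0) -> float:
    """Far-field DoA (radians) for a two-microphone pair."""
    sin_theta = np.clip(tdoa_s * speed_of_sound / mic_spacing_m, -1.0, 1.0)
    return float(np.arcsin(sin_theta))
```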
Audibility
The signal strength of the demodulated acoustic calibration signal is proportional to the audibility, at the listening audio device, of the transmitting audio device in the frequency band in which that audio device transmits the acoustic calibration signal. In some embodiments, the control system may be configured to make multiple observations in different frequency bands to obtain a banded estimate across the entire frequency range. Knowing the digital signal level of the transmitting audio device, in some examples, the control system may be configured to estimate the absolute acoustic gain of the transmitting audio device.
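A banded audibility estimate of the kind described above can be sketched as a ratio of received to transmitted levels per band. This is a simplified illustration under the assumption that both levels are available as per-band powers; the names are hypothetical.

```python
import numpy as np

def band_gain_db(received_band_power: np.ndarray,
                 transmitted_band_power: np.ndarray) -> np.ndarray:
    """Per-band acoustic gain estimate in dB: power of the demodulated
    calibration signal relative to the known digital level that the
    transmitting device played back in the same band."""
    return 10.0 * np.log10(received_band_power / transmitted_band_power)
```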
Fig. 3 is a block diagram illustrating an example of an audio device element according to another disclosed embodiment. As with the other figures provided herein, the types and amounts of elements shown in fig. 3 are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements. In this example, the audio device 100A of fig. 3 is an example of the apparatus 150 described above with reference to fig. 1B and 2. However, according to such an embodiment, the audio device 100A is configured to orchestrate a plurality of audio devices in an audio environment, including at least the audio devices 100B, 100C, and 100D.
The embodiment shown in fig. 3 includes all elements of fig. 2 as well as some additional elements. Elements common to fig. 2 and 3 are not described herein, except that their functions may differ in the embodiment of fig. 3. According to such an implementation, the audio device 100A includes the following elements and functions:
-120A, B, C, D: audio device playback sounds corresponding to the rendered content being played back by audio devices 100A-100D in the same acoustic space;
-204A, B, C, D: a copy of the calibration signal corresponding to the calibration signal generated by the other audio devices of the audio environment (in this example, at least audio devices 100B, 100C, and 100D). In this example, the calibration signal copies 204A-204D are provided by the orchestration module 213A. Here, orchestration module 213A provides calibration information 204B-204D to audio devices 100B-100D, e.g., via wireless communication;
-205A, B, C, D: these elements correspond to calibration information associated with and/or used by each of the audio devices 100A-100D. The calibration information 205A may include parameters (such as one or more DSSS spreading code parameters and one or more DSSS carrier parameters) that are used by the control system 160 of the audio device 100A to generate a calibration signal, to modulate the calibration signal, to demodulate the calibration signal, and so on. The calibration information 205B, 205C, and 205D may include parameters (e.g., one or more DSSS spreading code parameters and one or more DSSS carrier parameters) used by the audio devices 100B, 100C, and 100D, respectively, to generate a calibration signal, to modulate the calibration signal, to demodulate the calibration signal, etc. In some examples, the calibration information 205A-205D may indicate which audio devices are generating acoustic calibration signals;
-213A: an orchestration module. In this example, orchestration module 213A generates calibration information 205A-205D, provides calibration information 205A to calibration signal generator 212A, provides calibration information 205A-205D to a calibration signal demodulator, and provides calibration information 205B-205D to audio devices 100B-100D, e.g., via wireless communication. In some examples, orchestration module 213A generates calibration information 205A-205D based at least in part on information 235A-235D and/or acoustic scene metrics 225A-225D;
-214A: a calibration signal demodulator configured to demodulate at least microphone signal 206A received by microphone(s) 111A. In this example, the calibration signal demodulator 214A outputs a demodulated coherent baseband signal 208A. In some alternative implementations, the calibration signal demodulator 214A may receive and demodulate the microphone signals 206B-206D from the audio devices 100B-100D and may output the demodulated coherent baseband signals 208B-208D;
-218A: a baseband processor configured for baseband processing of at least the demodulated coherent baseband signal 208A and, in some examples, for baseband processing of the demodulated coherent baseband signals 208B-208D received from the audio devices 100B-100D. In this example, the baseband processor 218A is configured to output one or more estimated acoustic scene metrics 225A-225D. In some implementations, the baseband processor 218A is configured to determine the acoustic scene metrics 225B-225D based on the demodulated coherent baseband signals 208B-208D received from the audio devices 100B-100D. However, in some cases, the baseband processor 218A (or the acoustic scene metric processing module 233A) may receive the acoustic scene metrics 225B-225D from the audio devices 100B-100D;
-233A: an acoustic scene metric processing module configured to receive and apply acoustic scene metrics 225A-225D. In this example, the acoustic scene metric processing module 233A is configured to generate the information 235A-235D based at least in part on the acoustic scene metrics 225A-225D and/or at least one audio device characteristic. The audio device characteristic(s) may correspond to one or more of the audio device 100A and/or the audio devices 100B-100D.
Fig. 4 is a block diagram illustrating an example of an audio device element according to another disclosed embodiment. As with the other figures provided herein, the types and amounts of elements shown in fig. 4 are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements. In this example, the audio device 100A of fig. 4 is an example of the apparatus 150 described above with reference to fig. 1B, 2, and 3. The embodiment shown in fig. 4 includes all elements of fig. 3 as well as additional elements. Elements in common with figs. 2 and 3 are not described again here, except where their functions differ in the embodiment of fig. 4.
According to such an embodiment, the control system 160 is configured to process the received microphone signal 206A to produce a preprocessed microphone signal 207A. In some implementations, processing the received microphone signal may involve applying a band pass filter and/or echo cancellation. In this example, the control system 160 (and more specifically, the calibration signal demodulator 214A) is configured to extract the calibration signal from the preprocessed microphone signal 207A.
According to this example, microphone system 111A includes a microphone array, which in some cases may be or include one or more directional microphones. In such an embodiment, processing the received microphone signal involves receive-side beamforming, in this example via beamformer 215A. In this example, the preprocessed microphone signal 207A output by the beamformer 215A is or includes a spatial microphone signal.
In such an embodiment, the calibration signal demodulator 214A processes the spatial microphone signal, which may enhance the performance of an audio system in which the audio devices are spatially distributed around the audio environment. Receive side beamforming is one way to solve the aforementioned "near/far problem": for example, the control system 160 may be configured to use beamforming to compensate for closer and/or louder audio devices in order to receive audio device playback sounds from more distant and/or less loud audio devices.
For example, receive-side beamforming may involve delaying the signal from each microphone in the array of microphones by a different amount and summing the results. In some examples, beamformer 215A may apply a Dolph-Chebyshev weighting pattern. However, in other embodiments, the beamformer 215A may apply a different weighting pattern. According to some such examples, a main lobe may be generated, as well as nulls and side lobes. In addition to controlling the main lobe width (beam width) and side lobe level, the location of the null may also be controlled in some examples.
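The following sketch shows a basic delay-and-sum receive beamformer of the kind the preceding paragraph describes. It is a simplified, integer-sample-delay illustration, not the beamformer 215A itself; a Dolph-Chebyshev taper could be supplied via the optional weights argument.

```python
import numpy as np

def delay_and_sum(mic_signals: np.ndarray, delays_s: np.ndarray,
                  fs: float, weights: np.ndarray = None) -> np.ndarray:
    """Steer a microphone array by delaying each channel and summing.

    mic_signals: (num_mics, num_samples) array of captured samples.
    delays_s:    per-microphone steering delays in seconds.
    weights:     optional per-microphone taper (e.g. Dolph-Chebyshev).
    """
    num_mics, num_samples = mic_signals.shape
    if weights is None:
        weights = np.ones(num_mics) / num_mics
    out = np.zeros(num_samples)
    for m in range(num_mics):
        shift = int(round(delays_s[m] * fs))   # integer-sample approximation
        out += weights[m] * np.roll(mic_signals[m], shift)
    return out
```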
Sub-audible signals
According to some embodiments, the calibration signal component of the audio device playback sound may be inaudible to a person in the audio environment. In some such implementations, the content stream component of the audio device playback sound may perceptually mask the calibration signal component of the audio device playback sound.
Fig. 5 is a diagram showing examples of the levels of a content stream component of audio device playback sound and a DSSS signal component of audio device playback sound over a frequency range. In this example, curve 501 corresponds to the level of the content stream component and curve 530 corresponds to the level of the DSSS signal component.
DSSS signals typically include data, carrier signals, and spreading codes. If we omit the need to transmit data over the channel, we can express the modulated signal s (t) as follows:
s(t) = A·C(t)·sin(2πf₀t)
In the above equation, A represents the amplitude of the DSSS signal, C(t) represents the spreading code, and sin(2πf₀t) represents a sinusoidal carrier with carrier frequency f₀ Hz. Curve 530 in fig. 5 corresponds to an example of s(t) in the above equation.
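The modulated signal s(t) above can be synthesized directly. The sketch below generates one code period of a BPSK-style DSSS calibration signal from a ±1 spreading code; the parameter names and values are illustrative assumptions.

```python
import numpy as np

def dsss_signal(code: np.ndarray, chip_rate_hz: float, f0_hz: float,
                fs_hz: float, amplitude: float = 1.0) -> np.ndarray:
    """Generate s(t) = A * C(t) * sin(2*pi*f0*t) for one period of `code`,
    a +/-1 spreading sequence held constant over each chip."""
    samples_per_chip = int(round(fs_hz / chip_rate_hz))
    c_t = np.repeat(code, samples_per_chip)            # C(t)
    t = np.arange(len(c_t)) / fs_hz
    return amplitude * c_t * np.sin(2.0 * np.pi * f0_hz * t)
```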
One of the potential advantages of some disclosed embodiments involving acoustic DSSS signals is that spreading the signal may reduce the perceptibility of the DSSS signal component of the audio device playback sound, because for a given amount of energy in the acoustic DSSS signal, spreading reduces the amplitude of the DSSS signal component.
This allows us to place the DSSS signal component of the audio device playback sound (e.g., as shown by curve 530 of fig. 5) at a level sufficiently lower than the content stream component level of the audio device playback sound (e.g., as shown by curve 501 of fig. 5) such that the DSSS signal component is imperceptible to a listener.
Some disclosed embodiments exploit masking characteristics of the human auditory system to optimize parameters of the calibration signal in a manner that maximizes the signal-to-noise ratio (SNR) of the observations derived from the calibration signal and/or reduces the probability that the calibration signal component will be perceived. Some disclosed examples relate to applying weights to the level of content stream components and/or applying weights to the level of calibration signal components. Some such examples apply a noise compensation method, wherein the acoustic calibration signal component is treated as the signal and the content stream component is treated as noise. Some such examples involve applying one or more weights according to (e.g., in proportion to) a play/listen target metric.
DSSS spreading code
As described elsewhere herein, in some examples, the calibration information 205 provided by the orchestration device (e.g., the information provided by orchestration module 213A described above with reference to fig. 3) may include one or more DSSS spreading code parameters.
The spreading code used to spread the carrier to create the DSSS signal(s) may be important. The set of DSSS spreading codes is preferably selected such that the corresponding DSSS signal has the following characteristics:
1. sharp main lobes in the autocorrelation waveform;
2. low side lobes at non-zero delays in the autocorrelation waveform;
3. low cross-correlation between any two spreading codes within the set of spreading codes, if multiple devices are to access the medium simultaneously (e.g., to simultaneously play back modified audio playback signals containing DSSS signal components); and
4. DSSS signals that are unbiased (i.e., have a zero DC component).
Families of spreading codes (e.g., the Gold codes commonly used in the GPS context) typically exhibit the four properties above. If multiple audio devices are simultaneously playing back modified audio playback signals containing DSSS signal components, and each audio device uses a different spreading code (with good cross-correlation properties, e.g., low cross-correlation), then a receiving audio device should be able to simultaneously receive and process all of the acoustic DSSS signals using a code-domain multiple access (CDMA) method. By using the CDMA method, multiple audio devices may in some cases transmit acoustic DSSS signals simultaneously using a single frequency band. The spreading codes may be generated at run-time and/or pre-generated and stored in memory, for example, in a data structure such as a look-up table.
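The autocorrelation and cross-correlation properties listed above can be checked numerically. The sketch below uses random ±1 sequences as stand-ins for a properly designed family such as Gold codes, so it only approximates the desired properties; it is not a Gold-code generator.

```python
import numpy as np

rng = np.random.default_rng(0)
code_a = rng.choice([-1.0, 1.0], size=1023)   # stand-in for a real spreading code
code_b = rng.choice([-1.0, 1.0], size=1023)

def circular_correlation(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Circular correlation, normalized so a code's zero-lag value is 1."""
    return np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(y))).real / len(x)

auto = circular_correlation(code_a, code_a)    # sharp main lobe at lag 0
cross = circular_correlation(code_a, code_b)   # ideally small at every lag
print(auto[0], np.abs(auto[1:]).max(), np.abs(cross).max())
```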
To implement DSSS, Binary Phase Shift Keying (BPSK) modulation may be utilized in some examples. Further, in some examples, DSSS spreading codes may be placed in quadrature (interplexed) with each other to implement a Quadrature Phase Shift Keying (QPSK) system, e.g., as follows:
s(t) = A_I·C_I(t)·cos(2πf₀t) + A_Q·C_Q(t)·sin(2πf₀t)
In the above equation, A_I and A_Q represent the amplitudes of the in-phase and quadrature signals, respectively, C_I and C_Q represent the code sequences of the in-phase and quadrature signals, respectively, and f₀ represents the center frequency of the DSSS signal. The foregoing are examples of parameterizing the DSSS carrier and the DSSS spreading codes according to some examples. These parameters are examples of the calibration signal information 205 described above. As described above, the calibration signal information 205 may be provided by an orchestration device, such as orchestration module 213A, and may be used, for example, by the signal generator block 212 to generate DSSS signals.
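The quadrature form above can be generated in the same way as the BPSK example, with separate in-phase and quadrature codes on the cosine and sine carriers. Again this is a sketch with assumed parameter names, not a specific implementation from this disclosure.

```python
import numpy as np

def qpsk_dsss(code_i: np.ndarray, code_q: np.ndarray, chip_rate_hz: float,
              f0_hz: float, fs_hz: float,
              a_i: float = 1.0, a_q: float = 1.0) -> np.ndarray:
    """s(t) = A_I*C_I(t)*cos(2*pi*f0*t) + A_Q*C_Q(t)*sin(2*pi*f0*t)."""
    spc = int(round(fs_hz / chip_rate_hz))
    c_i = np.repeat(code_i, spc)
    c_q = np.repeat(code_q, spc)
    n = min(len(c_i), len(c_q))
    t = np.arange(n) / fs_hz
    return (a_i * c_i[:n] * np.cos(2.0 * np.pi * f0_hz * t)
            + a_q * c_q[:n] * np.sin(2.0 * np.pi * f0_hz * t))
```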
Fig. 6 is a graph showing an example of the power of two calibration signals having different bandwidths but at the same center frequency. In these examples, fig. 6 shows the spectra of two calibration signals 630A and 630B, both centered on the same center frequency 605. In some examples, the calibration signal 630A may be generated by one audio device of the audio environment (e.g., by the audio device 100A) and the calibration signal 630B may be generated by another audio device of the audio environment (e.g., by the audio device 100B).
According to this example, the calibration signal 630B is chipped at a higher rate than the calibration signal 630A (in other words, a greater number of chips per second is used in the spreading signal), resulting in the bandwidth 610B of the calibration signal 630B being greater than the bandwidth 610A of the calibration signal 630A. For a given amount of energy per calibration signal, the greater bandwidth of calibration signal 630B results in the amplitude and perceptibility of calibration signal 630B being relatively lower than those of calibration signal 630A. A higher-bandwidth calibration signal also yields higher delay resolution in the baseband data products, resulting in higher-resolution acoustic scene metric estimates (such as time-of-flight estimates, time-of-arrival (ToA) estimates, range estimates, direction-of-arrival (DoA) estimates, etc.) based on the calibration signals. However, a higher-bandwidth calibration signal also increases the noise bandwidth of the receiver, thereby reducing the SNR of the extracted acoustic scene metric. Furthermore, if the bandwidth of the calibration signal is too large, coherence and fading problems associated with the calibration signal may occur.
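The bandwidth/resolution trade-off described above can be made concrete with rough numbers. The sketch below assumes the usual sinc-shaped DSSS spectrum, whose main-lobe width is about twice the chip rate and whose delay resolution is about one chip period; the specific chip rates are arbitrary examples, not values from this disclosure.

```python
# Rough main-lobe bandwidth and delay resolution for two example chip rates.
for chip_rate_hz in (500.0, 2000.0):
    main_lobe_bw_hz = 2.0 * chip_rate_hz        # ~ null-to-null main lobe
    delay_resolution_s = 1.0 / chip_rate_hz     # ~ one chip period
    print(f"{chip_rate_hz:7.0f} Hz chips -> "
          f"{main_lobe_bw_hz:7.0f} Hz main lobe, "
          f"{delay_resolution_s * 1e3:5.2f} ms delay resolution")
```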
The length of the spreading code used to generate the DSSS signal limits the amount of cross-correlation suppression. For example, a 10-bit Gold code suppresses neighboring codes by only -26 dB. This may lead to the near/far problem described above, where a relatively low-amplitude signal may be masked by the cross-correlation noise of another, louder signal. Similar problems may occur with respect to other types of calibration signals. Some of the novelty of the systems and methods described in this disclosure relates to orchestration schemes designed to alleviate or avoid such problems.
Orchestration methods
FIG. 7 illustrates elements of an orchestration module according to one example. As with the other figures provided herein, the types and amounts of elements shown in fig. 7 are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements. According to some examples, orchestration module 213 may be implemented by the example of apparatus 150 described above with reference to fig. 1B. In some such examples, orchestration module 213 may be implemented by an instance of control system 160. In some examples, orchestration module 213 may be an example of the orchestration module described above with reference to fig. 3.
According to such an embodiment, orchestration module 213 comprises a perception model application module 710, an acoustic model application module 711, and an optimization module 712.
In this example, the perceptual model application module 710 is configured to apply a model of the human auditory system to make one or more perceptual impact estimates 702 of the perceptual impact of the acoustic calibration signal on a listener in the acoustic space based at least in part on the prior information 701. The acoustic space may be, for example, an audio environment in which the orchestration module 213 is to be orchestrated with the audio device, a room of such an audio environment, etc. The estimate(s) 702 may change over time. In some examples, perceptual impact estimate(s) 702 may be an estimate of a listener's ability to perceive an acoustic calibration signal based on, for example, the type and level of audio content (if any) currently being played back in the acoustic space. The perceptual model application module 710 may be configured, for example, to apply one or more auditory masking models, such as masking as a function of frequency and loudness, spatial auditory masking, and the like. The perception model application module 710 may be configured, for example, to apply one or more human loudness perception models, e.g., human loudness perception as a function of frequency.
According to some examples, prior information 701 may be or may include information related to an acoustic space, information related to transmission of an acoustic calibration signal in an acoustic space, and/or information related to a listener known to use an acoustic space. For example, the prior information 701 may include information regarding the number of audio devices (e.g., programmed audio devices) in the acoustic space, the location of the audio devices, the loudspeaker system and/or microphone system capabilities of the audio devices, information related to impulse responses of the audio environment, information related to one or more doors and/or windows of the audio environment, information related to audio content currently being played back in the acoustic space, and so forth. In some cases, the prior information 701 may include information regarding the hearing abilities of one or more listeners.
In such an embodiment, the acoustic model application module 711 is configured to make one or more acoustic calibration signal performance estimates 703 for the acoustic calibration signals in the acoustic space based at least in part on the prior information 701. For example, the acoustic model application module 711 may be configured to estimate how well the microphone system of each audio device can detect acoustic calibration signals from other audio devices in the acoustic space, which may be referred to herein as an aspect of audio device "mutual audibility". In some cases, such mutual audibility may be an acoustic scene metric previously estimated by the baseband processor based at least in part on a previously received acoustic calibration signal. In some such embodiments, the mutual audibility estimate may be part of the a priori information 701, and in some such embodiments, the orchestration module 213 may not include the acoustic model application module 711. However, in some embodiments, the mutual audibility estimation may be performed independently by the acoustic model application module 711.
In this example, optimization module 712 is configured to determine calibration parameters 705 for all audio devices that are programmed by programming module 213 based at least in part on perceptual impact estimate(s) 702 and acoustic calibration signal performance estimate 703, as well as current play/listen target information 704. The current play/listen target information 704 may, for example, indicate the relative need for new acoustic scene metrics based on the acoustic calibration signal.
For example, if one or more audio devices are newly powered on in an acoustic space, the demand level for new acoustic scene metrics related to automatic positioning of the audio devices, mutual audibility of the audio devices, etc. may be high. At least some of the new acoustic scene metrics may be based on the acoustic calibration signal. Similarly, if an existing audio device has moved within the acoustic space, the demand level for new acoustic scene metrics may be high. Also, if a new noise source is in or near the acoustic space, the level of demand for determining new acoustic scene metrics may be high.
If the current play/listen target information 704 indicates a high level of demand for determining new acoustic scene metrics, the optimization module 712 may be configured to determine the calibration parameters 705 by placing relatively higher weights on the acoustic calibration signal performance estimate(s) 703 than on the perceptual impact estimate(s) 702. For example, the optimization module 712 may be configured to determine the calibration parameters 705 by emphasizing the ability of the system to produce high-SNR observations of the acoustic calibration signal and de-emphasizing the impact/perceptibility of the acoustic calibration signal to the user. In some such examples, the calibration parameters 705 may correspond to an audible acoustic calibration signal.
However, if no recent changes are detected in or near the acoustic space and there is at least an initial estimate of one or more acoustic scene metrics, then the demand level for new acoustic scene metrics may not be high. If no recent change is detected in or near the acoustic space, there is at least a preliminary estimate of one or more acoustic scene metrics, and audio content is currently being rendered within the acoustic space, then the relative importance of immediately estimating one or more new acoustic scene metrics may be further reduced.
If the current play/listen target information 704 indicates that the level of demand for determining new acoustic scene metrics is low, the optimization module 712 may be configured to determine the calibration parameters 705 by placing relatively lower weights on the acoustic calibration signal performance estimate(s) 703 than on the perceptual impact estimate(s) 702. In such examples, the optimization module 712 may be configured to determine the calibration parameters 705 by de-emphasizing the ability of the system to produce high-SNR observations of the acoustic calibration signal and emphasizing the impact/perceptibility of the acoustic calibration signal to the user. In some such examples, the calibration parameters 705 may correspond to a sub-audible acoustic calibration signal.
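A toy sketch of the weighting described in the last two paragraphs follows. The scoring function and candidate parameter sets are entirely hypothetical; they only illustrate how a single demand value could trade calibration-signal performance against perceptual impact.

```python
def score(perf: float, impact: float, demand: float) -> float:
    """demand in [0, 1]: high demand weights performance, low demand
    weights imperceptibility (lower impact is better)."""
    return demand * perf - (1.0 - demand) * impact

candidates = [
    {"name": "audible, wideband", "perf": 0.9, "impact": 0.8},
    {"name": "sub-audible, narrowband", "perf": 0.4, "impact": 0.1},
]
best = max(candidates, key=lambda c: score(c["perf"], c["impact"], demand=0.2))
print(best["name"])   # with low demand, the sub-audible candidate wins
```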
As described later in this document (e.g., in other examples of audio device orchestration), the parameters of the acoustic calibration signal provide a rich diversity of ways in which the orchestration device can modify the acoustic calibration signal to enhance the performance of the audio system.
Fig. 8 shows another example audio environment. In fig. 8, audio devices 100B and 100C are separated from device 100A by distances 810 and 811, respectively. In this particular case, distance 811 is greater than distance 810. Assuming that audio devices 100B and 100C are producing audio device playback sounds at approximately the same level, this means that the acoustic calibration signal received by audio device 100A from audio device 100C is at a lower level than the acoustic calibration signal from audio device 100B because of the additional acoustic loss caused by the longer distance 811. In some embodiments, audio devices 100B and 100C may be arranged to enhance the ability of audio device 100A to extract acoustic calibration signals and determine acoustic scene metrics based on the acoustic calibration signals.
Fig. 9 shows an example of acoustic calibration signals generated by the audio devices 100B and 100C of fig. 8. In this example, the acoustic calibration signals have the same bandwidth and are located at the same frequency, but have different amplitudes. Here, the acoustic calibration signal 230B is generated by the audio device 100B and the acoustic calibration signal 230C is generated by the audio device 100C. According to this example, the peak power of the acoustic calibration signal 230B is 905B and the peak power of the acoustic calibration signal 230C is 905C. Here, the acoustic calibration signal 230B and the acoustic calibration signal 230C have the same center frequency 901.
In this example, the orchestration device (which in some examples may include an instance of orchestration module 213 of fig. 7 and which in some examples may be audio device 100A of fig. 8) has enhanced the ability of audio device 100A to extract the acoustic calibration signals by equalizing the digital levels of the acoustic calibration signals generated by audio devices 100B and 100C, such that the peak power of acoustic calibration signal 230C is greater than the peak power of acoustic calibration signal 230B by a factor that counteracts the difference in acoustic losses due to the different distances 810 and 811. Thus, according to this example, audio device 100A receives the acoustic calibration signal 230C from audio device 100C at approximately the same level as the acoustic calibration signal 230B received from audio device 100B, despite the additional acoustic loss caused by the longer distance 811.
The surface area around a point source increases with the square of the distance from the source. This means that the same acoustic energy from the source is spread over a larger area according to the inverse square law, and the energy intensity decreases with the square of the distance from the source. Denoting distance 810 as b and distance 811 as c, the acoustic energy received by audio device 100A from audio device 100B is proportional to 1/b², and the acoustic energy received by audio device 100A from audio device 100C is proportional to 1/c². The difference in acoustic energy is proportional to 1/(c² - b²). Thus, in some implementations, the orchestration device may scale the energy produced by the audio device 100C by a factor related to (c² - b²). This is an example of how calibration parameters may be altered to enhance performance.
In some embodiments, the optimization process may be more complex and may take into account more factors than the inverse square law. In some examples, equalization may be accomplished via a full-band gain applied to the calibration signal or via an equalization (EQ) curve that compensates for the non-flat (frequency-dependent) response of the loudspeaker system 110A.
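For the simple free-field case discussed above, the equalizing full-band gain follows directly from the inverse square law. The sketch below covers only that simple case (20·log10 of the distance ratio); as noted, a practical optimization may account for more factors.

```python
import math

def equalization_gain_db(distance_near_m: float, distance_far_m: float) -> float:
    """Gain (dB) to apply to the farther device so that both calibration
    signals arrive at roughly the same level under inverse-square spreading."""
    return 20.0 * math.log10(distance_far_m / distance_near_m)

# e.g. distances of 2 m and 3 m -> boost the farther device by ~3.5 dB
print(equalization_gain_db(2.0, 3.0))
```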
Fig. 10 is a diagram providing an example of a Time Domain Multiple Access (TDMA) method. One way to avoid near/far problems is to schedule the plurality of audio devices that are transmitting and receiving acoustic calibration signals such that each audio device is assigned a different time slot in which to play its acoustic calibration signal. This is referred to as a TDMA method. In the example shown in fig. 10, the orchestration device causes the audio devices 1, 2 and 3 to transmit acoustic calibration signals according to the TDMA method. In this example, audio devices 1, 2 and 3 transmit acoustic calibration signals in the same frequency band. According to this example, the orchestration device causes the audio device 3 to transmit an acoustic calibration signal from time t₀ to time t₁, after which the orchestration device causes the audio device 2 to transmit an acoustic calibration signal from time t₁ to time t₂, after which the orchestration device causes the audio device 1 to transmit an acoustic calibration signal from time t₂ to time t₃, and so on.
Thus, in this example, no two calibration signals are transmitted or received simultaneously. Accordingly, the remaining calibration signal parameters, such as amplitude, bandwidth and length (long enough that each calibration signal remains within its assigned time slot), are independent of the multiple-access scheme. However, such calibration signal parameters do remain relevant to the quality of the observations extracted from the calibration signals.
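A TDMA schedule of the kind shown in fig. 10 can be represented very simply. The sketch below assigns consecutive, non-overlapping slots to a list of device identifiers; the slot duration and device names are placeholders.

```python
def tdma_schedule(device_ids, slot_duration_s, start_time_s=0.0):
    """Assign each orchestrated device a non-overlapping time slot in which
    to play back its acoustic calibration signal."""
    schedule = {}
    t = start_time_s
    for dev in device_ids:
        schedule[dev] = (t, t + slot_duration_s)
        t += slot_duration_s
    return schedule

print(tdma_schedule(["device_3", "device_2", "device_1"], slot_duration_s=1.5))
```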
Fig. 11 is a diagram showing an example of a Frequency Domain Multiple Access (FDMA) method. In some implementations (e.g., due to limited bandwidth of the calibration signal), the orchestration device may be configured to cause an audio device to receive acoustic calibration signals from two other audio devices in the audio environment simultaneously. In some such examples, each audio device transmitting an acoustic calibration signal plays its respective acoustic calibration signal in a different frequency band, so that the acoustic calibration signals can be separated even if they differ significantly in received power level. This is the FDMA method. In the FDMA method example shown in fig. 11, the calibration signals 230B and 230C are transmitted simultaneously by different audio devices, but with different center frequencies (f₁ and f₂) and different frequency bands (b₁ and b₂). In this example, the main-lobe bands b₁ and b₂ do not overlap. Such FDMA approaches may be advantageous in cases where the acoustic calibration signals have large differences in the acoustic losses associated with their paths.
In some implementations, the orchestration device may be configured to switch among FDMA, TDMA, and CDMA methods in order to mitigate near/far issues. In some DSSS examples, the length of the DSSS spreading code may be altered according to the relative audibility of the devices in the room. As described above with reference to fig. 6, given the same energy in an acoustic DSSS signal, if the spreading code increases the bandwidth of the acoustic DSSS signal, the acoustic DSSS signal will have a relatively low maximum power and will be relatively inaudible. Alternatively or additionally, in some embodiments, the calibration signals may be arranged orthogonal to each other. Some such embodiments allow the system to have DSSS signals with different spreading code lengths at the same time. Alternatively or additionally, in some embodiments, the energy in each calibration signal may be modified to reduce the effects of near/far issues (e.g., to increase the level of acoustic calibration signals produced by less loud and/or more distant transmitting audio devices) and/or to obtain an optimal signal-to-noise ratio for a given operational objective.
Fig. 12 is a diagram showing another example of an orchestration method. The elements of fig. 12 are as follows:
1210, 1211 and 1212: frequency bands that do not overlap one another;
230Ai, bi and Ci: a plurality of acoustic calibration signals time-domain multiplexed within the frequency band 1210. While it appears that audio devices 1, 2, and 3 are using different portions of frequency band 1210, in this example acoustic calibration signals 230Ai, bi, and Ci extend across most or all of frequency band 1210;
230D and E: a plurality of acoustic calibration signals code-domain multiplexed within the frequency band 1211. While it appears that audio devices 4 and 5 are using different portions of frequency band 1211, in this example acoustic calibration signals 230D and 230E extend across most or all of frequency band 1211; and
230Aii, Bii and Cii: a plurality of acoustic calibration signals code-domain multiplexed within frequency band 1212. While it appears that audio devices 1, 2, and 3 are using different portions of frequency band 1212, in this example acoustic calibration signals 230Aii, Bii, and Cii extend across most or all of frequency band 1212.
Fig. 12 shows an example of how TDMA, FDMA and CDMA may be used together in some embodiments of the invention. In band 1 (1210), TDMA is used to schedule acoustic calibration signals 230Ai, bi, and Ci, respectively, to be transmitted by audio devices 1-3. The frequency band 1210 is a single frequency band in which the acoustic calibration signals 230Ai, bi, and Ci cannot fit simultaneously without overlapping.
In band 2 (1211), CDMA is used to orchestrate acoustic calibration signals 230D and E from audio devices 4 and 5, respectively. In this particular example, acoustic calibration signal 230D is longer in time than acoustic calibration signal 230E. If the audio device 5 is louder than the audio device 4, the shorter calibration signal duration of the audio device 5 may be useful from the point of view of the receiving audio device if the shorter calibration signal duration corresponds to an increase in bandwidth and a lower calibration signal peak power. The signal-to-noise ratio (SNR) may also increase with the relatively longer duration of the acoustic calibration signal 230D.
In band 3 (1212), CDMA is used to orchestrate acoustic calibration signals 230Aii, Bii, and Cii, which are transmitted by audio devices 1-3, respectively. These acoustic calibration signals are alternative calibration signals used by audio devices 1-3, transmitted at the same time as those same audio devices transmit the TDMA-orchestrated acoustic calibration signals in frequency band 1210. This is a form of FDMA in which longer calibration signals are located in one frequency band (1212) and transmitted simultaneously (without TDMA), while shorter calibration signals are located in another frequency band (1210) using TDMA.
Fig. 13 is a diagram showing another example of an orchestration method. According to this embodiment, the audio device 4 is transmitting acoustic calibration signals 230Di and 230Dii that are orthogonal to each other, while the audio device 5 is transmitting acoustic calibration signals 230Ei and 230Eii that are also orthogonal to each other. According to this example, all acoustic calibration signals are transmitted simultaneously within a single frequency band 1310. In this case, the quadrature acoustic calibration signals 230Di and 230Ei are longer than the in-phase calibration signals 230Dii and 230Eii transmitted by the two audio devices. This results in each audio device having a faster but noisier set of observations derived from the acoustic calibration signals 230Dii and 230Eii, in addition to a higher-SNR set of observations derived from the acoustic calibration signals 230Di and 230Ei, albeit at a lower update rate. This is an example of a CDMA-based orchestration method in which two audio devices are sending acoustic calibration signals designed for an acoustic space shared by the two audio devices. In some cases, the orchestration method may also be based at least in part on the current listening objective.
Fig. 14 illustrates elements of an audio environment according to another example. In this example, the audio environment 1401 is a multi-room residence that includes acoustic spaces 130A, 130B, and 130C. According to this example, doors 1400A and 1400B can change the acoustic coupling between the acoustic spaces. For example, if door 1400A is open, acoustic spaces 130A and 130C are acoustically coupled to at least some extent, whereas if door 1400A is closed, acoustic spaces 130A and 130C are not acoustically coupled to any significant extent. In some implementations, the orchestration device may be configured to detect that a door is open (or that another acoustic obstacle has moved) based on detecting or not detecting audio device playback sounds in the adjacent acoustic space.
In some examples, the orchestration device may orchestrate all of the audio devices 100A-100E in all of the acoustic spaces 130A, 130B, and 130C. However, because there is a significant level of acoustic isolation between the acoustic spaces 130A, 130B, and 130C when the doors 1400A and 1400B are closed, in some examples, the orchestration device may treat the acoustic spaces 130A, 130B, and 130C as independent when the doors 1400A and 1400B are closed. In some examples, the orchestration device may treat acoustic spaces 130A, 130B, and 130C as independent, even when doors 1400A and 1400B are open. However, in some cases, the orchestration device may manage audio devices near doors 1400A and/or 1400B such that when the acoustic space is coupled due to the door opening, the audio devices near the open door are considered to correspond to the rooms on both sides of the door. For example, if the orchestration device determines that the door 1400A is open, the orchestration device may be configured to treat the audio device 100C as being both an audio device of the acoustic space 130A and an audio device of the acoustic space 130C.
Fig. 15 is a flowchart outlining another example of the disclosed audio device orchestration methods. The blocks of method 1500, as with other methods described herein, are not necessarily performed in the order indicated. Furthermore, such methods may include more or fewer blocks than those shown and/or described. The method 1500 may be performed by a system comprising an orchestration device and orchestrated audio devices. The system may include examples of the apparatus 150 shown in fig. 1B and described above, one of which is configured as an orchestration device. In some examples, the orchestration device may include an instance of the orchestration module 213 disclosed herein.
According to this example, block 1505 relates to steady state operation of all participating audio devices. In this context, "steady state" operation means operation according to a set of calibration signal parameters recently received from the orchestration device. According to some embodiments, the set of parameters may include one or more DSSS spreading code parameters and one or more DSSS carrier parameters.
In this example, block 1505 also relates to one or more devices waiting for a trigger condition. The triggering condition may be, for example, an acoustic change in the audio environment in which the orchestrated audio devices are located. The acoustic change may be or may include noise from a noise source, a door or window opening or closing (e.g., causing the audibility of playback sound from one or more loudspeakers in an adjacent room to increase or decrease), movement of an audio device detected in the audio environment, movement of a person detected in the audio environment, an utterance of a person (e.g., of a wake-up word) detected in the audio environment, the start of playback of audio content (e.g., the start of a movie, television program, music content, etc.), a change in playback of audio content (e.g., a volume change equal to or greater than a decibel threshold), and the like. In some cases, the acoustic change is detected via the acoustic calibration signals (e.g., via one or more acoustic scene metrics 225A estimated by the baseband processor 218 of an audio device in the audio environment), for example as disclosed herein.
In some cases, the trigger condition may be an indication that a new audio device has been powered on in the audio environment. In some such examples, the new audio device may be configured to produce one or more characteristic sounds that may or may not be audible to humans. According to some examples, the new audio device may be configured to play back acoustic calibration signals reserved for the new device.
In this example, a determination is made in block 1510 as to whether a trigger condition has been detected. If so, processing proceeds to block 1515. If not, processing returns to block 1505. In some implementations, block 1505 may include block 1510.
According to this example, block 1515 involves determining, by the orchestration device, one or more (and in some cases, all) of the updated acoustic calibration signal parameters of the orchestrated audio device, and providing the updated acoustic calibration signal parameter(s) to the orchestrated audio device(s). In some examples, block 1515 may involve providing, by the orchestration device, calibration signal information 205 described elsewhere herein. The determination of the updated acoustic calibration signal parameter(s) may involve using prior knowledge and estimation of the acoustic space, such as:
Device location;
device range;
device orientation and relative angle of incidence;
relative clock skew and drift between devices;
the relative audibility of the device;
room noise estimation;
the number of microphones and loudspeakers in each device;
directivity of the loudspeaker of each device;
directivity of the microphone of each device;
the type of content rendered into the acoustic space;
the location of one or more listeners in the acoustic space; and/or
Knowledge of the acoustic space, including specular reflection and occlusion.
In some examples, these factors may be combined with the operation target to determine a new operation point. Note that many of these parameters, which are used as prior knowledge in determining the updated calibration signal parameters, may in turn be derived from the acoustic calibration signal. Thus, it can be readily appreciated that in some examples, the orchestration system may iteratively improve its performance as the system obtains more information, more accurate information, etc.
In this example, block 1520 relates to reconfiguring, by one or more orchestrated audio devices, one or more parameters for generating an acoustic calibration signal according to updated acoustic calibration signal parameter(s) received from the orchestration device. According to such an embodiment, after completion of block 1520, processing returns to block 1505. Although the flowchart of fig. 15 does not show an end, the method 1500 may end in various ways, for example, when the audio device is powered down.
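The control flow of fig. 15 can be summarized as a simple trigger-driven loop. The sketch below is a schematic outline only; the three callbacks are placeholders for whatever trigger detection, parameter computation, and distribution mechanisms a given implementation uses.

```python
import time

def orchestration_loop(detect_trigger, compute_parameters, distribute,
                       poll_interval_s=1.0):
    """Steady state (block 1505) until a trigger is detected (block 1510),
    then compute updated calibration parameters and distribute them to the
    orchestrated devices (blocks 1515 and 1520)."""
    while True:
        if detect_trigger():
            params = compute_parameters()
            distribute(params)
        time.sleep(poll_interval_s)
```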
Fig. 16 illustrates another example audio environment. The audio environment 130 shown in fig. 16 is the same as the audio environment shown in fig. 8, but also shows the angular separation of the audio device 100B from the audio device 100C from the perspective of (relative to) the audio device 100A. In fig. 16, audio devices 100B and 100C are separated from device 100A by distances 810 and 811, respectively. In this particular case, distance 811 is greater than distance 810. Assuming that audio devices 100B and 100C are producing audio device playback sounds at approximately the same level, this means that the acoustic calibration signal received by audio device 100A from audio device 100C is at a lower level than the acoustic calibration signal from audio device 100B due to the additional acoustic loss caused by longer distance 811.
In this example, we focus on the orchestration of devices 100B and 100C to optimize the ability of device 100A to hear both. As described above, there are other factors to consider, but this example focuses on the diversity of angles of arrival caused by the angular separation of audio device 100B from audio device 100C relative to audio device 100A. Due to the differences in distances 810 and 811, the orchestration may result in the code lengths of audio devices 100B and 100C being set longer to mitigate near-far issues by reducing cross-channel correlation. However, if the receive side beamformer (215) is implemented by the audio device 100A, the near/far problem is somewhat alleviated because the angular separation between the audio devices 100B and 100C places microphone signals corresponding to sound from the audio devices 100B and 100C in different lobes and provides additional separation of the two received signals. Thus, this additional separation may allow the orchestration device to reduce the acoustic calibration signal length and obtain observations at a faster rate.
This applies to more than just acoustic DSSS spreading code lengths. When audio device 100A (and/or audio devices 100B and 100C) uses a spatial microphone feed instead of an omnidirectional microphone feed, alterations to acoustic calibration parameters that would otherwise be made to mitigate near/far issues (e.g., using FDMA or TDMA) may no longer be needed.
Orchestration according to spatial properties, in this case angular diversity, depends on estimates of these characteristics already being available. In one example, calibration parameters may be optimized for an omnidirectional microphone feed (206), and then, after a DoA estimate becomes available, the acoustic calibration parameters may be optimized for a spatial microphone feed. This is one example of the trigger condition described above with reference to fig. 15.
Fig. 17 is a block diagram illustrating an example of a calibration signal demodulator element, a baseband processor element, and a calibration signal generator element, according to some disclosed embodiments. As with the other figures provided herein, the types and amounts of elements shown in fig. 17 are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements. Other examples may implement other methods, such as frequency domain correlation. In this example, the calibration signal demodulator 214, baseband processor 218, and calibration signal generator 212 are implemented by the example of the control system 160 described above with reference to fig. 1B.
According to some embodiments, there is one instance of the calibration signal demodulator 214, baseband processor 218, and calibration signal generator 212 for each acoustic calibration signal transmitted (played back) from each audio device for which the acoustic calibration signal is to be received. In other words, for the embodiment shown in fig. 16, the audio device 100A will implement one instance of the calibration signal demodulator 214, the baseband processor 218, and the calibration signal generator 212 corresponding to the acoustic calibration signal received from the audio device 100B, and one instance of the calibration signal demodulator 214, the baseband processor 218, and the calibration signal generator 212 corresponding to the acoustic calibration signal received from the audio device 100C.
For purposes of illustration, the following description of fig. 17 will continue to use the example of the audio device 100A of fig. 16 as the local device, i.e., the device that, in this example, implements the instances of the calibration signal demodulator 214, the baseband processor 218, and the calibration signal generator 212. More specifically, the following description of fig. 17 will assume that the microphone signal 206 received by the calibration signal demodulator 214 includes playback sound produced by the loudspeaker of the audio device 100B, which includes an acoustic calibration signal generated by the audio device 100B, and that the instances of the calibration signal demodulator 214, baseband processor 218, and calibration signal generator 212 shown in fig. 17 correspond to the acoustic calibration signal played back by the loudspeaker of the audio device 100B.
In this particular embodiment, the calibration signal is a DSSS signal. Thus, according to such an embodiment, the calibration signal generator 212 includes an acoustic DSSS carrier module 1715, the acoustic DSSS carrier module 1715 being configured to provide the calibration signal demodulator 214 with a DSSS carrier copy 1705 of the DSSS carrier being used by the audio device 100B to generate its acoustic DSSS signal. In some alternative embodiments, the acoustic DSSS carrier module 1715 may be configured to provide the calibration signal demodulator 214 with one or more DSSS carrier parameters used by the audio device 100B to generate its acoustic DSSS signal. In some alternative examples, the calibration signal is another type of calibration signal generated by modulating a carrier wave, such as a maximum length sequence or other type of pseudo-random binary sequence.
In such an embodiment, the calibration signal generator 212 further comprises an acoustic DSSS spreading code module 1720, the acoustic DSSS spreading code module 1720 being configured to provide the DSSS spreading code 1706, which the audio device 100B uses to generate its acoustic DSSS signal, to the calibration signal demodulator 214. DSSS spreading code 1706 corresponds to spreading code C (t) in the equations disclosed herein. DSSS spreading code 1706 may be, for example, a pseudo-random number (PRN) sequence.
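Purely by way of illustration (and not as part of the original disclosure), one common way to generate such a PRN sequence is as a maximum-length sequence produced by a linear-feedback shift register. In the following Python sketch, the register length, tap positions, and seed are illustrative assumptions rather than values specified herein.

import numpy as np

def mls_spreading_code(register_length=10, taps=(10, 3), seed=1):
    # Generate a maximum-length (m-) sequence of +/-1 chips. The register
    # length and taps are illustrative; a 10-bit register with taps (10, 3)
    # yields a code of 2**10 - 1 = 1023 chips.
    state = [(seed >> i) & 1 for i in range(register_length)]
    n_chips = 2 ** register_length - 1
    chips = np.empty(n_chips)
    for n in range(n_chips):
        out = state[-1]
        chips[n] = 1.0 - 2.0 * out          # map {0, 1} -> {+1, -1}
        feedback = 0
        for t in taps:
            feedback ^= state[t - 1]
        state = [feedback] + state[:-1]
    return chips

code = mls_spreading_code()                  # C(t) sampled at the chip rate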
According to this embodiment, the calibration signal demodulator 214 includes a bandpass filter 1703, the bandpass filter 1703 being configured to produce a bandpass filtered microphone signal 1704 from the received microphone signal 206. In some cases, the passband of the bandpass filter 1703 may be centered around the center frequency of the acoustic DSSS signal from the audio device 100B being processed by the calibration signal demodulator 214. The bandpass filter 1703 may, for example, pass a main lobe of the acoustic DSSS signal. In some examples, the passband of the bandpass filter 1703 may be equal to the frequency band used to transmit the acoustic DSSS signal from the audio device 100B.
In this example, the calibration signal demodulator 214 includes a multiplication block 1711A, the multiplication block 1711A being configured to multiply (mix) the band-pass filtered microphone signal 1704 with the DSSS carrier replica 1705 to produce the baseband signal 1700. According to such an embodiment, the calibration signal demodulator 214 further includes a multiplication block 1711B, the multiplication block 1711B being configured to apply the DSSS spreading code 1706 to the baseband signal 1700 to produce a despread baseband signal 1701.
According to this example, the calibration signal demodulator 214 includes an accumulator 1710A and the baseband processor 218 includes an accumulator 1710B. Accumulators 1710A and 1710B may also be referred to herein as summation elements. Accumulator 1710A operates during a time corresponding to the code length of each acoustic calibration signal (in this example, the code length of the acoustic DSSS signal currently being played back by audio device 100B), which may be referred to herein as a "coherence time." In this example, accumulator 1710A implements an "integrate and dump" process; in other words, after summing the despread baseband signal 1701 over the coherence time, the accumulator 1710A outputs ("dumps") the demodulated coherent baseband signal 208 to the baseband processor 218. In some embodiments, the demodulated coherent baseband signal 208 may be a single number.
In this example, baseband processor 218 includes a square law module 1712, which is configured in this example to square the absolute value of the demodulated coherent baseband signal 208 and output a power signal 1722 to accumulator 1710B. After the absolute-value and squaring processing, the power signal may be regarded as an incoherent signal. In this example, accumulator 1710B operates over an "incoherent time." In some examples, the incoherent time may be based on an input from the orchestration device. In some examples, the incoherent time may be based on a desired SNR. According to this example, accumulator 1710B outputs a delay waveform 400 having multiple delays (also referred to herein as instances of "tau" or τ).
Stages 1704 through 208 in fig. 17 may be expressed as follows:

Y(τ) = Σ_n d[n] · C_A(nT_s − τ) · e^(−j2πf_c nT_s)

where T_s denotes the sample period, f_c the carrier frequency, and the sum runs over the samples of one code period (the coherence time). In the above equation, Y(τ) represents the coherent demodulator output (208), d[n] represents the band-pass filtered signal (1704, or A in fig. 17), C_A represents the local replica of the spreading code used by the remote device in the room (audio device 100B in this example) for modulating the calibration signal (a DSSS signal in this example), and the last term is the carrier signal. In some examples, all of these signal parameters are orchestrated between audio devices in an audio environment (e.g., may be determined and provided by an orchestration device).
The signal chain from Y (tau) (208) to < Y (tau) > (400) in fig. 17 is non-coherent integration, where the coherent demodulator output is squared and averaged. The number of averages (the number of non-coherent accumulator 1710B runs) is a parameter that may be determined and provided by the orchestration device in some examples, e.g., based on a determination that sufficient SNR has been achieved. In some cases, the audio device implementing baseband processor 218 may determine the average number of times, e.g., based on determining that sufficient SNR has been achieved.
Incoherent integration can be expressed mathematically as follows:

⟨Y(τ)⟩ = (1/N) Σ_{i=1..N} |Y_i(τ)|²

The above equation involves simple averaging of the squared coherent delay waveform over a period of time defined by N, where N represents the number of blocks used in the incoherent integration.
Fig. 18 shows elements of a calibration signal demodulator according to another example. According to this example, the calibration signal demodulator 214 is configured to produce a delay estimate, a DoA estimate, and an audibility estimate. In this example, the calibration signal demodulator 214 is configured to perform coherent demodulation and then non-coherent integration of the full delay waveform. As in the example described above with reference to fig. 17, we will assume in this example that the calibration signal demodulator 214 is implemented by the audio device 100A and is configured to demodulate an acoustic DSSS signal played back by the audio device 100B.
In this example, the calibration signal demodulator 214 includes a bandpass filter 1703, the bandpass filter 1703 being configured to remove unwanted energy from other audio signals, such as some audio content rendered for the listener's experience and acoustic DSSS signals that have been placed in other frequency bands to avoid near/far problems. For example, the band pass filter 1703 may be configured to pass energy from one of the frequency bands shown in fig. 12 and 13.
The matched filter 1811 is configured to calculate the delay waveform 1802 by correlating the band pass filtered signal 1704 with a local replica of the acoustic calibration signal of interest: in this example, the local replica is an instance of the DSSS signal replica 204 that corresponds to the DSSS signal generated by the audio device 100B. The matched filter output 1802 is then low pass filtered by a low pass filter 712 to produce a coherently demodulated complex delay waveform 208. In some alternative implementations, the low pass filter 712 may be placed in the baseband processor 218 after a squaring operation that produces an incoherent average delay waveform, such as in the example described above with reference to fig. 17.
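A minimal sketch of this matched-filter variant is given below, purely for illustration. The signals are assumed to be real-valued (so the resulting delay waveform is real, whereas 208 is described above as complex), the low-pass cutoff is an arbitrary illustrative choice, and the function and variable names are not taken from the disclosure.

import numpy as np
from scipy.signal import butter, sosfiltfilt

def matched_filter_delay_waveform(bandpassed_mic, dsss_replica, fs_hz, lp_cutoff_hz=50.0):
    # Correlate the band-pass filtered microphone signal (1704) with the local
    # replica of the acoustic DSSS signal (204), then low-pass filter the result
    # to obtain a demodulated delay waveform.
    corr = np.correlate(bandpassed_mic, dsss_replica, mode="full")
    delay_waveform = corr[len(dsss_replica) - 1:]      # keep non-negative lags only
    sos = butter(4, lp_cutoff_hz, btype="low", fs=fs_hz, output="sos")
    return sosfiltfilt(sos, delay_waveform)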
In this example, the channel selector 1813 is configured to control the bandpass filter 1703 (e.g., passband of the bandpass filter 1703) and the matched filter 1811 according to the calibration signal information 205. As described above, the calibration signal information 205 may include parameters that the control system 160 uses to demodulate the calibration signal, etc. In some examples, the calibration signal information 205 may indicate which audio devices are generating acoustic calibration signals. In some examples, the calibration signal information 205 may be received (e.g., via wireless communication) from an external source such as an orchestration device.
Fig. 19 is a block diagram illustrating an example of baseband processor elements according to some disclosed embodiments. Like the other figures provided herein, the types and amounts of elements shown in fig. 19 are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements. In this example, baseband processor 218 is implemented by the example of control system 160 described above with reference to fig. 1B.
In this particular embodiment, no coherent technique is applied. Thus, the first operation performed is to obtain the power of the complex delay waveform 208 via square law module 1712 to produce a non-coherent delay waveform 1922. Incoherent delay waveform 1922 is integrated by accumulator 1710B for a period of time (which is specified in this example in calibration signal information 205 received from the orchestration device, but which may be determined locally in some examples) to produce incoherent average delay waveform 400. According to this example, the delay waveform 400 is then processed in a number of ways, as follows:
1. the leading edge estimator 1912 is configured to perform a delay estimation 1902, which is an estimated time delay of the received signal. In some examples, delay estimate 1902 may be based at least in part on an estimate of a leading edge position of delay waveform 400. According to some such examples, delay estimate 1902 may be determined from a number of time samples of a signal portion (e.g., a positive portion) of the delay waveform up to and including a time sample corresponding to a leading edge position of delay waveform 400 or less than one chip period (inversely proportional to a signal bandwidth) after the leading edge position of delay waveform 400. In the latter case, the delay may be used to compensate for the autocorrelation width of the DSSS code, according to some examples. As the chip rate increases, the peak width of the autocorrelation narrows until it reaches a minimum when the chip rate is equal to the sampling rate. For a given DSSS code, this condition (chip rate equal to sample rate) produces a delayed waveform 400 that is closest to the real impulse response of the audio environment. As the chip rate increases, spectral overlap (aliasing) may occur after calibrating the signal modulator 220A. In some examples, if the chip rate is equal to the sampling rate, the calibration signal modulator 220A may be bypassed or omitted. A chip rate close to the sampling rate (e.g., a chip rate of 80% of the sampling rate, 90% of the sampling rate, etc.) may provide a delayed waveform 400, the delayed waveform 400 being a satisfactory approximation of the actual impulse response for some purposes. In some such examples, delay estimate 1902 may be based in part on information about the calibration signal characteristics (e.g., based on DSSS signal characteristics). In some examples, the leading edge estimator 1912 may be configured to estimate the leading edge position of the delay waveform 400 from a first instance of a value greater than a threshold during a time window. Some examples will be described below with reference to fig. 20. In other examples, the leading edge estimator 1912 may be configured to estimate the leading edge position of the delay waveform 400 from the position of the maximum (e.g., local maximum within a time window), which is an example of "peak picking.
Note that many other techniques may be used to estimate the delay (e.g., peak picking).
2. In this example, baseband processor 218 is configured to make the DoA estimate 1903 by windowing (with windowing block 1913) delay waveform 400 before using delay-and-sum DoA estimator 1914. Delay-and-sum DoA estimator 1914 may perform a DoA estimation based at least in part on a determination of the steered response power (SRP) of delay waveform 400. Thus, delay-and-sum DoA estimator 1914 may also be referred to herein as an SRP module or delay-and-sum beamformer. Windowing helps isolate the time interval around the leading edge so that the resulting DoA estimate is based more on the signal than on noise. In some examples, the window size may be in the range of tens or hundreds of milliseconds, for example, in the range of 10 to 200 milliseconds. In some cases, the window size may be selected based on knowledge of typical room decay times or knowledge of the decay time of the audio environment in question. In some cases, the window size may be adaptively updated over time. For example, some embodiments may involve determining a window size that results in at least some portion of the window being occupied by a signal portion of the delay waveform 400. Some such implementations may involve estimating noise power from time samples that occur before the leading edge. Some such implementations may involve selecting a window size that will result in at least a threshold percentage of the window being occupied by portions of the delay waveform corresponding to at least a threshold signal level (e.g., at least 6 dB greater than the estimated noise power, at least 8 dB greater than the estimated noise power, at least 10 dB greater than the estimated noise power, etc.).
3. According to this example, baseband processor 218 is configured to perform audibility estimation 1904 by estimating signal-to-noise ratio power using SNR estimation block 1915. In this example, SNR estimation block 1915 is configured to extract signal power estimate 402 and noise power estimate 401 from delay waveform 400. According to some such examples, SNR estimation block 1915 may be configured to determine a signal portion and a noise portion of delay waveform 400, as described below with reference to fig. 20. In some such examples, SNR estimation block 1915 may be configured to determine signal power estimate 402 and noise power estimate 401 by averaging the signal portion and the noise portion over the selected time window. In some such examples, SNR estimation block 1915 may be configured to perform SNR estimation based on a ratio of signal power estimate 402 to noise power estimate 401. In some cases, baseband processor 218 may be configured to perform audibility estimation 1904 from SNR estimation. For a given amount of noise power, the SNR is proportional to the audibility of the audio device. Thus, in some embodiments, the SNR may be directly used as a representation (e.g., a proportional value) of an estimate of the audibility of the actual audio device. Some implementations involving calibrating microphone feed may involve measuring absolute audibility (e.g., in dBSPL) and converting SNR to an absolute audibility estimate. In some such embodiments, the method for determining an absolute audibility estimate will take into account acoustic losses due to the distance between audio devices and the variability of noise in the room. In other embodiments, other techniques are used to estimate signal power, noise power, and/or relative audibility from the delayed waveform.
Fig. 20 shows an example of a delay waveform. In this example, the delay waveform 400 has been output by an instance of the baseband processor 218. According to this example, the vertical axis indicates power and the horizontal axis indicates pseudorange in meters. As described above, the baseband processor 218 is configured to extract delay information, sometimes referred to herein as τ, from the demodulated acoustic calibration signal. The value of τ may be converted to pseudorange measurements, sometimes referred to herein as ρ, as follows:
ρ = τc
in the above expression, c represents the sound velocity. In fig. 20, the delay waveform 400 includes a noise portion 2001 (which may also be referred to as a noise floor) and a signal portion 2002. Negative values in the pseudorange measurements (and corresponding delay waveforms) may be identified as noise: since the negative range (distance) has no physical meaning, the power corresponding to the negative pseudo range is assumed to be noise.
In this example, the signal portion 2002 of waveform 400 includes a leading edge 2003 and a trailing edge. If the power of the signal portion 2002 is relatively strong, the leading edge 2003 is a significant feature of the delay waveform 400. In some examples, the leading edge estimator 1912 of fig. 19 may be configured to estimate the location of the leading edge 2003 based on the first instance that the power value is greater than the threshold during the time window. In some examples, the time window may begin when τ (or ρ) is zero. In some cases, the window size may be in the range of tens or hundreds of milliseconds, for example, in the range of 10 to 200 milliseconds. According to some embodiments, the threshold may be a previously selected value, e.g., -5dB, -4dB, -3dB, -2dB, etc. In some alternative examples, the threshold may be based on power in at least a portion of the delay waveform 400, e.g., an average power of the noise portion.
However, as described above, in other examples, the leading edge estimator 1912 may be configured to estimate the location of the leading edge 2003 based on the location of the maximum (e.g., a local maximum within a time window). In some cases, the time window may be selected as described above.
In some examples, SNR estimation block 1915 of fig. 19 may be configured to determine an average noise value corresponding to at least a portion of noise portion 2001 and an average or peak signal value corresponding to at least a portion of signal portion 2002. In some such examples, SNR estimation block 1915 of fig. 19 may be configured to estimate the SNR by dividing the average signal value by the average noise value.
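The noise/signal partition of the delay waveform, the threshold-based leading-edge estimate, and the resulting SNR (audibility) estimate might be sketched as follows, purely for illustration. The speed of sound constant and the margin above the noise floor are assumed values, corresponding to one of the threshold options described above; the pseudorange axis follows ρ = τc.

import numpy as np

SPEED_OF_SOUND_M_S = 343.0   # approximate speed of sound (an assumed constant)

def analyze_delay_waveform(delay_waveform, tau_s, margin_db=6.0):
    # Split the (power) delay waveform into its noise portion (negative
    # pseudorange, which has no physical meaning) and its signal portion,
    # estimate the leading-edge delay by thresholding, and form an SNR estimate.
    tau_s = np.asarray(tau_s)
    rho = tau_s * SPEED_OF_SOUND_M_S                 # pseudorange: rho = tau * c
    noise = delay_waveform[rho < 0.0]
    signal = delay_waveform[rho >= 0.0]
    signal_tau = tau_s[rho >= 0.0]
    noise_power = float(np.mean(noise))
    threshold = noise_power * 10.0 ** (margin_db / 10.0)   # noise floor plus margin
    above = np.nonzero(signal > threshold)[0]
    if above.size == 0:
        return None, 0.0                             # no detectable leading edge
    leading_edge = above[0]
    delay_estimate_s = signal_tau[leading_edge]      # estimated time delay (1902)
    signal_power = float(np.mean(signal[leading_edge:]))
    snr = signal_power / noise_power                 # proportional to audibility
    return delay_estimate_s, snr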
Noise compensation (e.g., auto-leveling of speaker playback content) to compensate for ambient noise conditions is a well known and desirable feature, but has not previously been achieved in an optimal manner. Measuring ambient noise conditions using microphones also measures speaker playback content, which presents a significant challenge to achieving noise estimation (e.g., online noise estimation) required for noise compensation.
Since people in an audio environment may typically be located outside the critical acoustic distance of any given room, echoes introduced from other devices that are similarly distant may still represent significant echo effects. Even if complex multi-channel echo cancellation is available and in some way achieves the required performance, the logistical effort to provide remote echo references for the canceller can have unacceptable bandwidth and complexity costs.
Some disclosed embodiments provide a method of continuously calibrating a constellation of audio devices in an audio environment via persistent (e.g., continuous or at least still ongoing) characterization of an acoustic space that includes people, devices, and audio conditions such as noise and/or echoes. In some disclosed examples, such processing continues even though the media is being played back via an audio device of the audio environment.
As used herein, a "gap" in a playback signal refers to a time (or time interval) of the playback signal at which (or in) playback content is lost (or has a level less than a predetermined threshold). For example, a "gap" (also referred to herein as a "forced gap" or "parameterized forced gap") may be the decay of playback content over a range of frequencies during a time interval. In some disclosed embodiments, gaps may be inserted in one or more frequency ranges of an audio playback signal of a content stream to produce a modified audio playback signal and the modified audio playback signal may be reproduced or "played back" in an audio environment. In some such embodiments, N gaps may be inserted into N frequency ranges of the audio playback signal during N time intervals.
According to some such embodiments, M audio devices may program their gaps in time and frequency, allowing accurate detection of the far field (for each device) in the gapped frequency range and time interval. These "programmed gaps" are an important aspect of the present disclosure. In some examples, M may be a number corresponding to all audio devices of the audio environment. In some cases, M may be a number corresponding to all audio devices in the audio environment except a target audio device, which is an audio device whose playback audio is sampled by one or more microphones of the M orchestrated audio devices of the audio environment, e.g., to evaluate the relative audibility, location, nonlinearity, and/or other characteristics of the target audio device. In some examples, the target audio device may reproduce an unmodified audio playback signal that does not include a gap inserted into any frequency range. In other examples, M may be a number corresponding to a subset of the audio devices of the audio environment (e.g., a plurality of participating non-target audio devices).
Desirably, the orchestrated gap should have a low perceived impact (e.g., negligible perceived impact) on a listener in the audio environment. Thus, in some examples, the gap parameter may be selected to minimize perceived effects.
In some examples, the target device may reproduce an unmodified audio playback signal that does not include a gap inserted into any frequency range when the modified audio playback signal is being played back in an audio environment. In such an example, the relative audibility and/or location of the target device may be estimated from the perspective of the M audio devices that are rendering the modified audio playback signal.
Fig. 21 shows another example audio environment. As with the other figures provided herein, the types and amounts of elements shown in fig. 21 are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements.
According to this example, the audio environment 2100 includes a main living space 2101a and a room 2101b adjacent to the main living space 2101 a. Here, wall 2102 and door 2111 separate main living space 2101a from room 2101b. In this example, the amount of acoustic separation between the main living space 2101a and the room 2101b depends on whether the door 2111 is open or closed, and if so, on the degree to which the door 2111 is open.
At a time corresponding to fig. 21, a smart Television (TV) 2103a is located within an audio environment 2100. According to this example, the smart TV 2103a includes a left microphone 2103b and a right microphone 2103c.
In this example, at times corresponding to fig. 21, intelligent audio devices 2104, 2105, 2106, 2107, 2108, 2109, and 2113 are also located within audio environment 2100. According to this example, each smart audio device 2104-2109 includes at least one microphone and at least one loudspeaker. However, in this case, the smart audio devices 2104-2109 and 2113 include loudspeakers of various sizes and with various capabilities.
According to this example, at least one acoustic event is occurring in the audio environment 2100. In this example, an acoustic event is caused by a talker 2110 who is speaking a voice command 2112.
In this example, another acoustic event is caused at least in part by the variable element 2115. Here, the variable element 2115 is a gate of the audio environment 2100. According to this example, as the door 2115 is opened, sound from outside the environment may be more clearly perceived inside the audio environment 2100. In addition, the varying angle of the gate 2115 changes some echo paths within the audio environment 2100. According to this example, element 2114 represents a variable element of the impulse response of the audio environment 2100 caused by the varying position of the gate 2115.
In some examples, a series of forced gaps is inserted in the playback signal, each forced gap being located in a different frequency band (or set of frequency bands) of the playback signal, to allow a pervasive listener to monitor non-playback sound that occurs "in" each forced gap, in the sense of occurring during the time interval in which the gap occurs and in the frequency band(s) in which the gap is inserted. Fig. 22A is an example spectrogram of a modified audio playback signal. In this example, the modified audio playback signal is created by inserting gaps into an audio playback signal. More specifically, to generate the spectrogram of fig. 22A, the disclosed method was performed on an audio playback signal to introduce forced gaps (e.g., gaps G1, G2, and G3 shown in fig. 22A) in frequency bands thereof, thereby generating the modified audio playback signal. In the spectrogram shown in fig. 22A, position along the horizontal axis indicates time, and position along the vertical axis indicates the frequency of the modified audio playback signal content at a given time. The density of points in each small region (in this example, each such region centered on a point having vertical and horizontal coordinates) indicates the energy of the content of the modified audio playback signal at the corresponding frequency and time: denser areas indicate content having more energy, and less dense areas indicate content having lower energy. The gap G1 occurs at a time (in other words, during a time interval) earlier than the time at which (the time interval during which) gap G2 or G3 occurs, and the gap G1 has been inserted into a higher frequency band than the frequency bands into which gaps G2 and G3 are inserted.
The introduction of forced gaps into the playback signal according to some disclosed methods is different from simplex device operation in which the device pauses the content playback stream (e.g., to better hear the user and the user's environment). The introduction of forced gaps according to some disclosed methods may be optimized to significantly reduce (or eliminate) the perceptibility of artifacts produced by the introduced gaps during playback, preferably such that the forced gaps have no or minimal perceptible impact on a user, but such that the output signals of microphones in the playback environment are indicative of the forced gaps (e.g., so that a pervasive listening method may be implemented using the gaps). By using forced gaps introduced according to some disclosed methods, a pervasive listening system can monitor non-playback sound (e.g., sound indicative of background activity and/or noise in the playback environment) even without the use of acoustic echo cancellers.
With reference to fig. 22B and 22C, we next describe examples of parameterized forced gaps that may be inserted into a frequency band of an audio playback signal, and criteria for selecting parameters of such forced gaps. Fig. 22B is a graph showing an example of a gap in the frequency domain. Fig. 22C is a graph showing an example of a gap in the time domain. In these examples, a parameterized forced gap is an attenuation of the playback content using a band attenuation G whose distribution in time and frequency is similar to that shown in figs. 22B and 22C. Here, a gap is forced by applying the attenuation G to the playback signal over a frequency range ("band") defined by a center frequency f0 (indicated in fig. 22B) and a bandwidth B (also indicated in fig. 22B), the attenuation varying as a function of time at each frequency in the band (e.g., in each frequency interval within the band) with a distribution similar to that shown in fig. 22C. The maximum of the attenuation G (as a function of frequency across the band) may be controlled to increase from 0 dB (at the lowest frequency of the band) to a maximum attenuation (suppression depth) Z at the center frequency f0 (as indicated in fig. 22B), and to decrease back to 0 dB (at the highest frequency of the band) as frequency increases above the center frequency.
In this example, the graph of fig. 22B indicates a distribution of band attenuation G applied to frequency components of an audio signal to force gaps in the audio content of the in-band signal as a function of frequency (i.e., frequency bins). The audio signal may be a playback signal (e.g., channels of a multi-channel playback signal), and the audio content may be playback content.
According to this example, the graph of fig. 22C shows the distribution, as a function of time, of the band attenuation G applied to the frequency component at the center frequency f0 to force the gap shown in fig. 22B in the in-band audio content of the signal. For each other frequency component in the band, the band gain as a function of time may have a distribution similar to that shown in fig. 22C, but with the suppression depth Z of fig. 22C replaced by an interpolated suppression depth kZ, where k is a factor ranging from 0 to 1 (as a function of frequency) in this example, so that kZ has the distribution shown in fig. 22B. In some examples, the attenuation G may also be interpolated in time from 0 dB to the suppression depth kZ (e.g., k = 1 at the center frequency, as shown in fig. 22C) for each frequency component, e.g., to reduce musical artifacts due to introducing the gap. Three regions (time intervals) t1, t2, and t3 of the latter interpolation are shown in fig. 22C.
Thus, when a gap forcing operation occurs in a particular frequency band (e.g., the band centered at the center frequency f0, as shown in fig. 22B), the attenuation G applied to each frequency component in the band (e.g., to each frequency interval within the band) follows, in this example, a trajectory as shown in fig. 22C. Starting from 0 dB, it drops to a depth of −kZ dB over t1 seconds, remains there for t2 seconds, and finally rises back to 0 dB over t3 seconds. In some implementations, the total time t1+t2+t3 may be selected to account for the time resolution of whatever frequency transform is used to analyze the microphone feed, as well as for a duration short enough to be unobtrusive to users. Some examples of t1, t2, and t3 for single-device implementations are shown in table 1 below.
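Purely by way of illustration, the band attenuation described above (an interpolated depth kZ across the band, with a ramp-hold-ramp trajectory in time) might be generated as in the following sketch. The linear interpolation of k across the band, and the particular values of f0, bandwidth, Z, t1, t2, and t3 used in the example call, are assumptions for illustration only.

import numpy as np

def gap_gain_db(freqs_hz, times_s, f0_hz, bandwidth_hz, depth_z_db, t1_s, t2_s, t3_s):
    # Attenuation G in dB as a (frequency, time) array implementing a
    # parameterized forced gap: depth kZ interpolated across the band
    # (fig. 22B) and a ramp-hold-ramp trajectory in time (fig. 22C).
    # k rises from 0 at the band edges to 1 at the center frequency f0
    # (linear interpolation is an assumption; the figures may show another shape).
    k = np.clip(1.0 - np.abs(np.asarray(freqs_hz) - f0_hz) / (bandwidth_hz / 2.0), 0.0, 1.0)
    # Time envelope: 0 -> 1 over t1, hold at 1 for t2, 1 -> 0 over t3.
    env = np.interp(np.asarray(times_s),
                    [0.0, t1_s, t1_s + t2_s, t1_s + t2_s + t3_s],
                    [0.0, 1.0, 1.0, 0.0])
    return -depth_z_db * np.outer(k, env)   # negative dB values denote attenuation

# Example (illustrative values): band centered at 1 kHz, 400 Hz wide, 20 dB deep,
# t1 = 8 ms, t2 = 110 ms, t3 = 12 ms.
freqs = np.linspace(800.0, 1200.0, 64)
times = np.linspace(0.0, 0.130, 131)
G = gap_gain_db(freqs, times, f0_hz=1000.0, bandwidth_hz=400.0, depth_z_db=20.0,
                t1_s=0.008, t2_s=0.110, t3_s=0.012)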
Some of the disclosed methods involve inserting forced gaps according to a predetermined, fixed banding structure that covers the entire spectrum of the audio playback signal and includes B_count bands (where B_count is a number, e.g., B_count = 49). To force a gap in any band, a band attenuation is applied in that band in such examples. Specifically, for the j-th band, an attenuation Gj may be applied to the frequency region defined by that band.
Table 1 below shows examples of values of the parameters t1, t2, and t3, of the depth Z for each band, and of the number of bands B_count, for a single-device embodiment.
TABLE 1
In determining the number of bands and the width of each band, there is a trade-off between the perceptual impact and the usefulness of the gaps: narrower bands with gaps generally have less perceptual impact, whereas wider bands with gaps are better for implementing noise estimation (and other pervasive listening methods) across all bands of the full spectrum, for example in response to background noise or changes in playback environment conditions, and for reducing the time required to converge to a new noise estimate (or other value monitored by pervasive listening) (the "convergence" time). If only a limited number of gaps can be forced at a time, sequentially forcing gaps in a large number of small bands takes longer than sequentially forcing gaps in a smaller number of larger bands, resulting in a relatively longer convergence time. Larger frequency bands (with gaps) provide a large amount of information about background noise (or other values monitored by pervasive listening) at once, but generally have a larger perceptual impact.
In the early work of the present inventors, gaps were created in the single device context where the echo effects were primarily (or entirely) near field. Near-field echoes are largely affected by the direct path of the audio from the speaker to the microphone. This feature holds for almost all compact duplex audio devices (such as smart audio devices), with the exception of devices with a large housing and significant acoustic decoupling. By introducing a short, perceptually masked gap in playback, such as the gap shown in table 1, the audio device can obtain a glance at the acoustic space where the audio device is deployed through its own echo.
However, when other audio devices play content in the same audio environment as well, the inventors have found that the gap of individual audio devices becomes less useful due to far-field echo corruption. Far-field echo corruption often reduces the performance of local echo cancellation, significantly degrading overall system performance. Far field echo damage is difficult to remove for a variety of reasons. One reason is that obtaining a reference signal may require increasing network bandwidth and increasing the complexity of additional delay estimation. Moreover, as noise conditions increase and responses are longer (more reverberations and time spread), it becomes more difficult to estimate the far field impulse response. In addition, far field echo corruption is often associated with near field echoes and other far field echo sources, further challenging far field impulse response estimation.
The inventors have also found that if multiple audio devices in an audio environment program their gaps in time and frequency, a clearer far-field perception (relative to each audio device) can be obtained when the multiple audio devices reproduce the modified audio playback signal. The inventors have also found that if the target audio device plays back an unmodified audio playback signal when the plurality of audio devices reproduce the modified audio playback signal, the relative audibility and position of the target device can be estimated from the perspective of each of the plurality of audio devices even if the media content is being played.
Moreover, and perhaps counterintuitively, the inventors have found that breaking the guidelines previously used for single-device embodiments (e.g., keeping the gap open longer than indicated in table 1) results in embodiments that are suitable for multiple devices making collaborative measurements via the programmed gaps.
For example, in some programmed gap embodiments, t2 may be longer than indicated in table 1 in order to accommodate various acoustic path lengths (acoustic delays) between multiple distributed devices in an audio environment, which may be on the order of meters (as opposed to fixed microphone-speaker acoustic path lengths on a single device, which are at most tens of centimeters apart). In some examples, the default t2 value may be, for example, 25 milliseconds greater than the 80 millisecond value indicated in table 1, so as to allow up to 8 meters of separation between the orchestrated audio devices. In some programmed gap embodiments, the default t2 value may be longer than the 80 millisecond value indicated in table 1 for another reason: in the programmed gap embodiment, t2 is preferably longer to accommodate timing misalignment of the programmed audio devices to ensure that a sufficient time has elapsed during which all programmed audio devices reach the value of Z-decay. In some examples, an additional 5 milliseconds may be added to the default value of t2 to accommodate timing misalignment. Thus, in some programmed gap embodiments, t2 may default to 110 milliseconds, with a minimum of 70 milliseconds and a maximum of 150 milliseconds.
In some programmed gap embodiments, t1 and/or t3 may also be different from the values indicated in table 1. In some examples, t1 and/or t3 may be adjusted because the listener cannot perceive different times at which the device enters or exits its decay period due to time problems and physical distance differences. Due at least in part to spatial masking (caused by multiple devices playing back audio from different locations), listeners often have a lower ability to perceive different times at which programmed audio devices enter or exit the decay period than in a single device scene. Thus, in some programmed gap embodiments, the minimum value of t1 and t3 may be reduced and the maximum value of t1 and t3 may be increased, as compared to the single device example shown in table 1. According to some such examples, the minimum value of t1 and t3 may decrease to 2, 3, or 4 milliseconds and the maximum value of t1 and t3 may increase to 20, 25, or 30 milliseconds.
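The adjustment of t2 for orchestrated operation described above might be captured by a simple rule of thumb, sketched below for illustration: start from the single-device hold time, add the acoustic propagation delay corresponding to the largest expected device separation, and add a margin for timing misalignment. The default values shown are drawn from the examples above; the function itself is not part of the disclosure.

def orchestrated_hold_time_s(t2_single_s=0.080, max_separation_m=8.0,
                             sync_margin_s=0.005, speed_of_sound_m_s=343.0):
    # Extend the single-device hold time t2 (80 ms in table 1) by the acoustic
    # delay of the largest expected separation between orchestrated devices
    # (roughly 25 ms for 8 m) plus a margin for timing misalignment (~5 ms),
    # giving a default on the order of 110 ms.
    return t2_single_s + max_separation_m / speed_of_sound_m_s + sync_margin_s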
Measurement examples using orchestrated gaps
Fig. 22D illustrates an example of modifying an audio playback signal that includes programmed gaps for a plurality of audio devices in an audio environment. In this embodiment, multiple intelligent devices of the audio environment orchestrate the gap to estimate each other's relative audibility. In this example, one measurement session corresponding to one gap is conducted during the time interval, and the measurement session includes only devices in the main living space 2101a of fig. 21. According to this example, the previous audibility data has shown that the intelligent audio device 2109 located in room 2101b has been classified as being almost inaudible to other audio devices and has been placed in a separate zone.
In the example shown in fig. 22D, the programmed gap is an attenuation of the playback content using a band attenuation Gk, where k represents the center frequency of the band being measured. The elements shown in fig. 22D are as follows:
graph 2203 is a plot of Gk in dB for the smart audio device 2113 of fig. 21;
graph 2204 is a plot of Gk in dB for the smart audio device 2104 of fig. 21;
graph 2205 is a plot of Gk in dB for the smart audio device 2105 of fig. 21;
graph 2206 is a plot of Gk in dB for the smart audio device 2106 of fig. 21;
graph 2207 is a plot of Gk in dB for the smart audio device 2107 of fig. 21;
graph 2208 is a plot of Gk in dB for the smart audio device 2108 of fig. 21; and
graph 2209 is a plot of Gk in dB for the smart audio device 2109 of fig. 21.
As used herein, the term "session" (also referred to herein as "measurement session") refers to a period of time during which measurements of a frequency range are performed. During a measurement session, a set of frequencies with associated bandwidths, and a set of participating audio devices, may be specified.
One audio device may optionally be designated as the "target" audio device for the measurement session. If the target audio device is involved in a measurement session, then according to some examples, the target audio device will be allowed to ignore the forced gaps and will play the unmodified audio playback signal during the measurement session. According to some such examples, other participating audio devices will listen to the target device playback sound, including the target device playback sound in the frequency range being measured.
As used herein, the term "audibility" refers to the degree to which a device can hear the speaker output of another device. Some examples of audibility are provided below.
According to the example shown in fig. 22D, at time t1, the orchestration device initiates a measurement session with the intelligent audio device 2113 as the target audio device, selecting one or more interval center frequencies to be measured, including frequency k. In some examples, the orchestration device may be an intelligent audio device that acts as a leader. In other examples, the orchestration device may be another orchestration device, such as a smart home hub. This measurement session runs from time t1 to time t2. Other participating smart audio devices, smart audio devices 2104-2108, will apply gaps in their outputs and will render modified audio playback signals, while smart audio device 2113 will play unmodified audio playback signals.
The subset of intelligent audio devices (intelligent audio devices 2104-2108) that are rendering the audio environment 2100 that includes the modified audio playback signal for the scheduled gap is one example that may be referred to as M audio devices. According to this example, the smart audio device 2109 will also play an unmodified audio playback signal. Thus, the smart audio device 2109 is not one of the M audio devices. However, because the smart audio device 2109 cannot be heard by other smart audio devices of the audio environment, the smart audio device 2109 is not the target audio device in this example, although both the smart audio device 2109 and the target audio device (smart audio device 2113 in this example) will play back the unmodified audio playback signal.
During the measurement session, the gaps that are desirably orchestrated should have a low perceived impact (e.g., negligible perceived impact) on listeners in the audio environment. Thus, in some examples, the gap parameter may be selected to minimize perceived effects. Some examples are described below with reference to fig. 22B-22E.
During this time (the measurement session from time t1 to time t 2), the smart audio devices 2104-2108 will receive a reference audio interval from the target audio device (smart audio device 2113) for the time-frequency data of this measurement session. In this example, the reference audio interval corresponds to a playback signal that the smart audio device 2113 uses as a local reference for echo cancellation. The smart audio device 2113 may access these reference audio intervals for audibility measurement and echo cancellation purposes.
According to this example, at time t2, the first measurement session ends and the orchestration device initiates a new measurement session, this time selecting one or more interval center frequencies that do not include frequency k. In the example shown in fig. 22D, no gap is applied at frequency k during the period t2 to t3, and thus the graphs show unity gain for all devices. In some such examples, the orchestration device may cause a series of gaps to be inserted into each of multiple frequency ranges for a series of measurement sessions whose interval center frequencies do not include frequency k. For example, for the purposes of second through Nth subsequent measurement sessions, the orchestration device may insert second through Nth gaps into second through Nth frequency ranges of the audio playback signal during second through Nth time intervals, while the intelligent audio device 2113 remains the target audio device.
In some such examples, the orchestration device may then select another target audio device, e.g., the intelligent audio device 2104. The orchestration device may indicate that the intelligent audio device 2113 is one of M intelligent audio devices that are playing back a modified audio playback signal with the orchestrated gap. The orchestration device may instruct the new target audio device to reproduce the unmodified audio playback signal. According to some such examples, after the orchestration device has caused N measurement sessions to occur for the new target audio device, the orchestration device may select another target audio device. In some such examples, the orchestration device may continue to cause the measurement session to occur until the measurement session has been performed for each participating audio device in the audio environment.
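The session sequencing described above (each participating audio device targeted in turn, with one measurement session per frequency band while the target plays unmodified audio) might be laid out as a schedule, as in the following illustrative sketch. The device identifiers and the number of bands are placeholders, not values taken from the disclosure.

def audibility_measurement_schedule(device_ids, frequency_bands):
    # Build the sequence of measurement sessions: each participating device is
    # targeted in turn, and one session is run per frequency band. The target
    # plays an unmodified audio playback signal while the remaining ("M")
    # devices insert orchestrated gaps in the session's band and listen.
    schedule = []
    for target in device_ids:
        gapped_devices = [d for d in device_ids if d != target]
        for band in frequency_bands:
            schedule.append({"target": target,
                             "gapped_devices": gapped_devices,
                             "band": band})
    return schedule

# Example: seven participating devices and twenty bands give 7 x 20 sessions,
# consistent with multiplying the single-pass completion time by the number of devices.
sessions = audibility_measurement_schedule(
    [f"device_{i}" for i in range(1, 8)], list(range(20)))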
In the example shown in fig. 22D, different types of measurement sessions occur between times t3 and t 4. According to this example, at time t3, in response to user input (e.g., a voice command to an intelligent audio device acting as an orchestration device), the orchestration device initiates a new session to fully calibrate loudspeaker settings of the audio environment 2100. In general, during a measurement session such as a "set" or "recalibration" that occurs between times t3 and t4, the user may be relatively more tolerant of programmed gaps having relatively high perceived effects. Thus, in this example, a large set of consecutive frequencies is selected for measurement, including k. According to this example, the smart audio device 2106 is selected as the first target audio device during this measurement session. Thus, during the first phase of the measurement session from time t3 to t4, all intelligent audio devices except intelligent audio device 2106 will apply the gap.
Gap bandwidth
Fig. 23A is a graph showing an example of a filter response for creating a gap and a filter response for measuring a frequency region of a microphone signal used during a measurement session. According to this example, the elements of fig. 23A are as follows:
Element 2301 represents the amplitude response of the filter used to create a gap in the output signal;
element 2302 represents the amplitude response of the filter for measuring the frequency region corresponding to the gap caused by element 2301;
elements 2303 and 2304 represent the −3 dB points of 2301, with frequencies f1 and f2; and
elements 2305 and 2306 represent the −3 dB points of 2302, with frequencies f3 and f4.
The bandwidth (bw_gap) of the gap response 2301 can be found by taking the difference between the −3 dB points 2303 and 2304: bw_gap = f2 − f1. Similarly, bw_measure (the bandwidth of the measurement response 2302) = f4 − f3.
According to one example, the measurement quality may be expressed as follows:

quality = bw_gap / bw_measure
because the bandwidth of the measurement response is typically fixed, the quality of the measurement can be adjusted by increasing the bandwidth (e.g., widening the bandwidth) of the gap filter response. However, the bandwidth of the incoming gap is proportional to its perceptibility. Therefore, the bandwidth of the gap filter response should generally be determined based on the measured quality and perceptibility of the gap. Some examples of quality values are shown in table 2:
TABLE 2
Although table 2 indicates "minimum" and "maximum" values, these values apply only to this example. Other embodiments may involve quality values lower than 1.5 and/or higher than 3.
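For illustration, the quality computation from the −3 dB points discussed above may be sketched as follows; the 1.5 to 3 range noted in the comment reflects the example values of table 2.

def measurement_quality(f1_hz, f2_hz, f3_hz, f4_hz):
    # quality = bw_gap / bw_measure, where bw_gap = f2 - f1 (the -3 dB points of
    # the gap filter response) and bw_measure = f4 - f3 (the -3 dB points of the
    # measurement filter response). Table 2's example values suggest keeping
    # this ratio roughly between 1.5 and 3.
    return (f2_hz - f1_hz) / (f4_hz - f3_hz)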
Gap allocation strategy
The gap may be defined by:
an underlying partition of the spectrum, with center frequencies and measurement bandwidths;
aggregation of these minimum measurement bandwidths in a structure called "banding";
the duration, the attenuation depth, and the inclusion of one or more consecutive frequencies that meet the agreed spectral division; and
other temporal behavior such as the slope of the decay depth at the beginning and end of the gap.
According to some embodiments, the gap may be selected according to a strategy aimed at measuring and observing as much of the audible spectrum as possible in as short a time as possible, while meeting applicable perceptibility constraints.
Fig. 23B, 23C, 23D, and 23E are graphs showing examples of gap allocation strategies. In these examples, time is represented by distance along the horizontal axis and frequency is represented by distance along the vertical axis. These graphs provide examples to illustrate the patterns generated by the various gap allocation strategies, and the time they take to measure the complete audio spectrum. In these examples, the length of each orchestrated gap measurement session is 10 seconds. As with other disclosed embodiments, these graphs are provided by way of example only. Other embodiments may include more, fewer, and/or different types, numbers, and/or orders of elements. For example, in other embodiments, each orchestrated gap measurement session may be longer or shorter than 10 seconds. In these examples, the unshaded region 2310 (which may be referred to herein as a "tile") of the time/frequency space represented in fig. 23B-23E represents the gap at the indicated time-frequency period (10 seconds). The moderately shaded region 2315 represents a frequency bin that has been measured at least once. Light shaded area 2320 has not been measured.
Assuming that the task at hand requires the participating audio devices to insert orchestrated gaps in order to "listen to the room" (e.g., to evaluate noise, echo, etc. in the audio environment), the measurement completion times are as indicated in figs. 23B-23E. If the task requires that each audio device be targeted in turn and listened to by the other audio devices, these times must be multiplied by the number of audio devices participating in the process. For example, if each audio device is targeted in turn, the three minutes and twenty seconds (3m20s) shown in fig. 23B as the measurement completion time would mean that a system of 7 audio devices would be fully mapped after 7 x 3m20s = 23m20s. When cycling between frequencies/bands and forcing multiple gaps at the same time, the gaps in these examples are spaced as far apart in frequency as possible to improve the efficiency of covering the spectrum.
Fig. 23B and 23C are graphs showing examples of programmed gap sequences according to a gap allocation strategy. In these examples, the gap allocation policy involves gapping N entire frequency bands at a time during each successive measurement session (each frequency band comprising at least one frequency interval and, in most cases, a plurality of frequency intervals). In fig. 23B, N=1; in fig. 23C, N=3, the latter meaning that the example of fig. 23C involves inserting three gaps in the same time interval. In these examples, the banding used is a 20-band Mel-spaced arrangement. According to some such examples, the sequence may be restarted after all 20 bands have been measured. Although 3m20s is a reasonable time in which to reach a complete measurement, the gaps forced in the critical audio region of 300Hz-8kHz are very wide, and a significant amount of time is spent making measurements outside this region. This particular strategy will be very perceptible to the user due to the relatively wide gaps in the 300Hz-8kHz frequency range.
Fig. 23D and 23E are graphs showing examples of sequences of gaps that are scheduled according to another gap allocation strategy. In these examples, the gap allocation strategy involves modifying the banding structure shown in fig. 23B and 23C to map to an "optimized" frequency region of approximately 300Hz to 8 kHz. The overall allocation strategy is otherwise unchanged from the strategy represented in fig. 23B and 23C, but the sequence ends slightly earlier since the 20 th band is now ignored. The gap bandwidth forced here will still be perceptible. However, the benefit is a very fast measurement of the optimized frequency region, especially when the gap is forced into multiple frequency bands at once.
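The allocation strategies of figs. 23B-23E (gapping N bands per session, spacing simultaneous gaps as far apart in frequency as possible, and cycling until every band has been measured) might be scheduled as in the following illustrative sketch. The band count of 19 reflects the "optimized" banding described above with the 20th band ignored; the function is an assumption, not the disclosed implementation.

import math

def allocate_gap_bands(num_bands=19, gaps_per_session=3):
    # Return, per measurement session, the indices of the bands to gap.
    # Bands gapped in the same session are spaced as far apart in frequency as
    # possible (a stride of ceil(num_bands / gaps_per_session)), and the schedule
    # cycles until every band has been measured at least once.
    stride = math.ceil(num_bands / gaps_per_session)
    return [list(range(start, num_bands, stride)) for start in range(stride)]

# Example: 19 "optimized" bands (roughly 300 Hz to 8 kHz), three gaps at a time.
schedule = allocate_gap_bands(num_bands=19, gaps_per_session=3)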
Fig. 24 shows another example of an audio environment. In fig. 24, environment 2409 (an acoustic space) includes a user 2401 who is uttering direct local speech 2402, and an example of a system that includes a set of smart audio devices (2403 and 2405), speakers for audio output, and microphones. The system may be configured in accordance with embodiments of the present disclosure. The speech uttered by user 2401 (sometimes referred to herein as a talker) may be recognized by element(s) of the system in the programmed time-frequency gaps.
More specifically, the elements of the system of fig. 24 include:
2402: direct local speech (generated by user 2401);
2403: a voice assistant device (coupled to one or more microphones). Device 2403 is closer to user 2401 than device 2405, so device 2403 is sometimes referred to as a "near" device, while device 2405 is referred to as a "far" device;
2404: a plurality of microphones in the near device 2403 (or coupled to the near device 2403);
2405: a voice assistant device (coupled to one or more microphones);
2406: a plurality of microphones in remote device 2405 (or coupled to remote device 2405);
2407: household appliances (e.g., electric lamps); and
2408: a plurality of microphones in the home appliance 2407 (or coupled to the home appliance 2407). In some examples, each microphone 2408 may be configured to communicate with a device configured to implement a classifier, which may be at least one of devices 2403 or 2405 in some cases.
The system of fig. 24 may also include at least one classifier. For example, device 2403 (and/or device 2405) may include a classifier. Alternatively or additionally, the classifier may be implemented by another device that may be configured to communicate with devices 2403 and/or 2405. In some examples, the classifier may be implemented by another local device (e.g., a device within environment 2409), while in other examples, the classifier may be implemented by a remote device (e.g., a server) located outside of environment 2409.
In some implementations, a control system (e.g., control system 160 of fig. 1B) may be configured to implement a classifier, e.g., such as those disclosed herein. Alternatively or additionally, the control system 160 may be configured to determine an estimate of the user zone in which the user is currently located based at least in part on the output from the classifier.
Fig. 25A is a flowchart outlining one example of a method that may be performed by an apparatus such as that shown in fig. 1B. As with other methods described herein, the blocks of method 2500 are not necessarily performed in the order indicated. Moreover, such methods may include more or less blocks than those shown and/or described. In this embodiment, method 2500 involves estimating a user's location in an environment.
In this example, block 2505 involves receiving an output signal from each of a plurality of microphones in an environment. In this case, each of the plurality of microphones resides in a microphone location of the environment. According to this example, the output signal corresponds to a current utterance of the user measured during the programmed gap in playback of the content. For example, block 2505 may involve a control system (such as control system 160 of fig. 1B) receiving an output signal from each of a plurality of microphones in an environment via an interface system (such as interface system 155 of fig. 1B).
In some examples, at least some of the microphones in the environment may provide an output signal that is asynchronous with respect to an output signal provided by one or more other microphones. For example, a first microphone of the plurality of microphones may sample audio data according to a first sample clock and a second microphone of the plurality of microphones may sample audio data according to a second sample clock. In some cases, at least one microphone in the environment may be included in or configured for communication with the smart audio device.
According to this example, block 2510 involves determining a plurality of current acoustic features from the output signals of each microphone. In this example, the "current acoustic feature" is an acoustic feature derived from the "current utterance" of block 2505. In some implementations, block 2510 may involve receiving a plurality of current acoustic features from one or more other devices. For example, block 2510 may involve receiving at least some of the plurality of current acoustic features from one or more speech detectors implemented by one or more other devices. Alternatively or additionally, in some implementations, block 2510 may involve determining a plurality of current acoustic features from the output signal.
Whether the acoustic features are determined by a single device or by multiple devices, they may be determined asynchronously. If the acoustic features are determined by multiple devices, they will typically be determined asynchronously unless the devices are configured to coordinate the process of determining the acoustic features. If the acoustic features are determined by a single device, in some embodiments they may still be determined asynchronously, because the single device may receive the output signal of each microphone at different times. In some examples, the acoustic features may be determined asynchronously because at least some of the microphones in the environment may provide output signals that are asynchronous with respect to output signals provided by one or more other microphones.
In some examples, the acoustic features may include a speech confidence metric corresponding to speech measured during the programmed gap in the output playback signal.
Alternatively or additionally, the acoustic features may include one or more of the following:
Band power in a frequency band weighted for human speech. For example, the acoustic features may be based on only a particular frequency band (e.g., 400 Hz-1.5 kHz). In this example, the higher and lower frequencies may be ignored.
Per-band or per-interval voice activity detector confidence in the frequency bands or intervals corresponding to the gaps scheduled in the playback content.
The acoustic features may be based at least in part on a long-term noise estimate, in order to ignore microphones with poor signal-to-noise ratios.
Kurtosis as a measure of the peakedness of speech. Kurtosis may be an indicator of a long reverberation tail.
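By way of illustration only, the following Python sketch computes two of the acoustic features listed above (band power in a 400 Hz-1.5 kHz speech-weighted band, and kurtosis) from a block of microphone samples. The band edges follow the example above; the filter order, the sampling rate, and the idea of stacking the values into a feature vector are assumptions made for this sketch rather than requirements of any disclosed implementation.

import numpy as np
from scipy.signal import butter, sosfilt
from scipy.stats import kurtosis

def speech_band_power(x, fs, lo=400.0, hi=1500.0):
    """Band power in a speech-weighted band (400 Hz - 1.5 kHz)."""
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    y = sosfilt(sos, x)
    return float(np.mean(y ** 2))

def sample_kurtosis(x):
    """Kurtosis as a rough measure of speech peakedness."""
    return float(kurtosis(x, fisher=False))

def features_for_zone_classifier(x, fs):
    # A hypothetical feature vector; a real system would also include, e.g.,
    # per-band voice-activity confidences measured during the scheduled gaps.
    return np.array([speech_band_power(x, fs), sample_kurtosis(x)])

# Example: 1 s of noise at 16 kHz
fs = 16000
x = np.random.randn(fs)
print(features_for_zone_classifier(x, fs))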
According to this example, block 2515 involves applying a classifier to the plurality of current acoustic features. In some such examples, applying the classifier may involve applying a model trained on previously determined acoustic features derived from a plurality of previous utterances made by the user in a plurality of user zones in the environment. Various examples are provided herein.
In some examples, the user zone may include a sink zone, a food preparation zone, a refrigerator zone, a dining zone, a sofa zone, a television zone, a bedroom zone, and/or a door zone. According to some examples, one or more of the user zones may be a predetermined user zone. In some such examples, one or more predetermined user zones may have been selectable by the user during the training process.
In some implementations, applying the classifier may involve applying a Gaussian mixture model trained on previous utterances. According to some such embodiments, applying the classifier may involve applying a Gaussian mixture model trained on one or more of a normalized speech confidence, a normalized average reception level, or a maximum reception level of the previous utterances. However, in alternative embodiments, applying the classifier may be based on a different model, such as one of the other models disclosed herein. In some cases, the model may be trained using training data in which the user zones are labeled. However, in some examples, applying the classifier involves applying a model trained using unlabeled training data, for which the user zones are unlabeled.
In some examples, the previous utterances may have been, or may have included, wakeword utterances. According to some such examples, the previous utterances and the current utterance may be utterances of the same wakeword.
In this example, block 2520 involves determining an estimate of the user zone in which the user is currently located based at least in part on the output from the classifier. In some such examples, the estimate may be determined without reference to the geometric positions of the plurality of microphones. For example, the estimate may be determined without reference to the coordinates of the respective microphones. In some examples, the estimate may be determined without estimating a geometric location of the user. However, in alternative embodiments, the position estimation may involve estimating the geometric position of one or more persons and/or one or more audio devices in the audio environment, e.g., relative to a reference coordinate system.
Some embodiments of method 2500 may involve selecting at least one speaker based on the estimated user zone. Some such implementations may involve controlling the at least one selected speaker to provide sound to the estimated user zone. Alternatively or additionally, some embodiments of the method 2500 may involve selecting at least one microphone based on the estimated user zone. Some such implementations may involve providing a signal output by the at least one selected microphone to a smart audio device.
FIG. 25B is a block diagram of elements of one example of an embodiment configured to implement a region classifier. According to this example, the system 2530 includes a plurality of loudspeakers 2534 distributed in at least a portion of an environment (e.g., such as the environment shown in fig. 21 or 24). In this example, the system 2530 includes a multi-channel loudspeaker renderer 2531. According to this embodiment, the outputs of the multi-channel loudspeaker renderer 2531 are used both as loudspeaker drive signals (speaker feeds for driving the loudspeakers 2534) and as echo references. In such an embodiment, the echo references are provided to the echo management subsystems 2533 via a plurality of loudspeaker reference channels 2532, the plurality of loudspeaker reference channels 2532 including at least some of the speaker feed signals output from the renderer 2531.
In such an embodiment, the system 2530 includes a plurality of echo management subsystems 2533. In this example, the renderer 2531, echo management subsystems 2533, wake-up word detectors 2536, and classifier 2537 are implemented via the example of the control system 160 described above with reference to FIG. 1B. According to this example, the echo management subsystems 2533 are configured to implement one or more echo suppression processes and/or one or more echo cancellation processes. In this example, each echo management subsystem 2533 provides a corresponding echo management output 2533A to one of the wake-up word detectors 2536. In each echo management output 2533A, echo has been attenuated relative to the input of the associated echo management subsystem 2533.
According to such an embodiment, the system 2530 includes N microphones 2535 (N is an integer) distributed in at least a portion of an audio environment (e.g., the audio environment shown in fig. 21 or 24). The microphones may include array microphones and/or spot microphones. For example, one or more intelligent audio devices located in an environment may include a microphone array. In this example, the output of microphone 2535 is provided as an input to echo management subsystem 2533. According to such an embodiment, each of the echo management subsystems 2533 captures an output of a single microphone 2535 or a single group or subset of microphones 2535.
In this example, the system 2530 includes a plurality of wake-up word detectors 2536. According to this example, each of the wake-up word detectors 2536 receives an audio output from one of the echo management subsystems 2533 and outputs a plurality of acoustic features 2536A. The acoustic features 2536A output from each echo management subsystem 2533 may include (but are not limited to): a wake word confidence metric, a wake word duration, and a reception level. Although three arrows depicting three acoustic features 2536A are shown as being output from each echo management subsystem 2533, more or fewer acoustic features 2536A may be output in alternative embodiments. Furthermore, although these three arrows are drawn arriving at the classifier 2537 along roughly vertical lines, this does not indicate that the classifier 2537 must receive the acoustic features 2536A from all wake-up word detectors 2536 simultaneously. As described elsewhere herein, in some cases, the acoustic features 2536A may be determined and/or provided to the classifier asynchronously.
According to such an embodiment, the system 2530 includes a region classifier 2537, which may also be referred to as classifier 2537. In this example, the classifier receives a plurality of features 2536A from the wake-up word detectors 2536 associated with a plurality (e.g., all) of the microphones 2535 in the environment. According to this example, the output 2538 of the region classifier 2537 corresponds to an estimate of the user zone in which the user is currently located. According to some such examples, the output 2538 may correspond to one or more posterior probabilities. In Bayesian terms, the estimate of the user zone in which the user is currently located may be, or may correspond to, a maximum a posteriori probability.
We next describe example implementations of a classifier, which in some examples may correspond to the region classifier 2537 of fig. 25B. Let x_i(n) denote the ith microphone signal, i = {1 … N} (i.e., the microphone signals x_i(n) are the outputs of the N microphones 2535). Processing the N signals x_i(n) in the echo management subsystems 2533 produces "clean" microphone signals e_i(n), i = {1 … N}, each at discrete time n. In this example, the clean signals e_i(n), referred to as 2533A in FIG. 25B, are fed to the wake-up word detectors 2536. Here, each wake-up word detector 2536 generates a feature vector w_i(j), referred to as 2536A in fig. 25B, where j = {1 … J} is the index corresponding to the jth wake-up word utterance. In this example, classifier 2537 takes the aggregated feature set W(j) = {w_1(j), …, w_N(j)} as input.
According to some embodiments, a set of zone labels C_k, for k = {1 … K}, may be defined, where K corresponds to the number of different user zones in the environment. For example, the user zones may include a sofa zone, a kitchen zone, a reading chair zone, and the like. Some examples may define more than one zone in a kitchen or other room. For example, the kitchen area may include a sink zone, a food preparation zone, a refrigerator zone, and a dining zone. Similarly, a living room area may include a sofa zone, a television zone, a reading chair zone, one or more doorway zones, and the like. The zone labels of these zones may be selected by the user, for example, during a training phase.
In some implementations, the classifier 2537 estimates the posterior probabilities p(C_k | W(j)) of the feature set W(j), for example by using a Bayesian classifier. The probability p(C_k | W(j)) indicates the probability that the user is in zone C_k for the jth utterance (for each zone C_k and each utterance), and is the output 2538 of the example classifier 2537.
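The following Python sketch illustrates one way such a classifier could be structured, assuming (as in some examples above) one Gaussian mixture model per user zone fitted to feature vectors from prior utterances, with p(C_k | W(j)) obtained by Bayes' rule from the per-zone likelihoods and priors. The class name, the use of scikit-learn, the diagonal covariances, and the empirical priors are assumptions of this sketch, not requirements of the disclosed implementations.

import numpy as np
from sklearn.mixture import GaussianMixture

class ZoneClassifier:
    def __init__(self, n_components=2):
        self.n_components = n_components
        self.models = {}   # zone label -> fitted GaussianMixture
        self.priors = {}   # zone label -> prior p(C_k)

    def fit(self, features_by_zone):
        """features_by_zone: dict mapping zone label -> (n_utterances, n_features) array."""
        total = sum(len(f) for f in features_by_zone.values())
        for zone, feats in features_by_zone.items():
            gmm = GaussianMixture(n_components=self.n_components,
                                  covariance_type="diag", random_state=0)
            self.models[zone] = gmm.fit(feats)
            self.priors[zone] = len(feats) / total

    def posteriors(self, w):
        """Return p(C_k | W(j)) for a single aggregated feature vector w."""
        log_post = {z: m.score_samples(w.reshape(1, -1))[0] + np.log(self.priors[z])
                    for z, m in self.models.items()}
        mx = max(log_post.values())
        unnorm = {z: np.exp(v - mx) for z, v in log_post.items()}
        s = sum(unnorm.values())
        return {z: v / s for z, v in unnorm.items()}

A maximum a posteriori estimate of the user zone is then simply the zone with the largest returned posterior, consistent with the description of output 2538 above.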
According to some examples, training data may be collected (e.g., for each user zone) by prompting the user to select or define a zone (e.g., a sofa zone). The training process may involve prompting the user to issue a training utterance, such as a wake word, in the vicinity of the selected or defined region. In the sofa region example, the training process may involve prompting the user to make a training utterance in the center and extreme edges of the sofa. The training process may involve prompting the user to repeat the training utterance several times at each location within the user region. The user may then be prompted to move to another user zone and continue until all of the designated user zones have been covered.
FIG. 26 presents a block diagram of one example of a system for orchestrated gap insertion. The system of fig. 26 includes an audio device 2601a, the audio device 2601a being an example of the apparatus 150 of fig. 1B and including a control system 160 configured to implement a noise estimation subsystem (noise estimator) 64, a noise compensation gain application subsystem (noise compensation subsystem) 62, and a forced gap application subsystem (forced gap applicator) 70. In this example, audio devices 2601b-2601n are also present in playback environment E. In this embodiment, each of the audio devices 2601B-2601n is an example of the apparatus 150 of fig. 1B, and each includes a control system configured to implement an example of the noise estimation subsystem 64, the noise compensation subsystem 62, and the forced gap application subsystem 70.
According to this example, the system of fig. 26 also includes orchestration device 2605, which is also an example of apparatus 150 of fig. 1B. In some examples, orchestration device 2605 may be an audio device of the playback environment, such as an intelligent audio device. In some such examples, orchestration device 2605 may be implemented via one of audio devices 2601a-2601 n. In other examples, orchestration device 2605 may be another type of device, such as a device referred to herein as a smart home hub. According to this example, orchestration device 2605 comprises a control system configured to receive noise estimates 2610a-2610n from audio devices 2601a-2601n and provide emergency signals 2615a-2615n to audio devices 2601a-2601n to control each respective instance of forced gap applicator 70. In this embodiment, each instance of the forced gap applicator 70 is configured to determine whether to insert a gap, and if so what type of gap, based on the emergency signals 2615a-2615 n.
According to this example, the audio devices 2601a-2601n are further configured to provide current gap data 2620a-2620n to the orchestration device 2605 indicating what gaps, if any, are being implemented by each of the audio devices 2601a-2601 n. In some examples, the current gap data 2620a-2620n may indicate a series of gaps and corresponding times (e.g., a start time and a time interval for each gap or all gaps) that the audio device is applying. In some implementations, the control system of orchestration device 2605 may be configured to maintain a data structure indicating, for example, the most recent gap data, which audio devices have received the most recent emergency signals, and so on. In the system of fig. 26, the forced gap application subsystem 70 operates in response to the emergency signals 2615a-2615n such that the orchestration device 2605 controls forced gap insertion based on the need for a gap in the playback signal.
According to some examples, the emergency signals 2615a-2615n may indicate a series of sets of urgency values [U_0, U_1, ... U_N], where N is a predetermined number of frequency bands (of the entire frequency range of the playback signal) in which the subsystem 70 may insert a forced gap (e.g., one forced gap in each band), and U_i is the urgency value for the ith band in which subsystem 70 may insert a forced gap. The urgency values of each set of urgency values (corresponding to a time) may be generated in accordance with any of the disclosed embodiments for determining urgency, and may indicate the urgency of inserting (at that time) a forced gap in each of the N bands (by subsystem 70).
In some implementations, the emergency signals 2615a-2615n may indicate a fixed (time-invariant) set of urgency values [U_0, U_1, ... U_N], determined by a probability distribution defining a gap insertion probability for each of the N frequency bands. According to some examples, the probability distribution is implemented with a pseudo-random mechanism, so that the results (the response of each instance of subsystem 70) are deterministic (e.g., the same) across all of the recipient audio devices 2601a-2601n. Thus, in response to such a fixed set of urgency values, subsystem 70 may be configured to insert fewer forced gaps (on average) in those bands having lower urgency values (i.e., lower probability values determined by the pseudo-random probability distribution) and more forced gaps (on average) in those bands having higher urgency values (i.e., higher probability values). In some embodiments, the emergency signals 2615a-2615n may indicate a series of sets of urgency values [U_0, U_1, ... U_N], for example a different set of urgency values for each different time in the series. Each such different set of urgency values may be determined by a different pseudo-random probability distribution for each of the different times.
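A minimal Python sketch of such a pseudo-random mechanism follows. It assumes that every recipient audio device derives its per-band gap decisions from the same shared seed and frame index, so the decisions are deterministic and identical across devices; the seed handling and the example probability values are assumptions of this sketch.

import numpy as np

def gap_decisions(urgency, frame_index, shared_seed=1234):
    """urgency: array of per-band probabilities [U_0 ... U_N] in [0, 1]."""
    # Same seed + frame index on every device -> identical draws everywhere.
    rng = np.random.default_rng(shared_seed + frame_index)
    draws = rng.random(len(urgency))
    return draws < np.asarray(urgency)   # True -> insert a forced gap in that band

urgency = [0.05, 0.2, 0.2, 0.5]          # higher value -> gaps inserted more often
print(gap_decisions(urgency, frame_index=42))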
We next describe methods for determining urgency values, or a signal (U) indicative of urgency values, which may be implemented in various embodiments of the disclosed pervasive listening methods.
The urgency value of a frequency band indicates the need to force a gap in that band. We present methods for determining urgency values U_k, where U_k represents the urgency of forced gap insertion in band k, and U represents the vector containing the urgency values for all of a set of B_count bands:

U = [U_0, U_1, U_2, ...].
The first strategy (sometimes referred to herein as method 1) determines fixed urgency values. This method is the simplest: it simply sets the urgency vector U to predetermined fixed values. When used with a fixed perceived freeness metric, this can be used to implement a system that randomly inserts forced gaps over time. Some such methods do not require time-dependent urgency values provided by pervasive listening applications. Thus:

U = [u_0, u_1, u_2, …, u_X]

where X = B_count and each value u_k (for k in the range k = 1 to k = B_count) represents a predetermined, fixed urgency value for the kth band. Setting all u_k to 1.0 expresses an equal degree of urgency in all bands.
The second strategy (sometimes referred to herein as method 2) determines urgency values that depend on the time elapsed since the previous gap occurred. In some embodiments, the urgency increases gradually over time and returns to a low value once a forced or existing gap causes an update of the pervasive listening result (e.g., an update of the background noise estimate).
Thus, the urgency value U_k in each frequency band (band k) may correspond to the duration (e.g., the number of seconds) since a gap was perceived (by the pervasive listener) in band k. In some examples, the urgency value U_k in each frequency band may be determined as follows:

U_k(t) = min(t - t_g, U_max)

where t_g represents the time at which the last gap in band k was seen, and U_max represents a tuning parameter that limits urgency to a maximum size. It should be noted that t_g may be updated based on gaps originally present in the playback content. For example, in noise compensation, the current noise conditions in the playback environment may determine what is considered a gap in the output playback signal. That is, the playback signal must be quieter for a gap to occur when the environment is quiet than when the environment is noisier. Also, when implementing a pervasive listening method that relies on detecting the presence or absence of a user's spoken utterance in the playback environment, the urgency of the frequency bands typically occupied by human speech is often more important.
A third strategy (sometimes referred to herein as method 3) determines event-based urgency values. In this context, "event-based" means depending on some event or activity (or need for information) outside the playback environment, or on some event or activity detected or inferred to have occurred in the playback environment. The urgency determined by the pervasive listening subsystem can change suddenly as new user behavior begins or as playback environment conditions change. For example, such changes may mean that one or more devices configured for pervasive listening need to observe background activity in order to make decisions, or need to quickly adjust the playback experience to accommodate new conditions, or may cause a change in the general urgency or in the desired density of, and time between, gaps in each band. Table 3 below provides a number of examples of contexts and scenarios and the corresponding event-based changes in urgency:
Table 3
A fourth strategy (sometimes referred to herein as method 4) uses a combination of two or more of methods 1, 2 and 3 to determine the urgency values. For example, methods 1, 2 and 3 may be combined into a joint strategy, represented by a general formula of the following type:

U_k(t) = u_k * min(t - t_g, U_max) * V_k

where u_k is a fixed, unitless weighting factor controlling the relative importance of each band, V_k is a scalar value that is modulated in response to changes in context or user behavior that require rapid changes in urgency, and t_g and U_max are defined as above. In some examples, the value V_k is expected to remain at a value of 1.0 under normal operation.
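The following Python sketch restates methods 1, 2 and 4 above as code, with U_max, the per-band weights u_k, and the modulation values V_k chosen arbitrarily for illustration.

import numpy as np

U_MAX = 30.0                       # tuning parameter limiting urgency (seconds)

def urgency_method1(b_count, value=1.0):
    return np.full(b_count, value)            # equal urgency in all bands

def urgency_method2(t, t_g):
    # t_g: per-band times at which the last gap was seen
    return np.minimum(t - np.asarray(t_g), U_MAX)

def urgency_method4(t, t_g, u_k, v_k):
    # Joint strategy: U_k(t) = u_k * min(t - t_g, U_max) * V_k
    return np.asarray(u_k) * urgency_method2(t, t_g) * np.asarray(v_k)

t_g = [10.0, 25.0, 29.5]                       # last-gap times per band
print(urgency_method4(t=30.0, t_g=t_g, u_k=[1.0, 1.0, 0.5], v_k=[1.0, 1.0, 1.0]))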
In some examples of multi-device contexts, the forced gap applicators of the intelligent audio devices of an audio environment may cooperate in an orchestrated manner to achieve accurate estimation of the ambient noise N. In some such embodiments, the determination of where forced gaps are introduced in time and frequency may be made by orchestration device 2605, implemented as a separate orchestration device (such as the device referred to elsewhere herein as a smart home hub). In some alternative implementations, the locations at which forced gaps are introduced in time and frequency may be determined by one of the intelligent audio devices acting as a leader (e.g., the intelligent audio device acting as orchestration device 2605).
In some implementations, the orchestration device 2605 may include a control system configured to receive the noise estimates 2610a-2610n and provide gap commands to the audio devices 2601a-2601n that may be based at least in part on the noise estimates 2610a-2610 n. In some such examples, orchestration device 2605 may provide a gap command instead of an emergency signal. According to some such embodiments, the forced gap applicator 70 need not determine whether to insert a gap based on the emergency signal, and if so what type of gap, but may instead simply act upon the gap command.
In some such embodiments, the gap command may indicate characteristics (e.g., a frequency range, or B_count, Z, t1, t2 and/or t3) of one or more particular gaps to be inserted and the time(s) at which the one or more particular gaps are to be inserted. For example, the gap command may indicate a series of gaps and corresponding time intervals, such as one of those shown in fig. 23B-23E and described above. In some examples, the gap command may indicate a data structure from which the receiving audio device may access the characteristics of a sequence of gaps to be inserted and the corresponding time intervals. The data structure may, for example, have been previously provided to the receiving audio device. In some such examples, orchestration device 2605 may include a control system configured to make urgency calculations in order to determine when, and what type of, gap command to send.
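As a purely illustrative sketch, a gap command of the kind described above might be represented as a small data structure such as the following (in Python). The field names are hypothetical; the actual gap parameters (e.g., B_count, Z, t1, t2 and/or t3) are defined elsewhere in this disclosure and need not map one-to-one onto these fields.

from dataclasses import dataclass, field
from typing import List

@dataclass
class GapCommand:
    band_indices: List[int]          # which frequency bands to attenuate
    depth_db: float                  # attenuation depth of the gap
    ramp_in_s: float                 # analogous to a t1-style ramp time
    hold_s: float                    # analogous to a t2-style hold time
    ramp_out_s: float                # analogous to a t3-style ramp time
    start_times_s: List[float] = field(default_factory=list)

cmd = GapCommand(band_indices=[3, 4], depth_db=-20.0,
                 ramp_in_s=0.05, hold_s=0.2, ramp_out_s=0.05,
                 start_times_s=[12.0, 27.0])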
According to some examples, the emergency signals may be estimated at least in part by the noise estimation element 64 of one or more of the audio devices 2601a-2601n and may be transmitted to orchestration device 2605. In some examples, the decision to schedule a forced gap at a particular frequency region and time location may be determined, at least in part, by an aggregation of these emergency signals from one or more of the audio devices 2601a-2601n. For example, a disclosed algorithm that makes a selection based on urgency may instead use a maximum urgency calculated using the emergency signals from multiple audio devices, e.g., urgency = max(urgency A, urgency B, urgency C, ...), where urgency A/B/C are understood to be the emergency signals of three separate example devices that implement noise compensation.
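For example, the aggregation rule mentioned above reduces to a per-band elementwise maximum, as in the following sketch (the values shown are arbitrary):

import numpy as np

urgency_a = np.array([0.1, 0.8, 0.2])
urgency_b = np.array([0.4, 0.1, 0.3])
urgency_c = np.array([0.2, 0.2, 0.9])

aggregate_urgency = np.maximum.reduce([urgency_a, urgency_b, urgency_c])
print(aggregate_urgency)   # [0.4, 0.8, 0.9]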
The noise compensation system (e.g., the system of fig. 26) may operate with weak or absent echo cancellation (e.g., when implemented as described in U.S. provisional patent application No.62/663,302, which is incorporated herein by reference), but is subject to content-dependent response times, particularly in the case of music, television, and movie content. The time it takes for the noise compensation system to respond to changes in the background noise profile in the playback environment can be very important to the user experience, sometimes even more important than the accuracy of the actual noise estimate. When playback content provides few or no gaps through which to glimpse the background noise, the noise estimate may remain unchanged even if the noise conditions change. While interpolation and extrapolation of missing values in the noise estimate spectrum is often helpful, large areas of the noise estimate spectrum may still become locked and stale.
Some embodiments of the fig. 26 system may be operable to provide a forced gap (in the playback signal) that occurs frequently enough that the background noise estimate (of noise estimator 64) may be updated frequently enough to respond to typical changes in background noise N in playback environment E. In some examples, subsystem 70 may be configured to introduce a forced gap in the compensated audio playback signal (having K channels, where K is a positive integer) output by noise compensation subsystem 62. Here, the noise estimator 64 may be configured to search for gaps in each channel of the compensated audio playback signal (including the mandatory gaps inserted by the subsystem 70) and generate a noise estimate of the frequency band (and time interval) in which the gaps occur. In this example, the noise estimator 64 of the audio device 2601a is configured to provide a noise estimate 2610a to the noise compensation subsystem 62. According to some examples, noise estimator 64 of audio device 2601a may be further configured to use the resulting information about the detected gap to generate (and provide to orchestration device 2605) an estimated emergency signal whose emergency value tracks the urgency of inserting a forced gap in the frequency band of the compensated audio playback signal.
In this example, the noise estimator 64 is configured to accept a microphone feed Mic (the output of the microphone M in the playback environment E) and a reference for the compensated audio playback signal (the input to the speaker system S in the playback environment E). According to this example, the noise estimate generated in subsystem 64 is provided to noise compensation subsystem 62, and noise compensation subsystem 62 applies compensation gains to the input playback signal 23 (from content source 22) to level each of its frequency bands to a desired playback level. In this example, the noise-compensated audio playback signal (output from subsystem 62) and the urgency metric for each band (indicated by the urgency signal output from orchestration device 2605) are provided to forced gap applicator 70, which forces gaps in the compensated playback signal (preferably according to an optimization procedure). Speaker feeds, each indicating the content of a different channel of the noise-compensated playback signal (output from the forced gap applicator 70), are provided to the speakers of speaker system S.
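The following sketch illustrates, in simplified form, the per-band compensation step described above: a band noise estimate obtained during gaps is compared against the current playback level in that band, and a bounded, boost-only gain is applied to hold a desired margin above the noise. The target margin, the gain limit, and the boost-only behavior are assumptions of this sketch rather than properties of subsystem 62.

import numpy as np

def compensation_gains_db(noise_db, playback_db, target_snr_db=15.0, max_boost_db=12.0):
    """Per-band gains that try to keep playback target_snr_db above the noise."""
    deficit = (np.asarray(noise_db) + target_snr_db) - np.asarray(playback_db)
    return np.clip(deficit, 0.0, max_boost_db)   # boost only, bounded

noise_db = [-50.0, -38.0, -45.0]       # per-band noise estimates from gaps
playback_db = [-30.0, -30.0, -30.0]    # current per-band playback levels
print(compensation_gains_db(noise_db, playback_db))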
While some embodiments of the system of fig. 26 may perform echo cancellation as an element of noise estimation that it performs, other embodiments of the system of fig. 26 do not perform echo cancellation. Thus, elements for realizing echo cancellation are not specifically shown in fig. 26.
In fig. 26, the time domain to frequency domain (and/or frequency domain to time domain) transformation of the signals is not shown, but the application of noise compensation gains (in subsystem 62), the analysis of the content for gap forcing (in the orchestration device 2605, noise estimator 64, and/or forced gap applicator 70), and the insertion of forced gaps (by forced gap applicator 70) may, for convenience, be implemented in the same transform domain, with the resulting output audio being re-synthesized into pulse code modulated (PCM) audio in the time domain, or further encoded for transmission, prior to playback. According to some examples, each participating device coordinates the forcing of such gaps using methods described elsewhere herein. In some such examples, the gaps introduced may be identical. In some examples, the introduced gaps may be synchronized.
By inserting gaps using the forced gap applicator 70 present on each participating device, the number of gaps in each channel of the compensated playback signal (output from the noise compensation subsystem 62 of the system of fig. 26) can be increased (relative to the number of gaps that would occur without the use of the forced gap applicator 70) in order to significantly reduce the requirements of any echo canceller implemented by the system of fig. 26, and in some cases even completely eliminate the need for echo cancellation.
In some disclosed embodiments, simple post-processing circuitry, such as time-domain peak limiting or speaker protection, may be implemented between the forced gap applicator 70 and the speaker system S. However, post-processing with the ability to boost and compress the speaker feeds is likely to cancel, or reduce the quality of, the forced gaps inserted by the forced gap applicator, so these types of post-processing are preferably implemented at some point in the signal processing path before the forced gap applicator 70.
FIGS. 27A and 27B are system block diagrams showing examples of elements of an orchestration device and elements of orchestrated audio devices according to some disclosed embodiments. As with the other figures provided herein, the types and numbers of elements shown in fig. 27A and 27B are provided by way of example only. Other embodiments may include more, fewer, different types, and/or different numbers of elements. In this example, orchestrated audio devices 2720a-2720n and orchestration device 2701 of fig. 27A and 27B are examples of the apparatus 150 described above with reference to fig. 1B.
According to such an embodiment, each of the orchestrated audio devices 2720a-2720n includes the following elements:
-2731: an example of a loudspeaker system 110 of fig. 1B, which includes one or more loudspeakers;
-2732: an example of the microphone system 111 of fig. 1B includes one or more microphones;
-2711: the audio playback signal output by rendering module 2721, which in this example is an example of rendering module 210A of fig. 2. According to this example, rendering module 2721 is controlled according to instructions from orchestration module 2702 and may also receive information and/or instructions from user region classifier 2705 and/or rendering configuration module 2707;
-2712: The noise-compensated audio playback signal output by noise compensation module 2730, which in this example is an example of noise compensation subsystem 62 of fig. 26;
-2713: The noise-compensated audio playback signal, including one or more gaps, output by an acoustic gap puncher 2722, which in this example is an example of the forced gap applicator 70 of fig. 26. In this example, the acoustic gap puncher 2722 is controlled according to instructions from the orchestration module 2702;
-2714: the modified audio playback signal output by calibration signal injector 2723, which in this example is an example of calibration signal injector 211A of fig. 2;
-2715: the calibration signal output by calibration signal generator 2725, which in this example is an example of calibration signal generator 212A of fig. 2;
-2716: A calibration signal replica corresponding to calibration signals generated by other audio devices of the audio environment (in this example, by one or more of audio devices 2720b-2720n). Calibration signal replica 2716 may be, for example, an example of calibration signal replica 204A described above with reference to fig. 2. In some examples, calibration signal replica 2716 may be received from orchestration device 2701 (e.g., via a wireless communication protocol such as Wi-Fi or Bluetooth™);
-2717: Control information related to and/or used by one or more audio devices in an audio environment. In this example, control information 2717 is provided by orchestration device 2701 (e.g., by orchestration module 2702) described below with reference to fig. 27B. Control information 2717 may, for example, include an example of calibration information 205A described above with reference to fig. 2, or an example of calibration signal parameters disclosed elsewhere herein. Control information 2717 may include parameters used by control system 160n to generate calibration signals, modulate calibration signals, demodulate calibration signals, and the like. In some examples, control information 2717 may include one or more DSSS spreading code parameters and one or more DSSS carrier parameters. In some examples, the control information 2717 may include information for controlling the rendering module 2721, the noise compensation module 2730, the acoustic gap puncher 2722, and/or the baseband processor 2729;
-2718: microphone signals received by microphone(s) 2732;
-2719: demodulating the coherent baseband signal, which may be an example of demodulating the coherent baseband signals 208 and 208A described above with reference to fig. 2-4 and 17;
-2721: a rendering module configured to render audio signals of a content stream, such as audio data of music, movies, and TV programs, etc., to generate audio playback signals;
-2723: A calibration signal injector configured to insert the calibration signal 2715a modulated by the calibration signal modulator 2724 (or, in some cases where the calibration signal does not require modulation, the calibration signal 2715 generated by the calibration signal generator 2725) into the audio playback signal generated by the rendering module 2721 (which in this example has been modified by the noise compensation module 2730 and the acoustic gap puncher 2722) to generate a modified audio playback signal 2714. The insertion process may be, for example, a mixing process in which the calibration signal 2715 or 2715a is mixed with the audio playback signal produced by rendering module 2721 (which in this example has been modified by noise compensation module 2730 and acoustic gap puncher 2722) to produce a modified audio playback signal 2714 (a sketch of this injection, and of the corresponding demodulation, follows this list of elements);
-2724: an optional calibration signal modulator configured to modulate the calibration signal 2715 generated by the calibration signal generator 2725 to produce a modulated calibration signal 2715a;
-2725: A calibration signal generator configured to generate a calibration signal 2715 and, in this example, provide the calibration signal 2715 to the calibration signal modulator 2724 and the baseband processor 2729. In some examples, calibration signal generator 2725 may be an example of calibration signal generator 212A described above with reference to fig. 2. According to some examples, calibration signal generator 2725 may include a spreading code generator and a carrier generator, e.g., as described above with reference to fig. 17. In this example, calibration signal generator 2725 also provides a calibration signal replica 2715 to the baseband processor 2729 and the calibration signal demodulator 2726;
-2726: a calibration signal demodulator configured to demodulate microphone signals 2718 received by microphone(s) 2732. In some examples, calibration signal demodulator 2726 may be an example of calibration signal demodulator 212A described above with reference to fig. 2. In this example, calibration signal demodulator 2726 outputs a demodulated coherent baseband signal 2719. Demodulation of microphone signal 2718 may be performed using standard correlation techniques, including integrating and dumping matched filter correlator banks, for example. Some detailed examples are provided herein. To improve the performance of these demodulation techniques, in some embodiments, the microphone signal 2718 may be filtered prior to demodulation to remove unwanted content/phenomena. According to some embodiments, the demodulated coherent baseband signal 2719 may be filtered before or after being provided to baseband processor 2729. The signal-to-noise ratio (SNR) generally increases with increasing integration time (e.g., with increasing length of the spreading code used to generate the calibration signal);
-2729: A baseband processor configured to baseband process the demodulated coherent baseband signal 2719. In some examples, baseband processor 2729 may be configured to implement techniques such as non-coherent averaging to improve SNR by reducing the variance of the squared waveform to produce a delayed waveform. Some detailed examples are provided herein. In this example, baseband processor 2729 is configured to output one or more estimated acoustic scene metrics 2733;
-2730: and a noise compensation module configured to compensate for noise in the audio environment. In this example, noise compensation module 2730 compensates for noise in audio playback signal 2711 output by rendering module 2721 based at least in part on control information 2717 from orchestration module 2702. In some implementations, the noise compensation module 2730 can be configured to compensate for noise in the audio playback signal 2711 based at least in part on one or more acoustic scene metrics 2733 (e.g., noise information) provided by the baseband processor 2729; and
-2733n: One or more observations that the audio device 2720n derives from, for example, a calibration signal extracted from the microphone signal (e.g., from the demodulated coherent baseband signal 2719) and/or from wake-up word information 2734 provided by the wake-up word detector 2727. These observations are also referred to herein as acoustic scene metrics. The acoustic scene metric(s) 2733 may include or may be wake word metrics, or data corresponding to time of flight, time of arrival, range, audio device audibility, audio device impulse response, angle between audio devices, audio device location, audio environment noise and/or signal-to-noise ratio. In this example, the orchestrated audio devices 2720a-2720n determine acoustic scene metrics 2733a-2733n, respectively, and provide the acoustic scene metrics 2733a-2733n to orchestration device 2701.
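As referenced in the description of elements 2723-2726 above, the following Python sketch illustrates the inject-and-demodulate chain in a highly simplified form: a low-level pseudo-random calibration sequence is mixed into the playback signal, and the receive side correlates the microphone signal against a replica of that sequence (an integrate-and-dump style correlation). The chip sequence, the injection level, and the omission of carrier modulation and of the baseband processing of element 2729 are simplifications made for this sketch.

import numpy as np

rng = np.random.default_rng(7)
code = rng.choice([-1.0, 1.0], size=4096)        # pseudo-random spreading code

def inject_calibration(playback, code, level_db=-30.0):
    cal = code * (10.0 ** (level_db / 20.0))
    n = min(len(playback), len(cal))
    return playback[:n] + cal[:n]                # mixing = simple addition here

def correlate_with_replica(mic, code):
    # "Integrate and dump" style correlation against the known replica.
    n = min(len(mic), len(code))
    return float(np.dot(mic[:n], code[:n]) / n)

playback = 0.1 * rng.standard_normal(4096)       # stand-in for rendered content
mic = inject_calibration(playback, code)         # pretend the mic hears it directly
print(correlate_with_replica(mic, code))         # ~0.0316 when the calibration signal is present
print(correlate_with_replica(playback, code))    # ~0 when it is absent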
According to such an embodiment, orchestration device 2701 comprises the following elements:
-2702: an orchestration module configured to control various functions of orchestrated audio devices 2720a-2720n, including, but not limited to, gap insertion and calibration signal generation in this example. In some implementations, orchestration module 2702 may provide one or more of the various functions of the orchestration device disclosed herein. Thus, orchestration module 2702 may provide information for controlling one or more aspects of audio processing and/or playback of the audio device. For example, orchestration module 2702 may provide calibration signal parameters to calibration signal generators 2725 (and in this example modulators 2724 and demodulators 2726) of orchestrated audio devices 2720a-2720 n. The orchestration module 2702 may provide gap insertion information to the acoustic gap punches 2722 of the orchestrated audio devices 2720a-2720 n. Orchestration module 2702 may provide instructions for coordinating gap insertion and calibration signal generation. Orchestration module 2702 (and in some examples other modules of orchestration device 2701, such as user region classifier 2705 and rendering configuration generator 2707 in this example) may provide instructions for controlling rendering module 2721;
-2703: a geometric proximity estimator configured to estimate a current location, and in some examples a current orientation, of an audio device in an audio environment. In some examples, geometric proximity estimator 2703 may be configured to estimate a current location (and in some cases a current orientation) of one or more persons in the audio environment. Some examples of the geometric proximity estimator function are described below with reference to fig. 41 and the like;
-2704: an audio device audibility estimator may be configured to estimate audibility of one or more loudspeakers in or near an audio environment at any location, such as audibility at a current estimated location of a listener. Some examples of audio device audibility estimator functions are described below with reference to fig. 31 and the like. (see, e.g., fig. 32 and corresponding description);
-2705: a user zone classifier configured to estimate a zone of an audio environment (e.g., sofa zone, dining table zone, refrigerator zone, reading chair zone, etc.) in which a person is currently located. In some examples, user zone classifier 2705 may be an example of zone classifier 2537 whose functionality is described above with reference to fig. 25A and 25B;
-2706: a noise audibility estimator configured to estimate noise audibility at any location, such as the audibility of a listener in an audio environment at a current estimated location. Some examples of audio device audibility estimator functions are described below with reference to fig. 31 and the like. (see, e.g., fig. 33 and 34, and corresponding descriptions). In some examples, noise audibility estimator 2706 may estimate noise audibility by interpolating aggregated noise data 2740 from aggregator 2708. The aggregated noise data 2740 may be obtained from a plurality of audio devices of the audio environment (e.g., by a plurality of baseband processors 2729 and/or other modules implemented by a control system of the audio device), for example, by "fully listening" to gaps in the audio data that have been inserted into playback, as described above with reference to fig. 21, etc., to evaluate noise conditions in the audio environment;
-2707: a rendering configuration generator configured to generate a rendering configuration in response to relative positions (and, in this example, relative audibility) of the audio device and one or more listeners in the audio environment. Rendering configuration generator 2707 may, for example, provide functionality such as described below with reference to fig. 51, etc.;
-2708: An aggregator configured to aggregate acoustic scene metrics 2733a-2733n received from orchestrated audio devices 2720a-2720n and provide the aggregated acoustic scene metrics (in this example, aggregated acoustic scene metrics 2735-2740) to acoustic scene metric processing module 2728 and other modules of orchestration device 2701. The estimates of the acoustic scene metrics from the baseband processor modules of the orchestrated audio devices 2720a-2720n will typically arrive asynchronously, so the aggregator 2708 is configured to collect acoustic scene metric data over time, store the acoustic scene metric data in memory (e.g., in a buffer), and pass it to subsequent processing blocks at appropriate times (e.g., after receiving the acoustic scene metric data from all orchestrated audio devices), as sketched in the example following this list of elements. In this example, aggregator 2708 is configured to provide aggregated audibility data 2735 to orchestration module 2702 and audio device audibility estimator 2704. In such an embodiment, aggregator 2708 is configured to provide aggregated noise data 2740 to orchestration module 2702 and noise audibility estimator 2706. According to such an embodiment, the aggregator 2708 provides aggregated direction of arrival (DOA) data 2736, aggregated time of arrival (TOA) data 2737, and aggregated impulse response (IR) data 2738 to the orchestration module 2702 and the geometric proximity estimator 2703. In this example, aggregator 2708 provides aggregated wake word metrics 2739 to orchestration module 2702 and user region classifier 2705; and
-2728: an acoustic scene metric processing module configured to receive and apply aggregated acoustic scene metrics 2735-2739. According to this example, the acoustic scene metric processing module 2728 is a component of the orchestration module 2702, while in alternative examples, the acoustic scene metric processing module 2728 may not be a component of the orchestration module 2702. In this example, the acoustic scene metric processing module 2728 is configured to generate information and/or commands based at least in part on at least one of the aggregate acoustic scene metrics 2735-2739 and/or at least one audio device characteristic. The audio device characteristic(s) may be one or more characteristics of one or more of the orchestrated audio devices 2720a-2720 n. The audio device characteristic(s) may be stored, for example, in a memory of control system 160 or in a memory accessible to control system 160 of orchestration device 2701.
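As referenced in the description of element 2708 above, the following sketch shows one way the aggregator could buffer asynchronously arriving acoustic scene metrics and release them once every expected orchestrated audio device has reported. The device identifiers and the flush policy are assumptions of this sketch.

from collections import defaultdict

class AcousticSceneMetricAggregator:
    def __init__(self, expected_devices):
        self.expected = set(expected_devices)
        self.buffer = defaultdict(dict)   # metric name -> {device id: value}

    def submit(self, device_id, metric_name, value):
        self.buffer[metric_name][device_id] = value

    def ready(self, metric_name):
        return self.expected.issubset(self.buffer[metric_name].keys())

    def flush(self, metric_name):
        data = dict(self.buffer[metric_name]) if self.ready(metric_name) else None
        if data is not None:
            self.buffer[metric_name].clear()
        return data

agg = AcousticSceneMetricAggregator(["2720a", "2720b"])
agg.submit("2720a", "doa_degrees", 41.0)
agg.submit("2720b", "doa_degrees", 143.0)
print(agg.flush("doa_degrees"))   # {'2720a': 41.0, '2720b': 143.0}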
In some implementations, orchestration device 2701 may be implemented in an audio device, such as a smart audio device. In such an embodiment, orchestration device 2701 may include one or more microphones and one or more loudspeakers.
Cloud processing
In some implementations, the orchestrated audio devices 2720a-2720n mainly include real-time processing blocks that run locally due to the high data bandwidth and low processing latency requirements. However, in some examples, baseband processor 2729 may reside in the cloud (e.g., may be implemented via one or more servers) because in some examples, the output of baseband processor 2729 may be calculated asynchronously. According to some embodiments, the blocks of orchestration device 2701 may all reside in the cloud. In some alternative implementations, blocks 2702, 2703, 2708, and 2705 may be implemented on a local device (e.g., a device in the same audio environment as orchestrated audio devices 2720a-2720 n), as these blocks preferably operate in real-time or near real-time. However, in some such embodiments, blocks 2703, 2704, and 2707 may operate via cloud services.
Fig. 28 is a flowchart outlining another example of the disclosed audio device orchestration methods. As with other methods described herein, the blocks of method 2800 are not necessarily performed in the order indicated. Furthermore, such methods may include more or fewer blocks than those shown and/or described. The method 2800 may be performed by an orchestration device, such as orchestration device 2701 described above with reference to fig. 27B. Method 2800 involves controlling orchestrated audio devices, such as some or all of the orchestrated audio devices 2720a-2720n described above with reference to fig. 27A.
According to this example, block 2805 involves causing, by the control system, a first audio device of the audio environment to generate a first calibration signal. For example, a control system of an orchestration device, such as orchestration device 2701, may be configured to cause a first orchestrated audio device of the audio environment (e.g., orchestrated audio device 2720 a) to generate a first calibration signal in block 2805.
In this example, block 2810 involves causing, by the control system, a first calibration signal to be inserted into a first audio playback signal corresponding to a first content stream to generate a first modified audio playback signal for the first audio device. For example, orchestration device 2701 may be configured to cause orchestrated audio device 2720a to insert a first calibration signal into a first audio playback signal corresponding to the first content stream to generate a first modified audio playback signal for orchestrated audio device 2720 a.
According to this example, block 2815 involves causing, by the control system, the first audio device to play back the first modified audio playback signal to generate a first audio device playback sound. For example, orchestration device 2701 may be configured to cause orchestrated audio device 2720a to play back the first modified audio playback signal on loudspeaker(s) 2731 to generate a first orchestrated audio device playback sound.
In this example, block 2820 relates to causing, by the control system, a second audio device of the audio environment to generate a second calibration signal. For example, orchestration device 2701 may be configured to cause orchestrated audio device 2720b to generate a second calibration signal.
According to this example, block 2825 relates to causing, by the control system, a second calibration signal to be inserted into the second content stream to generate a second modified audio playback signal for the second audio device. For example, orchestration device 2701 may be configured to cause orchestrated audio device 2720b to insert a second calibration signal into the second content stream to generate a second modified audio playback signal for orchestrated audio device 2720 b.
In this example, block 2830 relates to causing, by the control system, the second audio device to play back the second modified audio playback signal to generate a second audio device playback sound. For example, orchestration device 2701 may be configured to cause orchestrated audio device 2720b to play back the second modified audio playback signal on loudspeaker(s) 2731 to generate a second orchestrated audio device playback sound.
According to this example, block 2835 relates to causing, by a control system, at least one microphone of an audio environment to detect at least a first audio device playback sound and a second audio device playback sound and generate microphone signals corresponding to at least the first audio device playback sound and the second audio device playback sound. In some examples, the microphone may be a microphone of an orchestration device. In other examples, the microphone may be a microphone of an orchestrated audio device. For example, the orchestration device 2701 may be configured to cause one or more of the orchestrated audio devices 2720a-2720n to detect at least a first orchestrated audio device playback sound and a second orchestrated audio device playback sound using at least one microphone and to generate microphone signals corresponding to at least the first orchestrated audio device playback sound and the second orchestrated audio device playback sound.
In this example, block 2840 relates to extracting, by a control system, a first calibration signal and a second calibration signal from a microphone signal. For example, orchestration device 2701 may be configured to cause one or more of orchestrated audio devices 2720a-2720n to extract a first calibration signal and a second calibration signal from the microphone signals.
According to this example, block 2845 relates to estimating, by a control system, at least one acoustic scene metric based at least in part on the first calibration signal and the second calibration signal. For example, orchestration device 2701 may be configured to cause one or more of orchestrated audio devices 2720a-2720n to estimate at least one acoustic scene metric based at least in part on the first calibration signal and the second calibration signal. Alternatively or additionally, in some examples, orchestration device 2701 may be configured to estimate acoustic scene metric(s) based at least in part on the first calibration signal and the second calibration signal.
The particular acoustic scene metric(s) estimated in method 2800 may vary depending on the particular implementation. In some examples, the acoustic scene metric(s) may include one or more of time of flight, time of arrival, direction of arrival, range, audio device audibility, audio device impulse response, angle between audio devices, audio device location, audio environmental noise, or signal-to-noise ratio.
In some examples, the first calibration signal may correspond to a first sub-audible component of the first audio device playback sound and the second calibration signal may correspond to a second sub-audible component of the second audio device playback sound.
In some cases, the first calibration signal may be or may include a first DSSS signal and the second calibration signal may be or may include a second DSSS signal. However, the first and second calibration signals may be any suitable type of calibration signal, including but not limited to the specific examples disclosed herein.
According to some examples, the first content stream component of the first orchestrated audio device playback sound may result in a perceptual masking of the first calibration signal component of the first orchestrated audio device playback sound, and the second content stream component of the second orchestrated audio device playback sound may result in a perceptual masking of the second calibration signal component of the second orchestrated audio device playback sound.
In some implementations, the method 2800 may involve causing, by the control system, a first gap to be inserted into a first frequency range of the first audio playback signal or the first modified audio playback signal during a first time interval of the first content stream such that the first modified audio playback signal and the first audio device playback sound include the first gap. The first gap may correspond to an attenuation of the first audio playback signal in the first frequency range. For example, orchestration device 2701 may be configured to cause orchestrated audio device 2720a to insert a first gap into a first frequency range of a first audio playback signal or a first modified audio playback signal during a first time interval.
According to some implementations, the method 2800 may include causing, by the control system, the first gap to be inserted into a first frequency range of the second audio playback signal or the second modified audio playback signal during the first time interval such that the second modified audio playback signal and the second audio device playback sound include the first gap. For example, orchestration device 2701 may be configured to cause orchestrated audio device 2720b to insert a first gap into a first frequency range of a second audio playback signal or a second modified audio playback signal during a first time interval.
In some implementations, the method 2800 can include causing, by the control system, extraction of audio data from the microphone signal at least in a first frequency range to produce extracted audio data. For example, orchestration device 2701 may cause one or more of orchestrated audio devices 2720a-2720n to extract audio data from the microphone signals at least in a first frequency range to produce extracted audio data.
According to some implementations, the method 2800 may involve estimating, by the control system, at least one acoustic scene metric based at least in part on the extracted audio data. For example, orchestration device 2701 may cause one or more of orchestrated audio devices 2720a-2720n to estimate at least one acoustic scene metric based at least in part on the extracted audio data. Alternatively or additionally, in some examples, orchestration device 2701 may be configured to estimate acoustic scene metric(s) based at least in part on the extracted audio data.
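The following Python sketch illustrates, under assumed STFT parameters and an assumed attenuation depth, the mechanics described in the preceding paragraphs: a first frequency range of the playback signal is attenuated during a first time interval (the inserted gap), and audio data is then extracted from the microphone signal in that same frequency range and time interval to support a background noise estimate.

import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
playback = 0.2 * np.sin(2 * np.pi * 440 * t)                 # stand-in content

f, tt, Z = stft(playback, fs=fs, nperseg=512)
band = (f >= 300) & (f <= 800)                                # first frequency range
interval = (tt >= 0.4) & (tt <= 0.6)                          # first time interval
Z[np.ix_(band, interval)] *= 10 ** (-30 / 20)                 # insert the gap (-30 dB)
_, playback_with_gap = istft(Z, fs=fs, nperseg=512)

mic = playback_with_gap[:fs] + 0.01 * np.random.randn(fs)     # playback sound plus room noise
fm, tm, M = stft(mic, fs=fs, nperseg=512)
extracted = M[np.ix_(band, interval)]                         # audio data in the gap
noise_estimate = np.mean(np.abs(extracted) ** 2)              # crude band noise power
print(noise_estimate)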
The method 2800 may involve controlling both gap insertion and calibration signal generation. In some examples, method 2800 may involve controlling gap insertion and/or calibration signal generation such that, in some cases, the perceived level of reproduced audio content at a user location is maintained under varying noise conditions (e.g., a varying noise spectrum). According to some examples, method 2800 may involve controlling the calibration signal generation such that the signal-to-noise ratio of the calibration signal is maximized. Method 2800 may involve controlling the calibration signal generation to ensure that the calibration signal is not heard by the user, even under varying audio content and noise conditions.
In some examples, method 2800 may involve controlling the gap insertion to vacate time-frequency blocks such that neither content nor calibration signals are present during the inserted gaps, allowing background noise to be estimated. Thus, in some examples, method 2800 may involve controlling gap insertion and calibration signal generation such that the calibration signal corresponds to neither a gap time interval nor a gap frequency range. For example, orchestration device 2701 may be configured to control gap insertion and calibration signal generation such that the calibration signal corresponds to neither gap time intervals nor gap frequency ranges.
According to some examples, method 2800 may involve controlling gap insertion and calibration signal generation based at least in part on a time since noise was estimated in at least one frequency band. For example, orchestration device 2701 may be configured to control gap insertion and calibration signal generation based at least in part on time since noise was estimated in at least one frequency band.
In some examples, method 2800 may involve controlling gap insertion and calibration signal generation based at least in part on a signal-to-noise ratio of a calibration signal of at least one audio device in at least one frequency band. For example, orchestration device 2701 may be configured to control gap insertion and calibration signal generation based at least in part on a signal-to-noise ratio of a calibration signal of at least one orchestrated audio device in at least one frequency band.
According to some implementations, the method 2800 may involve causing the target audio device to play back an unmodified audio playback signal of the target device content stream to generate a target audio device playback sound. In some such examples, method 2800 may involve estimating at least one of target audio device audibility or target audio device location based at least in part on the extracted audio data. In some such embodiments, the unmodified audio playback signal does not include the first gap. In some such examples, the microphone signal also corresponds to target audio device playback sound. According to some such examples, the unmodified audio playback signal does not include gaps that are inserted into any frequency range.
For example, orchestration device 2701 may be configured to cause a target orchestrated audio device of orchestrated audio devices 2720a-2720n to play back unmodified audio playback signals of a target device content stream to generate target orchestrated audio device playback sounds. In one example, if the target audio device is an orchestrated audio device 2720a, orchestration device 2701 will cause orchestrated audio device 2720a to play back unmodified audio playback signals of the target device content stream to generate target orchestrated audio device playback sounds. The orchestration device 2701 may be configured to cause at least one of the audibility of the target orchestrated audio device or the location of the target orchestrated audio device to be estimated by at least one of the other orchestrated audio devices (in the previous example, one or more of the orchestrated audio devices 2720b-2720 n) based at least in part on the extracted audio data. Alternatively or additionally, in some examples, orchestration device 2701 may be configured to estimate target orchestrated audio device audibility and/or target orchestrated audio device location based at least in part on the extracted audio data.
In some examples, method 2800 may involve controlling one or more aspects of audio device playback based at least in part on acoustic scene metric(s). For example, orchestration device 2701 may be configured to control rendering module 2721 of one or more of orchestrated audio devices 2720b-2720n based at least in part on the acoustic scene metric(s). In some implementations, orchestration device 2701 may be configured to control noise compensation module 2730 of one or more of orchestrated audio devices 2720b-2720n based at least in part on the acoustic scene metric(s).
According to some implementations, the method 2800 may involve causing, by the control system, third through nth calibration signals to be generated by third through nth audio devices of the audio environment and causing, by the control system, the third through nth calibration signals to be inserted into the third through nth content streams to generate third through nth modified audio playback signals for the third through nth audio devices. In some examples, the method 2800 may involve causing, by the control system, the third through nth audio devices to play back corresponding instances of the third through nth modified audio playback signals to generate third through nth instances of audio device playback sound. For example, the orchestration device 2701 may be configured to cause the orchestrated audio devices 2720c-2720N to generate third through nth calibration signals and insert the third through nth calibration signals into the third through nth content streams to generate third through nth modified audio playback signals for the orchestrated audio devices 2720 c-2720N. The orchestration device 2701 may be configured to cause the orchestrated audio devices 2720c-2720N to play back corresponding instances of the third through nth modified audio playback signals to generate third through nth instances of audio device playback sound.
In some examples, the method 2800 may involve causing, by the control system, at least one microphone of each of the first through nth audio devices to detect first through nth instances of audio device playback sound and generate microphone signals corresponding to the first through nth instances of audio device playback sound. In some cases, the first through nth instances of the audio device playback sound may include the first audio device playback sound, the second audio device playback sound, and the third through nth instances of the audio device playback sound. According to some examples, the method 2800 may involve extracting, by the control system, first through nth calibration signals from the microphone signal. The acoustic scene metric(s) may be estimated based at least in part on the first through nth calibration signals.
For example, orchestration device 2701 may be configured to cause at least one microphone of some or all of orchestrated audio devices 2720a-2720N to detect first through nth instances of audio device playback sound, and to generate microphone signals corresponding to the first through nth instances of audio device playback sound. The orchestration device 2701 may be configured to cause some or all of the orchestrated audio devices 2720a-2720N to extract first through nth calibration signals from the microphone signals. Some or all of the orchestrated audio devices 2720a-2720N may be configured to estimate acoustic scene metric(s) based at least in part on the first through nth calibration signals. Alternatively or additionally, orchestration device 2701 may be configured to estimate acoustic scene metric(s) based at least in part on the first through nth calibration signals.
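The following is a minimal sketch, not taken from this disclosure, of how a calibration signal might be extracted from a microphone signal when the calibration signals are DSSS signals: each orchestrated device correlates the (echo-cancelled) microphone signal against every other device's known spreading code. The function and variable names, and the use of a simple correlation peak as a level proxy, are illustrative assumptions.

```python
# Hypothetical sketch: extracting DSSS calibration signals from a microphone
# signal by correlating against each device's known spreading code.  Code
# lengths, sample rates and variable names are illustrative assumptions.
import numpy as np

def extract_calibration_levels(mic_signal, spreading_codes):
    """Correlate the microphone signal with each device's spreading code and
    return a per-device correlation peak (a crude proxy for received level).

    mic_signal      : 1-D array of (echo-cancelled) microphone samples.
    spreading_codes : dict mapping device id -> +/-1 chip sequence resampled
                      to the microphone sample rate.
    """
    peaks = {}
    for device_id, code in spreading_codes.items():
        # Full cross-correlation; the peak magnitude indicates how strongly
        # this device's calibration signal is present in the microphone feed.
        corr = np.correlate(mic_signal, code, mode="valid")
        peaks[device_id] = np.max(np.abs(corr)) / len(code)
    return peaks
```

In practice, such a correlation would typically be performed per frequency band and integrated over the coherent and incoherent integration times discussed elsewhere herein.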
According to some implementations, the method 2800 may involve determining one or more calibration signal parameters for a plurality of audio devices in an audio environment. One or more calibration signal parameters may be used for the generation of the calibration signal. The method 2800 may involve providing one or more calibration signal parameters to one or more programmed audio devices of an audio environment. For example, orchestration device 2701 (in some cases, orchestration module 2702 of orchestration device 2701) may be configured to determine one or more calibration signal parameters of one or more of orchestrated audio devices 2720a-2720n and provide the one or more calibration signal parameters to the orchestrated audio device(s).
In some examples, determining the one or more calibration signal parameters may involve scheduling a time slot for playback of the modified audio playback signal for each of the plurality of audio devices. In some cases, the first time slot of the first audio device may be different from the second time slot of the second audio device.
According to some embodiments, determining the one or more calibration signal parameters may involve determining a frequency band for playback of the modified audio playback signal for each of the plurality of audio devices. In some examples, the first frequency band of the first audio device may be different from the second frequency band of the second audio device.
In some examples, determining the one or more calibration signal parameters may involve determining a DSSS spreading code for each of a plurality of audio devices. According to some examples, the first spreading code of the first audio device may be different from the second spreading code of the second audio device. According to some implementations, the method 2800 may involve determining at least one spreading code length based at least in part on audibility of a corresponding audio device.
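As one hedged illustration of how mutually orthogonal spreading codes might be assigned to multiple audio devices, the sketch below draws per-device codes from the rows of a Sylvester-constructed Hadamard matrix. The code family, code length, and names are assumptions for illustration; the disclosure does not mandate any particular construction.

```python
# Hypothetical sketch: assigning mutually orthogonal +/-1 spreading codes to a
# set of orchestrated audio devices using a Sylvester-constructed Hadamard
# matrix.  This particular code family is an illustrative assumption.
import numpy as np

def walsh_hadamard_codes(num_devices, code_length):
    """Return one +/-1 spreading code per device, drawn from the rows of a
    Hadamard matrix.  code_length must be a power of two and at least
    num_devices + 1 (row 0, the all-ones row, is skipped)."""
    assert (code_length & (code_length - 1)) == 0 and code_length >= num_devices + 1
    H = np.array([[1]])
    while H.shape[0] < code_length:
        H = np.block([[H, H], [H, -H]])   # Sylvester doubling step
    # Skip the all-ones row; the remaining rows are mutually orthogonal.
    return {dev: H[dev + 1] for dev in range(num_devices)}

# Example: four devices sharing one time-frequency block with length-8 codes.
codes = walsh_hadamard_codes(num_devices=4, code_length=8)
```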
In some implementations, determining the one or more calibration signal parameters may involve applying an acoustic model that is based at least in part on the mutual audibility of each of a plurality of audio devices in the audio environment.
In some examples, the method 2800 may involve causing each of a plurality of audio devices in an audio environment to play back the modified audio playback signal simultaneously.
According to some embodiments, at least a portion of the first audio playback signal, at least a portion of the second audio playback signal, or at least a portion of each of the first audio playback signal and the second audio playback signal may correspond to silence.
Fig. 29 is a flowchart outlining another example of the disclosed audio device orchestration method. As with other methods described herein, the blocks of method 2900 are not necessarily performed in the order indicated. Furthermore, such methods may include more or less blocks than those shown and/or described. Method 2900 may be performed by an orchestration device, such as orchestration device 2701 described above with reference to fig. 27B. Method 2900 involves controlling orchestrated audio devices, such as some or all of orchestrated audio devices 2720a-2720n described above with reference to fig. 27A.
The following table defines the symbols used in fig. 29 and the following description:
TABLE 4
In this example, fig. 29 shows blocks of a method for allocating a spectral band k at a time block l. According to this example, the blocks shown in fig. 29 are repeated for each spectral band and each time block. The length of a time block may vary depending on the particular implementation, but may be on the order of a few seconds (e.g., in the range of 1 second to 5 seconds) or, for example, on the order of hundreds of milliseconds. The frequency spectrum occupied by a single frequency band may also vary depending on the particular implementation. In some implementations, the spectrum occupied by a single band is based on a perceptual scale, such as the Mel scale or critical bands.
As used herein, the term "time-frequency division block" refers to a single block of time in a single frequency band. At any given time, the time-frequency partitions may be occupied by a combination of program content (e.g., movie audio content, music, etc.) and one or more calibration signals. Neither the program content nor the calibration signal should be present when only background noise needs to be sampled. The corresponding time-frequency division blocks are referred to herein as "gaps".
The left column of fig. 29 (blocks 2902-2908) relates to estimating background noise in the audio environment when neither the content nor the calibration signal is present in the time-frequency block (in other words, when the time-frequency block corresponds to a gap). This is a simplified example of an orchestrated gap approach, such as those described above with reference to fig. 21, etc., with additional logic to handle calibration sequences that may occupy the same band in some cases.
In this example, processing of spectral band k at time block l is initiated in block 2901. Block 2902 involves determining whether the previous block (block l-1) had a gap in spectral band k. If so, this time-frequency block contains only background noise, which may be estimated in block 2903.

In this example, it is assumed that the noise is pseudo-stationary, such that the noise only needs to be sampled at regular intervals defined by a time T_N. Thus, block 2904 involves determining whether T_N has elapsed since the last noise measurement.

If it is determined in block 2904 that T_N has elapsed since the last measurement, processing continues to block 2905, which involves determining whether the calibration signal in the current time-frequency block is complete. Block 2905 is desirable because, in some embodiments, the calibration signal may occupy more than one time block, and it may be necessary (or at least desirable) to wait until the calibration signal in the current time-frequency block is complete before inserting a gap. In this example, if it is determined in block 2905 that the calibration signal is not complete, the method proceeds to block 2906, which involves marking the current time-frequency block as requiring noise estimation in a future block.

In this example, if it is determined in block 2905 that the calibration signal is complete, the method proceeds to block 2907, which involves determining whether there are any muted (gap) bands within a minimum spectral gap interval, denoted in this example as K_G. It should be noted that a frequency band is not muted (no gap is inserted) within the interval K_G of another gap band, in order to avoid producing perceptible artifacts in the reproduced audio data. If it is determined in block 2907 that there is a gap band within the minimum spectral gap interval, processing continues to block 2906 and the band is marked as requiring future noise estimation. However, if it is determined in block 2907 that there is no gap band within the minimum spectral gap interval, processing continues to block 2908, which involves having all orchestrated audio devices insert a gap into the band. In this example, block 2908 also involves sampling the noise in the current time-frequency block.
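A minimal sketch of the noise-estimation branch just described (blocks 2902-2908) is given below. It assumes a hypothetical state object that tracks gaps, calibration-sequence completion, and the age of noise estimates; all helper names and the handling of T_N and K_G are illustrative assumptions rather than an API defined by this disclosure.

```python
# Hypothetical sketch of the noise-estimation branch of Fig. 29 (blocks
# 2902-2908) for one band k at time block l.  The `state` object and its
# helper methods are assumptions standing in for the orchestration logic.
def schedule_noise_gap(state, k, l, T_N, K_G):
    if state.had_gap(k, l - 1):
        state.estimate_noise(k, l)                 # block 2903: gap -> noise only
        return
    if state.time_since_noise_estimate(k) < T_N:   # block 2904: noise still fresh
        return
    if not state.calibration_complete(k, l):       # block 2905
        state.mark_needs_noise(k)                  # block 2906: try again later
        return
    if state.gap_within_spectral_interval(k, K_G): # block 2907: too close to a gap
        state.mark_needs_noise(k)                  # block 2906
        return
    state.insert_gap_all_devices(k, l)             # block 2908: orchestrated gap
    state.estimate_noise(k, l)
```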
The right column of fig. 29 (blocks 2909-2917) relates to the servicing of any calibration signals (also referred to herein as calibration sequences) that may have been run in a previous time block. In some examples, each time-frequency block may contain a plurality of orthogonal calibration signals (such as DSSS sequences described herein), e.g., a set of calibration signals have been inserted into/mixed with the audio content and played back by each of a plurality of programmed audio devices. Thus, in this example, block 2909 involves iterating through all calibration sequences present in the current time-frequency block to determine if all calibration sequences have been serviced. If not, then the next calibration sequence is serviced beginning at block 2910.
Block 2911 relates to determining whether the calibration sequence has been completed. In some examples, a calibration sequence may span multiple time blocks, so a calibration sequence that began before the current time block will not necessarily be complete by the current time block. If it is determined in block 2911 that the calibration sequence is complete, processing continues to block 2912.
In this example, block 2912 involves determining whether the calibration sequence currently being evaluated has been successfully demodulated. For example, block 2912 may be based on information obtained from one or more programmed audio devices attempting to demodulate a calibration sequence currently being evaluated. The demodulation failure may be due to one or more of the following reasons:
1. High levels of background noise;
2. high levels of program content;
3. high-level calibration signals from nearby devices (in particular, the near-far problem discussed elsewhere herein); and
4. the devices are asynchronous.
If it is determined at block 2912 that the calibration sequence has been successfully demodulated, processing continues to block 2913. According to this example, block 2913 involves estimating one or more acoustic scene metrics, such as DOA, TOA, and/or audibility in the current frequency band. Block 2913 may be performed by one or more orchestrated devices and/or orchestration devices.
In this example, if it is determined in block 2912 that the calibration sequence was not successfully demodulated, the process continues directly to block 2914. According to this example, block 2914 involves monitoring the demodulated calibration signals and updating the calibration signal parameters as needed to ensure that all orchestrated devices hear each other sufficiently well (with sufficiently high mutual audibility). The robustness of the calibration signal parameters may be increased by adjusting the combination of parameters of the i-th device in the k-th band. In one example in which the calibration signal is a DSSS signal, increasing the robustness may involve modifying the parameters, for example, by performing one or more of the following:
1. Increasing the amplitude of the calibration signal;
2. reducing the chip rate of the calibration signal;
3. increasing the coherent integration time;
4. increasing the incoherent integration time; and/or
5. The number of concurrent signals in the same time-frequency block is reduced.
Modifications 2 and 3 may result in the calibration sequence occupying more time blocks.
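The sketch below illustrates, under stated assumptions, one way the parameter modifications listed above might be applied in sequence after a failed demodulation. The field names, step sizes, and limits are illustrative assumptions; reducing the number of concurrent signals (modification 5) would be handled elsewhere (e.g., in block 2917).

```python
# Hypothetical sketch of escalating DSSS calibration-signal robustness for
# one device in one band after a failed demodulation (block 2914).  All
# default values, step sizes and limits are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class DsssParams:
    amplitude_db: float = -30.0      # level relative to program content
    chip_rate_hz: float = 1000.0
    coherent_blocks: int = 1         # coherent integration time, in time blocks
    noncoherent_blocks: int = 1      # incoherent integration time, in time blocks

def escalate_robustness(p: DsssParams, amplitude_limit_db: float = -20.0) -> bool:
    """Apply one escalation step; return False when the limits checked in
    block 2915 have been reached and a forced gap should be considered."""
    if p.amplitude_db + 3.0 <= amplitude_limit_db:
        p.amplitude_db += 3.0        # modification 1: raise the calibration level
        return True
    if p.chip_rate_hz > 250.0:
        p.chip_rate_hz /= 2.0        # modification 2: reduce the chip rate
        return True
    if p.coherent_blocks < 4:
        p.coherent_blocks *= 2       # modification 3: longer coherent integration
        return True
    if p.noncoherent_blocks < 4:
        p.noncoherent_blocks *= 2    # modification 4: longer incoherent integration
        return True
    return False                     # limits reached (block 2915 -> block 2916)
```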
According to this example, block 2915 involves determining whether the calibration parameters have reached one or more limits. For example, block 2915 may involve determining whether the amplitude of the calibration signal has reached a limit such that exceeding the limit will result in the calibration signal being audible over the played back audio content. In some examples, block 2915 may involve determining that the coherent integration time or the noncoherent integration time has reached a predetermined limit.
If it is determined in block 2915 that the calibration parameters have not reached one or more limits, the process continues directly to block 2917. However, if it is determined in block 2915 that the calibration parameters have reached one or more limits, processing continues to block 2916. In some examples, block 2916 may involve scheduling (e.g., for a next time block) an orchestrated gap in which none of the orchestrated audio devices plays back content and only one of the orchestrated audio devices plays back an acoustic calibration signal. In some alternative examples, block 2916 may involve having only one orchestrated audio device play back both content and an acoustic calibration signal. In other examples, block 2916 may involve playback of content by all of the orchestrated audio devices and playback of the acoustic calibration signal by only one of the orchestrated audio devices.
In this example, block 2917 involves assigning a calibration sequence to the next block in the current band. In some cases, block 2917 may involve increasing or decreasing the number of acoustic calibration signals that are simultaneously played back during the next time block in the current frequency band. Block 2917 may, for example, involve determining when the last acoustic calibration signal in the current frequency band was successfully demodulated as part of a process of determining whether to increase or decrease the number of acoustic calibration signals that are simultaneously played back during the next time block in the current frequency band.
Fig. 30 shows an example of the time-frequency allocation of calibration signals, gaps for noise estimation, and gaps for listening to a single audio device. Fig. 30 is intended to represent a time snapshot of a continuous process in which different channel conditions exist in each frequency band prior to time block 1. As with the other disclosed examples, in fig. 30 time is represented as a series of blocks along the horizontal axis and frequency bands are represented along the vertical axis. The rectangles indicated as "device 1", "device 2", etc. in fig. 30 correspond to calibration signals for orchestrated audio device 1, orchestrated audio device 2, etc., in a specific frequency band and during one or more time blocks.

The calibration signals in band 1 essentially represent repeated single measurements, each lasting one time block. During each time block, except for time block 1 in which an orchestrated gap is inserted, the calibration signal of only one orchestrated audio device is present in band 1.

In band 2, the calibration signals of two orchestrated audio devices are present during each time block. In this example, the calibration signals have been assigned orthogonal codes. This arrangement allows all orchestrated audio devices to play back their acoustic calibration signals in half the time required by the arrangement shown in band 1. The calibration sequences of devices 1 and 2 are completed at the end of block 1, allowing the scheduled gap to be played in block 2, which delays the playback of the acoustic calibration signals of devices 3 and 4 to time block 3.

In band 3, four orchestrated audio devices attempt to play back their acoustic calibration signals in the first block, possibly due to good conditions before time block 1. However, this may produce poor demodulation results, so the concurrency is reduced to two devices in time block 2 (e.g., in block 2917 of fig. 29). However, poor demodulation results are still returned. After the forced gap in time block 3, instead of further reducing the concurrency to a single device, longer codes are allocated to devices 1 and 2 starting from time block 4 in an attempt to increase robustness.

Band 4 starts with only device 1 playing back its acoustic calibration signal during time blocks 1-4 (e.g., via a 4-block code sequence), possibly due to poor conditions before time block 1. The code sequence is incomplete in block 4 when the gap is scheduled, so the forced gap is delayed by one time block.

The scenario depicted for band 5 is substantially the same as that of band 2, with two orchestrated audio devices simultaneously playing back their acoustic calibration signals during a single time block. In this example, due to the delayed gap in band 4, the gap scheduled for time block 5 is delayed to time block 6, because in this example the minimum spectral interval K_G does not allow two adjacent spectral bands to have a forced gap at the same time.
Fig. 31 depicts an audio environment, which in this example is a living space. As with the other figures provided herein, the types, amounts, and arrangements of elements shown in fig. 31 are provided by way of example only. Other embodiments may include more, fewer, and/or different types, numbers, and/or arrangements of elements. In other examples, the audio environment may be other types of environments, such as an office environment, a vehicle environment, a park or other outdoor environment, and so forth. In this example, the elements of fig. 31 include the following:
-3101: a person, which may also be referred to as a "user" or "listener";
-3102: a smart speaker comprising one or more loudspeakers and one or more microphones;
-3103: a smart speaker comprising one or more loudspeakers and one or more microphones;
-3104: a smart speaker comprising one or more loudspeakers and one or more microphones;
-3105: a smart speaker comprising one or more loudspeakers and one or more microphones;
-3106: a sound source, which may be a noise source, located in the same room of the audio environment as the person 3101 and the smart speakers 3102-3105, and whose location is known. In some examples, the sound source 3106 may be a conventional device, such as a radio, that is not part of the audio system that includes the smart speakers 3102-3105. In some cases, the volume of sound source 3106 may not be continuously adjustable by person 3101 and may not be adjustable by the orchestration device. For example, the volume of sound source 3106 may be adjustable only by a manual process, e.g., via an on/off switch or by selecting a power or speed level (e.g., a power or speed level of a fan or air conditioner); and
-3107: a sound source, which may be a noise source, that is not located in the same room of the audio environment as the person 3101 and the smart speakers 3102-3105. In some examples, the sound source 3107 may not have a known location. In some cases, the sound source 3107 may be diffuse.
The following discussion refers to several basic assumptions. For example, it is assumed that an estimate of the location of each audio device (such as the smart speakers 3102-3105 of fig. 31) and an estimate of the listener's location (such as the location of person 3101) are available. In addition, it is assumed that a measure of mutual audibility between the audio devices is known. In some examples, such a measure of mutual audibility may be in the form of reception levels in multiple frequency bands. Some examples are described below. In other examples, the measure of mutual audibility may be a wideband measure, such as a measure that includes only one frequency band.
The reader may question whether the microphones in consumer devices provide a uniform response, because mismatched microphone gains would add a layer of ambiguity. However, most smart speakers contain microelectromechanical systems (MEMS) microphones that are very well matched (at worst ±3 dB, but typically within ±1 dB) and have a limited set of acoustic overload points, such that the absolute mapping from digital dBFS (decibels relative to full scale) to dBSPL (decibels of sound pressure level) can be determined from the model and/or device descriptor. Thus, it can be assumed that the MEMS microphones provide a well-calibrated acoustic reference for mutual audibility measurements.
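As a hedged illustration of the dBFS-to-dBSPL mapping mentioned above, the sketch below assumes that 0 dBFS corresponds to the microphone's acoustic overload point taken from a model or device descriptor; the 120 dBSPL value is an assumption for illustration only.

```python
# Hypothetical sketch of mapping a digital microphone level (dBFS) to an
# absolute acoustic level (dBSPL) using a per-model MEMS acoustic overload
# point (AOP).  The 120 dBSPL default is an illustrative assumption.
def dbfs_to_dbspl(level_dbfs: float, acoustic_overload_point_dbspl: float = 120.0) -> float:
    """A signal at 0 dBFS is assumed to correspond to the microphone's AOP;
    lower digital levels map linearly below it."""
    return acoustic_overload_point_dbspl + level_dbfs

# Example: a band level of -46 dBFS on a microphone with a 120 dBSPL AOP
# corresponds to roughly 74 dBSPL.
```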
Fig. 32, 33 and 34 are block diagrams representing three types of disclosed embodiments. Fig. 32 illustrates an implementation that relates to estimating the audibility (in dBSPL in this example), at a user location (e.g., the location of person 3101 of fig. 31), of all audio devices in an audio environment (e.g., the smart speakers 3102-3105), based on the mutual audibility between the audio devices, their physical locations, and the location of the user. Such an embodiment does not require the use of a reference microphone at the user location. In some such examples, the audibility may be normalized by the digital level of the loudspeaker drive signal (in dBFS in this example) to produce a transfer function between each audio device and the user. According to some examples, the implementation represented in fig. 32 is essentially a sparse interpolation problem: given the measured banded levels between a set of audio devices at known locations, a model is applied to estimate the level received at the listener's location.
In the example shown in fig. 32, the full matrix spatial audibility interpolator is shown receiving device geometry information (audio device location information), a mutual audibility matrix (an example of which is described below), and user location information, and outputting an interpolated transfer function. In this example, the interpolated transfer function is from dBFS to dBSPL, which may be useful for leveling and equalizing audio devices (such as smart devices). In some examples, there may be some empty rows or columns in the audibility matrix corresponding to input-only or output-only devices. Details of an implementation corresponding to the example of fig. 32 are set forth in the discussion of the "full matrix mutual audibility implementation" below.
Fig. 33 illustrates an embodiment that involves estimating the audibility of an uncontrolled point source (in dBSPL units in this example) at a user location based on the audibility of the uncontrolled point source (such as sound source 3106 of fig. 31) at the audio device, the physical location of the audio device, the location of the uncontrolled point source, and the location of the user. In some examples, the uncontrolled point source may be a noise source located in the same room as the audio device and the person. In the example shown in fig. 33, the point source spatial audibility interpolator is shown as receiving device geometry information (audio device location information), audibility matrix (an example of which is described below) and sound source location information, and outputting the interpolated audibility information.
Fig. 34 illustrates an embodiment involving estimating the audibility (in dBSPL in this example), at a user location, of a diffuse and/or non-localized and uncontrolled source (such as sound source 3107 of fig. 31), based on the audibility of the sound source at each audio device, the physical locations of the audio devices, and the location of the user. In this embodiment, it is assumed that the position of the sound source is unknown. In the example shown in fig. 34, the naive spatial audibility interpolator is shown receiving device geometry information (audio device location information) and an audibility matrix (an example of which is described below), and outputting interpolated audibility information. In some examples, the interpolated audibility information referenced in figs. 33 and 34 may indicate the interpolated audibility in dBSPL, which may be useful for estimating the level received from a sound source (e.g., from a noise source). By interpolating the received level of a noise source, noise compensation (e.g., a process of increasing the gain of content in bands in which noise is present) can be applied more accurately than is achievable with reference to noise detected by a single microphone.
Full matrix mutual audibility implementation
Table 5 indicates the meaning represented by each item of the equation in the following discussion.
TABLE 5
Let L be the total number of audio devices, each containing M_i microphones, and let K be the total number of spectral bands reported by the audio devices. According to this example, a mutual audibility matrix H is determined, comprising the measured transfer functions between all devices in all bands, expressed in linear units.
There are several examples for determining H. However, the disclosed embodiments are agnostic to the method used to determine H.
Some examples of determining H may involve multiple iterations of a "single" calibration played back in turn by each audio device, with a controlled acoustic calibration signal such as a sinusoidal sweep, noise (e.g., white noise or pink noise), an acoustic DSSS signal, or planned program material. In some such examples, determining H may involve sequential processing of having a single smart audio device emit sound while other smart audio devices "listen" to the sound.
For example, referring to fig. 31, one such process may involve: (a) Causing the audio device 3102 to emit sound and receiving microphone data corresponding to the emitted sound from the microphone arrays of the audio devices 3103-3105; then (b) causing the audio device 3103 to emit sound and receiving microphone data corresponding to the emitted sound from the microphone arrays of the audio devices 3102, 3104, and 3105; then (c) causing the audio device 3104 to emit sound and receiving microphone data corresponding to the emitted sound from the microphone arrays of the audio devices 3102, 3103, and 3105; then (d) causes the audio device 3105 to emit sound and receives microphone data corresponding to the emitted sound from the microphone arrays of the audio devices 3102, 3103, and 3104. The sounds emitted may be the same or may be different, depending on the particular implementation.
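A minimal sketch of such a sequential measurement procedure is shown below. The emit_and_record helper, the band definition, and the matrix layout are assumptions standing in for whatever orchestration and banded level measurement the system actually provides.

```python
# Hypothetical sketch of building a banded mutual audibility matrix by having
# each device emit in turn while the others measure per-band received levels.
import numpy as np

def measure_mutual_audibility(devices, band_edges_hz, emit_and_record):
    """Return H with shape (L, L, K): H[i, j, k] is the linear-scale level of
    device i's test signal as received at device j in band k.

    emit_and_record(i) is assumed to play a calibration signal on device i and
    return an (L, K) array of banded levels measured at every device.
    """
    L, K = len(devices), len(band_edges_hz) - 1
    H = np.zeros((L, L, K))
    for i in range(L):
        banded_levels = emit_and_record(i)     # one row of measurements per device
        H[i] = banded_levels                   # H[i, j, k] = level at device j, band k
        H[i, i] = 0.0                          # ignore each device's own playback
    return H
```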
Some pervasive and/or persistent methods described in detail herein that relate to acoustic calibration signals involve simultaneous playback of acoustic calibration signals by multiple audio devices in an audio environment. In some such examples, the acoustic calibration signal is mixed into the audio content for playback. According to some embodiments, the acoustic calibration signal is subsonic. Some such examples also relate to spectral puncturing (also referred to herein as forming a "gap").
According to some implementations, an audio device including multiple microphones may estimate multiple audibility matrices (e.g., one for each microphone) that are averaged to produce a single audibility matrix for each device. In some examples, abnormal data that may be caused by microphone failure may be detected and removed.
As described above, it is also assumed that the spatial position x_i of each audio device, in 2D or 3D coordinates, is available. Some examples of determining device locations based on time of arrival (TOA), direction of arrival (DOA), and a combination of DOA and TOA are described below. In other examples, the spatial location x_i of an audio device can be determined by manual measurement, for example using a tape measure.
In addition, it is also assumed that the location x_u of the user is known, and in some cases the orientation of the user may also be known. Some methods for determining listener location and listener orientation are described in detail below. According to some examples, the device locations X = [x_1 x_2 … x_L]^T may have been translated such that x_u is located at the origin of the coordinate system.
According to some embodiments, the objective is to estimate an interpolated mutual audibility matrix B by applying a suitable interpolation to the measurement data. In one example, a distance fading law model is used, in which x_i represents the location of the transmitting device, x_A represents the location of the receiving device, and the model is parameterized by an unknown linear output gain in band k and by a distance fading constant. A least-squares fit of this model to the measured mutual audibility data produces estimated parameters for the i-th transmitting device. The audibility at the user location, estimated in linear units, is then obtained by evaluating the fitted model with the estimated parameters at the user location x_u. In some embodiments, the distance fading constant can be constrained to be a global room parameter and, in some examples, may additionally be constrained to lie within a particular range of values.
Fig. 35 shows an example of a heat map. In this example, heat map 3500 represents the estimated transfer function, in one frequency band, from a sound source (o) to any point in the room in the x and y dimensions indicated in fig. 35. The estimated transfer function is based on interpolation of sound source measurements from 4 receivers (x). For any user location x_u in the room, the interpolated level is given by heat map 3500.
In another example, the distance fading model may include a critical distance parameter. In this case, the critical distance, which in some examples may be solved for as a global room parameter d_c, may additionally be constrained to lie within a fixed range of values.
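The sketch below illustrates one possible realization, under stated assumptions, of fitting a simple distance-fading model to measured mutual audibility data and evaluating it at the user location. The log-domain power-law form (level ≈ gain · distance^−α), the shared fading exponent, and all names are illustrative assumptions; they are not the disclosure's exact equations.

```python
# Hypothetical sketch of full-matrix audibility interpolation via a distance
# fading law.  The model form and all names are assumptions for illustration.
import numpy as np

def fit_fading_model(H_k, positions, eps=1e-6):
    """Fit per-device linear output gains and a shared distance-fading
    exponent for one frequency band.

    H_k       : (L, L) measured mutual audibility (linear units); H_k[i, j] is
                the level of device i received at device j.
    positions : (L, 2) or (L, 3) audio device coordinates.
    """
    L = H_k.shape[0]
    rows, rhs = [], []
    for i in range(L):
        for j in range(L):
            if i == j:
                continue
            d = np.linalg.norm(positions[i] - positions[j]) + eps
            # log model: log H = log g_i - alpha * log d
            row = np.zeros(L + 1)
            row[i] = 1.0          # selects log g_i
            row[L] = -np.log(d)   # shared fading exponent alpha
            rows.append(row)
            rhs.append(np.log(H_k[i, j] + eps))
    theta, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
    log_gains, alpha = theta[:L], theta[L]
    return np.exp(log_gains), alpha

def audibility_at_user(gains, alpha, positions, x_user, eps=1e-6):
    """Evaluate the fitted model at the user location: one value per device."""
    d = np.linalg.norm(np.asarray(positions) - np.asarray(x_user), axis=1) + eps
    return gains * d ** (-alpha)
```

A critical-distance variant could be obtained by replacing the d ** (-alpha) term with a function that flattens once the distance exceeds an estimated critical distance.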
Fig. 36 is a block diagram showing another embodiment example. As with the other figures provided herein, the types, amounts, and arrangements of elements shown in fig. 36 are provided by way of example only. Other embodiments may include more, fewer, and/or different types, numbers, and/or arrangements of elements. In this example, the full matrix spatial audibility interpolator 3605, the delay compensation block 3610, the equalization and gain compensation block 3615, and the flexible renderer block 3620 are implemented by an example of the control system 160 of the apparatus 150 described above with reference to FIG. 1B. In some implementations, the apparatus 150 may be an orchestration device for an audio environment. According to some examples, apparatus 150 may be one of the audio devices of the audio environment. In some cases, the full matrix spatial audibility interpolator 3605, the delay compensation block 3610, the equalization and gain compensation block 3615, and the flexible renderer block 3620 may be implemented via instructions (e.g., software) stored on one or more non-transitory media.
In some examples, the full matrix spatial audibility interpolator 3605 may be configured to calculate the estimated audibility at the listener location as described above. According to this example, the equalization and gain compensation block 3615 is configured to determine an equalization and compensation gain matrix 3617 based on the interpolated audibility 3607 received from the full matrix spatial audibility interpolator 3605. In some cases, the equalization and compensation gain matrix 3617 may be determined using normalization techniques. For example, the estimated level at the user location may be smoothed across frequency bands, and equalization (EQ) gains may be calculated such that the result matches a target curve. In some embodiments, the target curve may be spectrally flat. In other examples, the target curve may roll off toward high frequencies to avoid overcompensation. In some cases, the EQ bands may then be mapped to a different set of frequency bands corresponding to the capabilities of a particular parametric equalizer. In some examples, the different set of frequency bands may be the 77 CQMF bands mentioned elsewhere herein. In other examples, the different set of frequency bands may include a different number of frequency bands, e.g., 20 critical bands or as few as two frequency bands (high and low). Some flexible renderer implementations may use 20 critical bands.
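As a hedged sketch of the smoothing-and-target-matching step described above, the function below smooths the estimated per-band levels at the user location and returns per-band EQ gains toward a target curve; the smoothing kernel, the flat default target, and the boost limit are illustrative assumptions.

```python
# Hypothetical sketch of deriving per-band EQ gains from interpolated per-band
# levels at the user location.  Smoothing length, target and limits are
# illustrative assumptions.
import numpy as np

def eq_gains_db(user_band_levels_db, target_db=None, smooth=3, max_boost_db=12.0):
    """Smooth the estimated levels across bands and return the gain (in dB)
    needed in each band to reach the target curve, with boost/cut limited."""
    levels = np.asarray(user_band_levels_db, dtype=float)
    kernel = np.ones(smooth) / smooth
    smoothed = np.convolve(levels, kernel, mode="same")     # simple cross-band smoothing
    if target_db is None:
        target_db = np.full_like(smoothed, smoothed.mean()) # spectrally flat target
    return np.clip(target_db - smoothed, -max_boost_db, max_boost_db)
```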
In this example, the process of applying the compensation gain and the EQ are separated such that the compensation gain provides a coarse overall level match and the EQ provides finer control in multiple bands. According to some alternative embodiments, the compensation gain and EQ may be implemented as a single process.
In this example, the flexible renderer block 3620 is configured to render the audio data of the program content 3630 according to corresponding spatial information (e.g., location metadata) of the program content 3630. The flexible renderer block 3620 may be configured to implement CMAP, FV, a combination of CMAP and FV, or another type of flexible rendering, depending on the particular implementation. According to this example, the flexible renderer block 3620 is configured to use the equalization and compensation gain matrix 3617 to ensure that each loudspeaker is heard by the user at the same equalized level. The loudspeaker signals 3625 output by the flexible renderer block 3620 may be provided to the audio devices of the audio system.
According to this embodiment, the delay compensation block 3610 is configured to determine delay compensation information 3612 (which in some examples may be or include the delay compensation vector shown in Table 5). The delay compensation information 3612 is based on the time required for sound to travel the distance between the user's location and the location of each loudspeaker. According to this example, the flexible renderer block 3620 is configured to apply the delay compensation information 3612 to ensure that sound played back from all loudspeakers arrives at the user at the same time.
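A minimal sketch of the delay compensation described above is given below, assuming straight-line propagation at a nominal speed of sound; the names, units, and the 343 m/s value are illustrative assumptions.

```python
# Hypothetical sketch of per-loudspeaker delay compensation: delay each device
# so that sound from every loudspeaker arrives at the user at the same time.
import numpy as np

def delay_compensation_seconds(device_positions, user_position, speed_of_sound=343.0):
    """device_positions: (L, 2) or (L, 3); returns an (L,) vector of extra
    delays so that all direct arrivals at the user coincide."""
    distances = np.linalg.norm(np.asarray(device_positions) - np.asarray(user_position), axis=1)
    travel_times = distances / speed_of_sound
    return travel_times.max() - travel_times   # farthest device gets zero extra delay
```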
Fig. 37 is a flowchart outlining one example of another method that may be performed by an apparatus or system such as that disclosed herein. As with other methods described herein, the blocks of method 3700 are not necessarily performed in the order indicated. Furthermore, such methods may include more or less blocks than those shown and/or described. The blocks of method 3700 may be performed by one or more devices, which may be (or may include) a control system, such as control system 160 shown in fig. 1B and described above, or one of the other disclosed examples of control systems. According to some examples, the blocks of method 3700 may be implemented by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media.
In such an embodiment, block 3705 relates to causing, by a control system, a plurality of audio devices in an audio environment to reproduce audio data. In this example, each of the plurality of audio devices includes at least one loudspeaker and at least one microphone. However, in some such examples, the audio environment may include at least one output-only audio device having at least one loudspeaker but no microphone. Alternatively or additionally, in some such examples, the audio environment may include one or more input-only audio devices having at least one microphone but no loudspeaker. Some examples of methods 3700 in this context are described below.
According to this example, block 3710 relates to determining, by a control system, audio device location data including an audio device location of each of a plurality of audio devices. In some examples, block 3710 may involve determining the audio device location data by reference to previously obtained audio device location data stored in memory (e.g., in memory system 165 of fig. 1B). In some cases, block 3710 may involve determining audio device location data via an audio device automatic positioning process. The audio device auto-positioning process may involve performing one or more audio device auto-positioning methods, such as the DOA-based and/or TOA-based audio device auto-positioning methods referenced elsewhere herein.
According to such an embodiment, block 3715 relates to obtaining, by the control system, microphone data from each of the plurality of audio devices. In this example, the microphone data corresponds at least in part to sound reproduced by loudspeakers of other audio devices in the audio environment.
In some examples, causing the plurality of audio devices to reproduce the audio data may involve causing each of the plurality of audio devices to play back the audio when all other audio devices in the audio environment are not playing back the audio. For example, referring to fig. 31, one such process may involve: (a) Causing the audio device 3102 to emit sound and receiving microphone data corresponding to the emitted sound from the microphone arrays of the audio devices 3103-3105; then (b) causing the audio device 3103 to emit sound and receiving microphone data corresponding to the emitted sound from the microphone arrays of the audio devices 3102, 3104, and 3105; then (c) causing the audio device 3104 to emit sound and receiving microphone data corresponding to the emitted sound from the microphone arrays of the audio devices 3102, 3103, and 3105; then (d) causes the audio device 3105 to emit sound and receives microphone data corresponding to the emitted sound from the microphone arrays of the audio devices 3102, 3103, and 3104. The sounds emitted may be the same or different, depending on the particular implementation.
Other examples of block 3715 may involve obtaining microphone data while each audio device is playing back content. Some such examples may involve spectral puncturing (also referred to herein as forming a "gap"). Thus, some such examples may involve having each of a plurality of audio devices insert one or more frequency range gaps into audio data reproduced by one or more loudspeakers of each audio device by a control system.
In this example, block 3720 relates to determining, by the control system, a mutual audibility of each of the plurality of audio devices relative to each other of the plurality of audio devices. In some implementations, block 3720 may involve determining a mutual audibility matrix, e.g., as described above. In some examples, determining the mutual audibility matrix may involve a process of mapping the full scale decibels to sound pressure level decibels. In some implementations, the mutual audibility matrix may include a measured transfer function between each of the plurality of audio devices. In some examples, the mutual audibility matrix may include a value for each of a plurality of frequency bands.
According to such an embodiment, block 3725 relates to determining, by the control system, a user location of a person in an audio environment. In some examples, determining the user location may be based at least in part on at least one of arrival direction data or arrival time data corresponding to one or more utterances of the person. Some detailed examples of determining a user location of a person in an audio environment are described below.
In this example, block 3730 relates to determining, by the control system, user location audibility of each of the plurality of audio devices at the user location. According to such an implementation, block 3735 relates to controlling one or more aspects of playback of the audio device based at least in part on the user location audibility. In some examples, one or more aspects of audio device playback may include leveling and/or equalization such as described above with reference to fig. 36.
According to some examples, block 3720 (or another block of method 3700) may involve determining a mutual audibility matrix of the interpolation by applying the interpolation to the measured audibility data. In some examples, determining the interpolated mutual audibility matrix may involve applying a fading law model based in part on a distance fading constant. In some examples, the distance fading constant may comprise a per-device parameter and/or an audio environment parameter. In some cases, the fading law model may be band-based. According to some examples, the fading law model may comprise a critical distance parameter.
In some examples, method 3700 may involve estimating an output gain of each of the plurality of audio devices from values of a mutual audibility matrix and a fading law model. In some cases, estimating the output gain of each audio device may involve determining a least squares solution to a function of the values of the mutual audibility matrix and the fading law model. In some examples, method 3700 may involve determining values of the interpolated mutual audibility matrix as a function of an output gain of each audio device, a user location, and a location of each audio device. In some examples, the values of the interpolated mutual audibility matrix may correspond to user location audibility of each audio device.
According to some examples, method 3700 may involve equalizing the band values of the interpolated mutual audibility matrix. In some examples, method 3700 may involve applying a delay compensation vector to the interpolated mutual audibility matrix.
As described above, in some implementations, the audio environment may include at least one output-only audio device having at least one loudspeaker but no microphone. In some such examples, method 3700 may involve determining audibility of at least one output-only audio device at an audio device location of each of a plurality of audio devices.
As described above, in some implementations, the audio environment may include one or more input-only audio devices having at least one microphone but no loudspeaker. In some such examples, method 3700 may involve determining audibility of each loudspeaker-equipped audio device in an audio environment at a location of each of one or more input-only audio devices.
Point noise source case implementation
This section discloses an embodiment corresponding to fig. 33. As used in this section, "point noise source" refers to a noise source whose location x_n is available but whose source signal is not available; an example is when the sound source 3106 of fig. 31 is a noise source. Instead of (or in addition to) determining a mutual audibility matrix corresponding to the mutual audibility of each of a plurality of audio devices in an audio environment, embodiments for the point noise source case involve determining the audibility of such a point source at each of a plurality of audio device locations. Some such examples involve determining a noise audibility matrix A, which contains the received level of the point source measured at each of the plurality of audio device locations, rather than a transfer function as in the full matrix spatial audibility example described above.
In some embodiments, the estimation of A may be made in real time, e.g., while audio is being played back in the audio environment. According to some embodiments, the estimation of A may be part of a process of compensating for noise from a point source (or from another sound source at a known location).
Fig. 38 is a block diagram showing an example of a system according to another embodiment. As with the other figures provided herein, the types, amounts, and arrangements of elements shown in fig. 38 are provided by way of example only. Other embodiments may include more, fewer, and/or different types, numbers, and/or arrangements of elements. According to this example, the control systems 160A-160L correspond to the audio devices 3801A-3801L (where L is two or more) and are examples of the control system 160 of the apparatus 150 described above with reference to FIG. 1B. Here, the control systems 160A-160L are implementing the multi-channel acoustic echo cancellers 3805A-3805L.
In this example, the point source spatial audibility interpolator 3810 and the noise compensation block 3815 are implemented by a control system 160M of an apparatus 3820, the apparatus 3820 being another example of the apparatus 150 described above with reference to fig. 1B. In some examples, the apparatus 3820 may be an orchestration device or a smart home hub, as referred to elsewhere herein. However, in alternative examples, apparatus 3820 may be an audio device. In some cases, the functionality of the apparatus 3820 may be implemented by one of the audio devices 3801A-3801L. In some cases, the multi-channel acoustic echo cancellers 3805A-3805L, the point source spatial audibility interpolator 3810, and/or the noise compensation block 3815 may be implemented via instructions (e.g., software) stored on one or more non-transitory media.
In this example, a sound source 3825 produces sound 3830 in the audio environment. According to this example, the sound 3830 will be considered noise. In this case, the sound source 3825 does not operate under the control of any of the control systems 160A-160M. In this example, the location of the sound source 3825 is known to the control system 160M (in other words, provided to the control system 160M and/or stored in a memory accessible to the control system 160M).
According to this example, the multi-channel acoustic echo canceller 3805A receives microphone signals 3802A from one or more microphones of the audio device 3801A and a local echo reference 3803A corresponding to audio being played back by the audio device 3801A. Here, the multi-channel acoustic echo canceller 3805A is configured to generate a residual microphone signal 3807A (which may also be referred to as an echo cancelled microphone signal) and provide the residual microphone signal 3807A to the device 3820. In this example, it is assumed that the residual microphone signal 3807A corresponds primarily to the sound 3830 received at the location of the audio device 3801A.
Similarly, the multi-channel acoustic echo canceller 3805L receives microphone signals 3802L from one or more microphones of the audio device 3801L and a local echo reference 3803L corresponding to audio played back by the audio device 3801L. The multi-channel acoustic echo canceller 3805L is configured to output a residual microphone signal 3807L to the apparatus 3820. In this example, it is assumed that the residual microphone signal 3807L corresponds primarily to the sound 3830 received at the location of the audio device 3801L. In some examples, the multi-channel acoustic echo cancellers 3805A-3805L may be configured for echo cancellation in each of K frequency bands.
In this example, the point source spatial audibility interpolator 3810 receives the residual microphone signals 3807A-3807L, as well as the audio device geometry (location data for each audio device 3801A-3801L) and source location data. According to this example, the point source spatial audibility interpolator 3810 is configured to determine noise audibility information indicative of the reception level of the sound 3830 at each of the locations of the audio devices 3801A-3801L. In some examples, the noise audibility information may include noise audibility data for each of the K frequency bands and, in some cases, may be the noise audibility matrix A mentioned above.
In some embodiments, the point source spatial audibility interpolator 3810 (or another block of the control system 160M) may be configured to estimate noise audibility information 3812 indicative of the level of the sound 3830 at a user location in the audio environment, based on the user location data and the received level of the sound 3830 at each of the locations of the audio devices 3801A-3801L. In some cases, estimating the noise audibility information 3812 may involve interpolation processes such as those described above, for example, applying a distance fading model to estimate a noise level vector at the user location.
According to this example, the noise compensation block 3815 is configured to determine a noise compensation gain 3817 based on the estimated noise level 3812 at the user location. In this example, the noise compensation gain 3817 is a multi-band noise compensation gain, which may differ from one frequency band to another. For example, the noise compensation gain may be higher in a frequency band corresponding to a higher estimated level of the sound 3830 at the user location. In some examples, the noise compensation gain 3817 is provided to the audio devices 3801A-3801L so that the audio devices 3801A-3801L may control playback of audio data in accordance with the noise compensation gain 3817. As indicated by the dashed lines 3817A and 3817L, in some cases the noise compensation block 3815 may be configured to determine a noise compensation gain specific to each of the audio devices 3801A-3801L.
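The sketch below illustrates, under stated assumptions, how per-band noise compensation gains might be derived from an interpolated per-band noise level at the user location; the target signal-to-noise ratio and gain cap are illustrative assumptions, not values from this disclosure.

```python
# Hypothetical sketch of turning an interpolated per-band noise level at the
# user location into per-band noise compensation gains (cf. block 3815).
import numpy as np

def noise_compensation_gains_db(noise_levels_dbspl, content_levels_dbspl,
                                target_snr_db=12.0, max_gain_db=10.0):
    """Raise content in each band just enough to keep it target_snr_db above
    the estimated noise, never cutting and never boosting past max_gain_db."""
    noise = np.asarray(noise_levels_dbspl, dtype=float)
    content = np.asarray(content_levels_dbspl, dtype=float)
    needed = (noise + target_snr_db) - content
    return np.clip(needed, 0.0, max_gain_db)
```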
FIG. 39 is a flowchart outlining one example of another method that may be performed by an apparatus or system such as that disclosed herein. As with other methods described herein, the blocks of method 3900 are not necessarily performed in the order indicated. Furthermore, such methods may include more or less blocks than those shown and/or described. The blocks of method 3900 may be performed by one or more devices, which may be (or may include) a control system such as that shown in fig. 1B and described above, or one of the other disclosed examples of control systems. According to some examples, the blocks of method 3900 may be implemented by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media.
In such an embodiment, block 3905 relates to receiving, by a control system, a residual microphone signal from each of a plurality of microphones in an audio environment. In this example, the residual microphone signal corresponds to sound received at each of the plurality of audio device sites from a noise source. In the example described above with reference to fig. 38, block 3905 relates to the control system 160M receiving residual microphone signals 3807A-3807L from the multi-channel acoustic echo cancellers 3805A-3805L. However, in some alternative implementations, one or more of blocks 3905-3925 (and in some cases all of blocks 3905-3925) may be performed by another control system, such as one of the audio device control systems.
According to this example, block 3910 relates to obtaining, by a control system, audio device location data corresponding to each of a plurality of audio device locations, noise source location data corresponding to locations of noise sources, and user location data corresponding to locations of people in an audio environment. In some examples, block 3910 may involve determining audio device location data, noise source location data, and/or user location data by reference to previously obtained audio device location data stored in memory (e.g., in memory system 115 of fig. 1). In some cases, block 3910 may involve determining audio device location data, noise source location data, and/or user location data via an automatic positioning process. The automatic positioning process may involve performing one or more automatic positioning methods, such as the automatic positioning methods referenced elsewhere herein.
According to such an embodiment, block 3915 relates to estimating a noise level of sound from the noise source at the user location based on the residual microphone signals, the audio device location data, the noise source location data, and the user location data. In the example described above with reference to fig. 38, block 3915 may involve the point source spatial audibility interpolator 3810 (or another block of the control system 160M) estimating the noise level 3812 of the sound 3830 at the user location in the audio environment, based on the user location data and the received level of the sound 3830 at each of the locations of the audio devices 3801A-3801L. In some cases, block 3915 may involve interpolation processes such as those described above, for example, applying a distance fading model to estimate a noise level vector at the user location.
In this example, block 3920 relates to determining a noise compensation gain for each audio device based on the estimated noise level of sound from the noise source at the user location. In the example described above with reference to fig. 38, block 3920 may involve the noise compensation block 3815 determining a noise compensation gain 3817 based on the estimated noise level 3812 at the user location. In some examples, the noise compensation gain may be a multi-band noise compensation gain, which may differ from one frequency band to another.
According to such an embodiment, block 3925 relates to providing noise compensation gain to each audio device. In the example described above with reference to fig. 38, block 3925 may involve apparatus 3820 providing noise compensation gains 3817A-3817L to each of the audio devices 3801A-3801L.
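As an illustration of blocks 3915 and 3920, the following Python sketch estimates per-band noise levels at the user location from the residual levels measured at each audio device and converts them into per-band compensation gains. It assumes a simple 1/r ("distance fading") point-source model and a simple boost-above-a-quiet-floor gain policy; the interpolator 3810 and noise compensation block 3815 may differ in detail, and all function and parameter names here are hypothetical.

```python
import numpy as np

def estimate_user_noise_level(residual_levels, device_xy, noise_xy, user_xy):
    """Estimate per-band noise levels at the user location (cf. block 3915).

    residual_levels: (L, K) banded levels measured at L audio device locations.
    device_xy: (L, 2) device positions; noise_xy, user_xy: 2D positions of the
    noise source and the user. A 1/r distance-fading model is assumed.
    """
    device_xy = np.asarray(device_xy, dtype=float)
    d_dev = np.linalg.norm(device_xy - np.asarray(noise_xy, dtype=float), axis=1)
    d_user = np.linalg.norm(np.asarray(user_xy, dtype=float) - np.asarray(noise_xy, dtype=float))
    source_levels = np.asarray(residual_levels) * d_dev[:, None]  # back out source level per device
    return source_levels.mean(axis=0) / max(d_user, 1e-3)         # (K,) levels at the user

def noise_compensation_gains(user_noise_level, floor_db=-60.0, max_gain_db=12.0):
    """Per-band playback gains (cf. block 3920): boost each band in proportion
    to how far the estimated noise rises above a quiet reference floor."""
    noise_db = 20.0 * np.log10(np.maximum(user_noise_level, 1e-12))
    return np.clip(noise_db - floor_db, 0.0, max_gain_db)
```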
Walk-through or non-localized noise source implementation
Localization of sound sources such as noise sources may not always be possible, especially when the sound source is not in the same room or is heavily occluded from the microphone array(s) detecting the sound. In such cases, estimating the noise level at the user location may be treated as a sparse interpolation problem with a few known noise level values (e.g., one at the microphone or microphone array of each of a plurality of audio devices in the audio environment).
Such an interpolation can be expressed as a general function f: R^2 → R that maps a known point in 2D space (defined by its coordinates (x, y)) to an interpolated scalar value. One example involves selecting a subset of three nodes (corresponding to the microphones or microphone arrays of three audio devices in the audio environment) to form a triangle of nodes and solving for the audibility within the triangle by bivariate linear interpolation. For any given node i, the reception level in the kth band can be expressed as

l_i^k = a_k x_i + b_k y_i + c_k,

and the unknowns a_k, b_k and c_k are solved for from the three nodes. The interpolated audibility at any point (x, y) within the triangle then becomes

l^k(x, y) = a_k x + b_k y + c_k.
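A minimal sketch of the bivariate linear interpolation just described, assuming three nodes with known 2D positions and per-band received levels; the solver below simply fits the plane l^k(x, y) = a_k x + b_k y + c_k through the three nodes and evaluates it at a query point such as the user location. The function and argument names are illustrative only.

```python
import numpy as np

def triangle_interpolate(node_xy, node_levels, query_xy):
    """Bivariate linear interpolation inside a triangle of nodes.

    node_xy: (3, 2) node positions; node_levels: (3, K) received level per node
    and band; query_xy: (2,) point assumed to lie inside the triangle.
    """
    node_xy = np.asarray(node_xy, dtype=float)
    node_levels = np.asarray(node_levels, dtype=float)
    A = np.column_stack([node_xy, np.ones(3)])   # rows: [x_i, y_i, 1]
    coeffs = np.linalg.solve(A, node_levels)     # (3, K): a_k, b_k, c_k per band
    q = np.array([query_xy[0], query_xy[1], 1.0])
    return q @ coeffs                            # (K,) interpolated levels
```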
Other examples may involve barycentric interpolation or trigonometric interpolation, for example, as described in Amidror, Isaac, "Scattered data interpolation methods for electronic imaging systems: a survey," Journal of Electronic Imaging, vol. 11, no. 2, April 2002, pages 157-176, which is incorporated herein by reference. Such interpolation methods are suitable for the noise compensation method described above with reference to figs. 38 and 39, for example, by replacing the point-source spatial audibility interpolator 3810 of fig. 38 with a naive spatial interpolator implemented according to any of the interpolation methods described in this section and by omitting the process of obtaining noise source location data in block 3910 of fig. 39. The interpolation methods described in this section do not model spherical distance fading, but they do provide reasonable interpolation of levels within the listening area.
Fig. 40 shows a plan view example of another audio environment, which in this example is a living space. As with the other figures provided herein, the types and numbers of elements shown in fig. 40 are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements.
According to this example, the environment 4000 includes a living room 4010 at the upper left, a kitchen 4015 at the lower center, and a bedroom 4022 at the lower right. The boxes and circles distributed throughout the living space represent a set of loudspeakers 4005a-4005h, at least some of which may be intelligent loudspeakers in some embodiments, placed at spatially convenient locations, but not in compliance with any standard prescribed layout (arbitrary placement). In some examples, television 4030 may be configured to at least partially implement one or more disclosed embodiments. In this example, environment 4000 includes cameras 4011a-4011e distributed throughout the environment. In some implementations, one or more intelligent audio devices in environment 4000 may also include one or more cameras. The one or more intelligent audio devices may be single-use audio devices or virtual assistants. In some such examples, one or more cameras of optional sensor system 180 (see fig. 1B) may reside in or on television 4030, in a mobile phone, or in a smart speaker (such as one or more of loudspeakers 4005b, 4005d, 4005e, or 4005h). Although cameras 4011a-4011e are not shown in every depiction of environment 4000 presented in this disclosure, in some implementations the environment 4000 may nonetheless include one or more cameras.
Automatic positioning of audio devices
The present assignee has developed several speaker positioning techniques for movie theatres and households that are excellent solutions in the use cases they were designed for. Some such methods are based on a time of flight derived from an impulse response between the sound source and the microphone(s) that are substantially co-located with each loudspeaker. Although system delays in the recording and playback chain can also be estimated, sample synchronization between clocks is required and known test stimuli are required to estimate the impulse response.
More recent source localization examples relax these constraints by requiring only intra-device microphone synchronization rather than inter-device synchronization. Furthermore, some such methods no longer require the transfer of audio between devices, relying instead on low-bandwidth messaging, such as the detected time of arrival (TOA, also known as "time of flight") of the direct (non-reflected) sound or the detected dominant direction of arrival (DOA) of the direct sound. Each method has potential advantages and potential disadvantages. For example, some previously deployed TOA methods can determine the device geometry only up to an unknown translation, rotation, and reflection about one of the three axes. If there is only one microphone per device, the rotation of the individual devices is also unknown. Some previously deployed DOA methods can determine the device geometry only up to an unknown translation, rotation, and scaling. While some such methods may produce satisfactory results under ideal conditions, the robustness of such methods to measurement errors has not been demonstrated.
Some embodiments disclosed in the present application allow for the localization of a set of intelligent audio devices based on 1) DOA between each pair of audio devices in an audio environment and 2) minimization of nonlinear optimization problems designed for input data type 1). Other embodiments disclosed in this application allow for the localization of a set of intelligent audio devices based on 1) DOA between each pair of audio devices in the system, 2) TOA between each pair of devices, and 3) minimization of nonlinear optimization problems designed for input data types 1) and 2).
Fig. 41 shows an example of a geometric relationship between four audio devices in an environment. In this example, audio environment 4100 is a room that includes a television 4101 and audio devices 4105a, 4105b, 4105c, and 4105 d. According to this example, audio devices 4105a-4105d are located at sites 1 through 4, respectively, of audio environment 4100. As with other examples disclosed herein, the types, numbers, locations, and orientations of the elements shown in fig. 41 are by way of example only. Other embodiments may have different types, numbers, and arrangements of elements, e.g., more or fewer audio devices, audio devices in different locations, audio devices with different capabilities, etc.
In such an implementation, each of the audio devices 4105a-4105d is a smart speaker comprising a microphone system and a speaker system, wherein the speaker system comprises at least one speaker. In some embodiments, each microphone system includes an array of at least three microphones. According to some implementations, television 4101 may include a speaker system and/or a microphone system. In some such embodiments, an automatic positioning method may be used to automatically position the television 4101 or a portion of the television 4101 (e.g., a television loudspeaker, television transceiver, etc.), for example, as described below with reference to audio devices 4105a-4105d.
Some embodiments described in this disclosure allow for the automatic positioning of a group of audio devices, such as audio devices 4105a-4105d shown in fig. 41, based on the direction of arrival (DOA) between each pair of audio devices, the time of arrival (TOA) of the audio signals between each pair of devices, or the DOA and TOA of the audio signals between each pair of devices. In some cases, as in the example shown in fig. 41, each audio device is enabled with at least one drive unit and one microphone array that is capable of providing a direction of arrival of incoming sound. According to this example, a double-headed arrow 4110ab represents sound transmitted by the audio device 4105a and received by the audio device 4105b, and sound transmitted by the audio device 4105b and received by the audio device 4105 a. Similarly, two-way arrows 4110ac, 4110ad, 4110bc, 4110bd, and 4110cd represent sound transmitted and received by the audio device 4105a and the audio device 4105c, sound transmitted and received by the audio device 4105a and the audio device 4105d, sound transmitted and received by the audio device 4105b and the audio device 4105c, sound transmitted and received by the audio device 4105b and the audio device 4105d, and sound transmitted and received by the audio device 4105c and the audio device 4105d, respectively.
In this example, each of the audio devices 4105a-4105d has an orientation, represented by arrows 4115a-4115d, that can be defined in various ways. For example, the orientation of an audio device having a single loudspeaker may correspond to the direction in which the single loudspeaker faces. In some examples, the orientation of an audio device having a plurality of loudspeakers facing in different directions may be indicated by the direction in which one of the loudspeakers faces. In other examples, the orientation of an audio device having a plurality of loudspeakers facing in different directions may be indicated by the direction of a vector corresponding to the sum of the audio outputs in the different directions that each of the plurality of loudspeakers faces. In the example shown in FIG. 41, the orientation of arrows 4115a-4115d is defined with reference to a Cartesian coordinate system. In other examples, the orientation of arrows 4115a-4115d may be defined with reference to another type of coordinate system, such as a spherical or cylindrical coordinate system.
In this example, the television 4101 includes an electromagnetic interface 4103 configured to receive electromagnetic waves. In some examples, electromagnetic interface 4103 may be configured to transmit and receive electromagnetic waves. According to some implementations, at least two of the audio devices 4105a-4105d may include an antenna system configured as a transceiver. The antenna system may be configured to transmit and receive electromagnetic waves. In some examples, the antenna system includes an antenna array having at least three antennas. Some embodiments described in this disclosure allow for automatically locating a group of devices, such as audio devices 4105a-4105d and/or television 4101 shown in fig. 41, based at least in part on the DOAs of electromagnetic waves transmitted between the devices. Thus, the bi-directional arrows 4110ab, 4110ac, 4110ad, 4110bc, 4110bd, and 4110cd may also represent electromagnetic waves transmitted between the audio devices 4105a-4105 d.
According to some examples, an antenna system of a device (such as an audio device) may be co-located with a loudspeaker of the device, e.g., adjacent to the loudspeaker. In some such examples, the antenna system orientation may correspond to a loudspeaker orientation. Alternatively or additionally, the antenna system of the device may have a known or predetermined orientation with respect to one or more loudspeakers of the device.
In this example, the audio devices 4105a-4105d are configured to wirelessly communicate with each other and with other devices. In some examples, the audio devices 4105a-4105d can include network interfaces configured for communication between the audio devices 4105a-4105d and other devices via the internet. In some implementations, the automatic positioning process disclosed herein may be performed by a control system of one of the audio devices 4105a-4105 d. In other examples, the auto-positioning process may be performed by another device of the audio environment 4100, such as a device sometimes referred to as a smart home hub, that is configured to wirelessly communicate with the audio devices 4105a-4105 d. In other examples, the automatic positioning process may be performed at least in part by a device (such as a server) external to the audio environment 4100 based on information received from one or more of the audio devices 4105a-4105d and/or the smart home hub.
Fig. 42 shows an audio transmitter located in the audio environment of fig. 41. Some embodiments provide for automatic positioning of one or more audio emitters, such as person 4205 of fig. 42. In this example, person 4205 is located at location 5. Here, the sound emitted by the person 4205 and received by the audio device 4105a is represented by a one-way arrow 4210 a. Similarly, sounds emitted by person 4205 and received by audio devices 4105b, 4105c, and 4105d are represented by unidirectional arrows 4210b, 4210c, and 4210 d. The audio emitter may be located based on the DOA of the audio emitter sounds captured by the audio devices 4105a-4105d and/or the television 4101, based on the TOA differences of the audio emitter sounds measured by the audio devices 4105a-4105d and/or the television 4101, or based on both the DOA and the TOA differences.
Alternatively or additionally, some embodiments may provide for automatic positioning of one or more electromagnetic wave emitters. Some embodiments described in this disclosure allow one or more electromagnetic wave emitters to be automatically positioned based at least in part on the DOA of electromagnetic waves transmitted by the one or more electromagnetic wave emitters. If an electromagnetic wave emitter were at location 5, the electromagnetic waves emitted by the electromagnetic wave emitter and received by the audio devices 4105a, 4105b, 4105c, and 4105d could likewise be represented by the unidirectional arrows 4210a, 4210b, 4210c, and 4210d.
Fig. 43 shows an audio receiver located in the audio environment of fig. 41. In this example, the microphone of the smart phone 4305 is enabled, but the speaker of the smart phone 4305 is not currently emitting sound. Some embodiments provide for automatic positioning of one or more passive audio receivers (such as the smart phone 4305 of fig. 43) when the smart phone 4305 is not sounding. Here, the sound emitted by the audio device 4105a and received by the smart phone 4305 is represented by a one-way arrow 4310 a. Similarly, sounds emitted by the audio devices 4105b, 4105c, and 4105d and received by the smartphone 4305 are represented by unidirectional arrows 4310b, 4310c, and 4310 d.
If the audio receiver is equipped with a microphone array and is configured to determine the DOA of received sound, the audio receiver may be located based at least in part on the DOA of the sound emitted by the audio devices 4105a-4105d and captured by the audio receiver. In some examples, the audio receiver may be located based at least in part on the differences in TOA, at the audio receiver, of the sounds from the intelligent audio devices, whether or not the audio receiver is equipped with a microphone array. Still other embodiments may allow for automatic positioning of a set of intelligent audio devices, one or more audio transmitters, and one or more receivers, based on DOA alone or on both DOA and TOA, by combining the above methods.
Direction of arrival positioning
FIG. 44 is a flowchart outlining another example of a method that may be performed by a control system of an apparatus such as that shown in FIG. 1B. As with other methods described herein, the blocks of method 4400 are not necessarily performed in the order indicated. Furthermore, such methods may include more or fewer blocks than those shown and/or described.
Method 4400 is an example of an audio device localization process. In this example, method 4400 involves determining the position and orientation of two or more intelligent audio devices, each intelligent audio device including a loudspeaker system and a microphone array. According to this example, method 4400 involves determining the location and orientation of the smart audio devices based at least in part on DOA estimates from audio emitted by each smart audio device and captured by each other smart audio device. In this example, the initial block of method 4400 relies on the control system of each smart audio device being able to extract the DOA from the input audio obtained by the microphone array of that smart audio device, for example by using the time-of-arrival differences between the individual microphone capsules of the microphone array.
In this example, block 4405 relates to obtaining audio that is emitted by each intelligent audio device of the audio environment and captured by each other intelligent audio device of the audio environment. In some such examples, block 4405 may involve causing each smart audio device to emit a sound, which in some cases may be a sound having a predetermined duration, frequency content, or the like. This predetermined type of sound may be referred to herein as a structured source signal. In some implementations, the smart audio device may be or may include the audio devices 4105a-4105d of fig. 41.
In some such examples, block 4405 may involve sequential processing of causing a single smart audio device to emit sound while other smart audio devices "listen" to the sound. For example, referring to fig. 41, block 4405 may relate to: (a) Causing the audio device 4105a to emit sound and receiving microphone data corresponding to the emitted sound from the microphone arrays of the audio devices 4105b-4105 d; then (b) causing the audio device 4105b to emit sound and receiving microphone data corresponding to the emitted sound from the microphone arrays of the audio devices 4105a, 4105c, and 4105 d; then (c) causing the audio device 4105c to emit sound and receiving microphone data corresponding to the emitted sound from the microphone arrays of the audio devices 4105a, 4105b, and 4105 d; then (d) causes the audio device 4105d to emit sound and receives microphone data corresponding to the emitted sound from the microphone arrays of the audio devices 4105a, 4105b, and 4105 c. The sounds emitted may be the same or different depending on the particular implementation.
In other examples, block 4405 may involve a simultaneous process of having all intelligent audio devices emit sound while other intelligent audio devices "listen" to the sound. For example, block 4405 may include performing the following steps simultaneously: (1) Causing the audio device 4105a to emit a first sound and receiving microphone data corresponding to the emitted first sound from the microphone array of the audio devices 4105b-4105 d; (2) Causing the audio device 4105b to emit a second sound different from the first sound and receiving microphone data corresponding to the emitted second sound from the microphone arrays of the audio devices 4105a, 4105c, and 4105 d; (3) Causing the audio device 4105c to emit a third sound different from the first sound and the second sound and receiving microphone data corresponding to the emitted third sound from the microphone arrays of the audio devices 4105a, 4105b, and 4105 d; (4) Causing the audio device 4105d to emit fourth sound different from the first sound, the second sound, and the third sound and receiving microphone data corresponding to the emitted fourth sound from the microphone arrays of the audio devices 4105a, 4105b, and 4105 c.
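A minimal orchestration sketch of the sequential option described above. The Device objects and their start_capture(), stop_capture(), and emit_sweep() methods are hypothetical placeholders for whatever playback/capture API the smart audio devices expose; they are not part of any real library.

```python
def sequential_measurement_round(devices, sweep_duration_s=2.0):
    """Each device emits in turn while every other device records; returns a
    dict mapping (emitter_name, receiver_name) to the captured array signal."""
    captures = {}
    for emitter in devices:
        listeners = [d for d in devices if d is not emitter]
        for rx in listeners:
            rx.start_capture()                 # begin recording before emission
        emitter.emit_sweep(sweep_duration_s)   # structured source signal
        for rx in listeners:
            captures[(emitter.name, rx.name)] = rx.stop_capture()
    return captures
```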
In some examples, block 4405 may be used to determine the mutual audibility of audio devices in an audio environment. Some detailed examples are disclosed herein.
In this example, block 4410 relates to a process of preprocessing an audio signal obtained via a microphone. For example, block 4410 may involve applying one or more filters, noise or echo suppression processing, or the like. Some additional examples of preprocessing are described below.
According to this example, block 4415 involves determining DOA candidates from the preprocessed audio signals from block 4410. For example, if block 4405 relates to transmitting and receiving structured source signals, block 4415 may relate to one or more deconvolution methods for generating impulse responses and/or "pseudoranges", from which the arrival-time differences of the main peaks may be used, in combination with the known microphone array geometry of the intelligent audio device, to estimate DOA candidates.
However, not all embodiments of method 4400 involve obtaining microphone signals based on the emission of a predetermined sound. Thus, some examples of block 4415 use "blind" approaches that operate on arbitrary audio signals, such as steered response power, receiver-side beamforming, or other similar methods, from which one or more DOAs may be extracted by peak picking. Some examples are described below. It should be appreciated that although DOA data may be determined either via a blind approach or using a structured source signal, in most cases TOA data can only be determined using a structured source signal. Furthermore, more accurate DOA information can generally be obtained using structured source signals.
According to this example, block 4420 involves selecting one DOA corresponding to the sound emitted by each of the other intelligent audio devices. In many cases, the microphone array may detect both direct arrival and reflected sound transmitted by the same audio device. Block 4420 may involve selecting an audio signal that most likely corresponds to the directly transmitted sound. Some additional examples of determining a DOA candidate and selecting a DOA from two or more DOA candidates are described below.
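The following sketch illustrates one "blind" option for block 4415: a far-field steered-response-power scan over azimuth followed by simple peak picking to obtain DOA candidates. It is a simplified illustration (no PHAT weighting, no subbanding) rather than the method required by any particular embodiment, and the function and parameter names are hypothetical.

```python
import numpy as np

def srp_doa_candidates(mic_signals, mic_xy, fs, c=343.0, n_angles=360, n_peaks=3):
    """Steered-response-power scan over azimuth for a small microphone array.

    mic_signals: (M, T) microphone signals; mic_xy: (M, 2) capsule positions in
    the array's local frame (metres). Returns up to n_peaks candidate azimuths
    in radians, strongest first.
    """
    mic_signals = np.asarray(mic_signals, dtype=float)
    mic_xy = np.asarray(mic_xy, dtype=float)
    M, T = mic_signals.shape
    spectra = np.fft.rfft(mic_signals, axis=1)                 # (M, F)
    freqs = np.fft.rfftfreq(T, 1.0 / fs)                       # (F,)
    angles = np.linspace(-np.pi, np.pi, n_angles, endpoint=False)
    power = np.empty(n_angles)
    for i, theta in enumerate(angles):
        direction = np.array([np.cos(theta), np.sin(theta)])
        delays = mic_xy @ direction / c                        # far-field delays
        steering = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
        power[i] = np.sum(np.abs(np.sum(spectra * steering, axis=0)) ** 2)
    # Keep the strongest local maxima as DOA candidates.
    peaks = [i for i in np.argsort(power)[::-1]
             if power[i] >= power[(i - 1) % n_angles]
             and power[i] >= power[(i + 1) % n_angles]]
    return angles[peaks[:n_peaks]]
```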
In this example, block 4425 involves receiving DOA information generated by the implementation of block 4420 of each intelligent audio device (in other words, receiving a set of DOAs corresponding to sound transmitted from each intelligent audio device to each other intelligent audio device in the audio environment) and performing a localization method (e.g., implementing a localization algorithm via a control system) based on the DOA information. In some disclosed embodiments, block 4425 relates to minimizing a cost function, possibly subject to some constraints and/or weights, e.g., as described below with reference to fig. 45. In some such examples, the cost function receives as input data the DOA value from each smart audio device to each other smart device and returns as output the estimated location and estimated orientation of each smart audio device. In the example shown in fig. 44, block 4430 represents the estimated smart audio device location and the estimated smart audio device orientation generated in block 4425.
FIG. 45 is a flowchart outlining another example of a method for automatically estimating device location and orientation based on DOA data. For example, method 4500 may be performed by implementing a positioning algorithm via a control system of an apparatus such as that shown in fig. 1B. As with other methods described herein, the blocks of method 4500 are not necessarily performed in the order indicated. Furthermore, such methods may include more or fewer blocks than those shown and/or described.
According to this example, DOA data is obtained in block 4505. According to some embodiments, block 4505 may involve obtaining acoustic DOA data, e.g., as described above with reference to blocks 4405-4420 of FIG. 44. Alternatively or additionally, block 4505 may involve obtaining DOA data corresponding to electromagnetic waves transmitted and received by each of a plurality of devices in an environment.
In this example, the positioning algorithm receives as input the DOA data obtained in block 4505 from each smart device to each other smart device in the audio environment, as well as any configuration parameters 4510 specified for the audio environment. In some examples, optional constraints 4525 may be applied to the DOA data. The configuration parameters 4510, minimization weights 4515, optional constraints 4525, and seed layout 4530 may be obtained from memory, for example, by a control system executing software for implementing the cost function 4520 and the nonlinear search algorithm 4535. Configuration parameters 4510 may include, for example, data corresponding to maximum room dimensions, loudspeaker layout constraints, and external inputs to fix the global translation (2 parameters), global rotation (1 parameter), and global scale (1 parameter), and so forth.
According to this example, configuration parameters 4510 are provided to a cost function 4520 and a nonlinear search algorithm 4535. In some examples, configuration parameters 4510 are provided to optional constraints 4525. In this example, cost function 4520 accounts for differences between measured DOA and the DOA estimated by the positioning solution of the optimizer.
In some embodiments, optional constraints 4525 impose restrictions on possible audio device locations and/or orientations, such as conditions that impose a minimum distance of audio devices from each other. Alternatively or additionally, optional constraint 4525 may impose restrictions on virtual minimization variables introduced for convenience, e.g., as described below.
In this example, the minimization weight 4515 is also provided to the nonlinear search algorithm 4535. Some examples are described below.
According to some embodiments, the nonlinear search algorithm 4535 is an algorithm that can find a local solution to a continuous optimization problem of the form:

minimize C(x), x ∈ R^n,

subject to g_L ≤ g(x) ≤ g_U

and x_L ≤ x ≤ x_U.

In the above expressions, C(x): R^n → R represents the cost function 4520, and g(x): R^n → R^m represents the constraint function corresponding to the optional constraints 4525. The vectors g_L and g_U represent the lower and upper bounds of the constraints, and the vectors x_L and x_U represent the bounds of the variable x.
The nonlinear search algorithm 4535 may vary according to the particular embodiment. Examples of the nonlinear search algorithm 4535 include gradient descent methods, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, the Interior Point Optimization (IPOPT) method, and the like. While some nonlinear search algorithms require only the values of the cost function and constraints, other methods also require their first derivatives (gradients, Jacobians), and still others additionally require second derivatives (Hessians). If derivatives are required, they may be provided explicitly, or they may be calculated using automatic or numerical differentiation techniques.
Some nonlinear search algorithms require seed point information to initiate minimization as suggested by the seed layout 4530 provided to the nonlinear search algorithm 4535 in fig. 45. In some examples, the seed point information may be provided as a layout consisting of the same number of intelligent audio devices with corresponding locations and orientations (in other words, the same number as the actual number of intelligent audio devices that obtained the DOA data). The location and orientation may be arbitrary and need not be the actual or approximate location and orientation of the smart audio device. In some examples, the seed point information may indicate a smart audio device location along an axis or another arbitrary line of the audio environment, a smart audio device location along a circle, rectangle, or other geometric shape within the audio environment, and so forth. In some examples, the seed point information may indicate any smart audio device orientation, which may be a predetermined smart audio device orientation or a random smart audio device orientation.
In some embodiments, the cost function 4520 may be expressed in terms of complex-plane variables as follows:

C_DOA(x, z) = Σ_{n=1}^{N} Σ_{m=1, m≠n}^{N} w_nm^DOA | Z*_nm z*_n (x_m − x_n) / |x_m − x_n| − 1 |²

wherein the asterisks indicate complex conjugation, the vertical bars indicate absolute value, and wherein:
·Z_nm = exp(i DOA_nm) represents, as a complex-plane value, the direction of arrival of smart device m as measured from device n, where i represents the imaginary unit;
·x_n = x_nx + i x_ny represents a complex-plane value encoding the x and y positions of smart device n;
·z_n = exp(i α_n) represents a complex value encoding the orientation angle α_n of smart device n;
·w_nm^DOA represents the weight applied to the DOA_nm measurement;
·N represents the number of intelligent audio devices for which DOA data are obtained; and
·x = (x_1, …, x_N) and z = (z_1, …, z_N) represent the vectors of complex positions and complex orientations, respectively, of all N intelligent audio devices.
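A sketch of how the above minimization might be set up in Python with a generic nonlinear search (here scipy.optimize.minimize with BFGS). The cost below follows the reconstructed form shown above, which is an assumption consistent with the listed definitions and the invariances discussed below, not necessarily the exact expression used in any particular embodiment; the seed layout places the devices on a unit circle with arbitrary orientations.

```python
import numpy as np
from scipy.optimize import minimize

def doa_cost(params, Z, W):
    """DOA cost in complex-plane form. params packs [x_1x, x_1y, alpha_1, ...];
    Z[n, m] = exp(1j * DOA_nm) as measured at device n; W holds weights w_nm
    (zero on the diagonal and for missing or unreliable measurements)."""
    N = Z.shape[0]
    p = params.reshape(N, 3)
    x = p[:, 0] + 1j * p[:, 1]        # complex positions
    z = np.exp(1j * p[:, 2])          # complex orientations
    cost = 0.0
    for n in range(N):
        for m in range(N):
            if n == m or W[n, m] == 0.0:
                continue
            bearing = (x[m] - x[n]) / max(abs(x[m] - x[n]), 1e-9)
            cost += W[n, m] * abs(np.conj(Z[n, m]) * np.conj(z[n]) * bearing - 1.0) ** 2
    return cost

def seed_layout(N, rng=None):
    """Arbitrary seed: devices on a unit circle with random orientations."""
    rng = rng or np.random.default_rng(0)
    theta = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
    seed = np.column_stack([np.cos(theta), np.sin(theta),
                            rng.uniform(0.0, 2.0 * np.pi, N)])
    return seed.ravel()

# Example call, with Z and W built from the selected DOA candidates:
# result = minimize(doa_cost, seed_layout(N), args=(Z, W), method="BFGS")
```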
According to this example, the result of the minimization is device location data 4540, with x_k indicating the 2D location of smart device k (representing 2 real unknowns per device), and device orientation data 4545, with z_k indicating the orientation vector of smart device k (representing 2 additional real variables per device). Of the orientation vector, only the orientation angle α_k of smart device k is relevant to the problem (1 real unknown per device). Thus, in this example, each smart device has 3 relevant unknowns.
In some examples, the result evaluation block 4550 involves calculating a residual of the cost function at the result location and orientation. A relatively lower residual error indicates a relatively more accurate device location value. According to some embodiments, the result evaluation block 4550 may involve a feedback process. For example, some such examples may implement a feedback process that involves comparing the residual of a given DOA candidate combination with the residual of another DOA candidate combination, e.g., as explained in the DOA robustness measurements discussion below.
As described above, in some embodiments, block 4505 may involve obtaining acoustic DOA data as described above with reference to blocks 4405-4420 of fig. 44 that involve determining and selecting DOA candidates. Thus, FIG. 45 includes a dashed line from result evaluation block 4550 to block 4505 to represent one flow of the optional feedback process. Further, fig. 44 includes a dashed line from block 4430 (which may relate to outcome evaluation in some examples) to the DOA candidate selection block 4420, for representing the flow of another alternative feedback process.
In some embodiments, the nonlinear search algorithm 4535 may not accept complex valued variables. In this case, each complex-valued variable may be replaced by a pair of real variables.
In some embodiments, there may be additional a priori information about the availability or reliability of each DOA measurement. In some such examples, only a subset of all possible DOA elements may be used to locate the loudspeakers. For example, missing DOA elements may be masked with corresponding zero weights in the cost function. In some such examples, the weight w_nm may be 0 or 1, for example 0 for those measurements that are missing or considered unreliable and 1 for reliable measurements. In some other embodiments, the weight w_nm may take a continuous value from 0 to 1 as a function of the reliability of the DOA measurement. In those embodiments where no a priori information is available, the weights w_nm may simply be set to 1.
In some embodiments, the condition |z_k| = 1 (one condition per smart audio device) may be added as a constraint to ensure normalization of the vector indicating the smart audio device orientation. In other examples, these additional constraints may not be needed, and the vector indicating the orientation of the smart audio device may remain un-normalized. Other embodiments may add a condition regarding the proximity of the smart audio devices as a constraint, e.g., requiring |x_n − x_m| ≥ D, where D is the minimum distance between smart audio devices.
The minimization of the cost function described above does not fully determine the absolute positions and orientations of the intelligent audio devices. According to this example, the cost function remains unchanged under a global rotation (1 independent parameter), a global translation (2 independent parameters), and a global scaling (1 independent parameter) that affect the locations and orientations of all smart devices simultaneously. Such a global rotation, translation, and rescaling cannot be determined by minimizing the cost function. The different layouts related by these symmetry transformations are completely indistinguishable in this framework and are said to belong to the same equivalence class. Thus, the configuration parameters should provide criteria that allow a unique smart audio device layout to be defined as the representative of the entire equivalence class. In some embodiments, it may be advantageous to select the criteria such that the smart audio device layout defines a reference frame that is close to the reference frame of a listener near the reference listening position. Examples of such criteria are provided below. In some other examples, the criteria may be purely mathematical and deviate from a real reference frame.
The symmetry disambiguation criteria may include: a reference location, fixing the global translation symmetry (e.g., smart audio device 1 should be at the origin of coordinates); a reference orientation, fixing the two-dimensional rotational symmetry (e.g., smart device 1 should be oriented towards an audio environment area designated as the front, such as the location of the television 4101 in figs. 41-43); and a reference distance, fixing the global scaling symmetry (e.g., smart device 2 should lie at unit distance from smart device 1). A total of 4 parameters cannot be determined from the minimization problem in this example and should be provided as external inputs. Thus, in this example, there are 3N − 4 unknowns that can be determined from the minimization problem.
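A sketch of applying the three disambiguation criteria to a layout returned by the minimization. It assumes the "front" direction is taken to be +y and that positions are held as complex numbers, as in the cost function above; both assumptions are illustrative.

```python
import numpy as np

def disambiguate_layout(x, alpha, front_direction=np.pi / 2.0):
    """Translate so device 1 is at the origin, rotate so device 1 faces the
    assumed front direction, and scale so device 2 lies at unit distance.

    x: (N,) complex device positions; alpha: (N,) orientation angles in radians.
    """
    x = np.asarray(x, dtype=complex).copy()
    alpha = np.asarray(alpha, dtype=float).copy()
    x = x - x[0]                          # reference location
    rot = front_direction - alpha[0]      # reference orientation
    x = x * np.exp(1j * rot)
    alpha = alpha + rot
    scale = np.abs(x[1])                  # reference distance
    if scale > 0:
        x = x / scale
    return x, alpha
```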
As described above, in some examples, there may be one or more passive audio receivers and/or one or more audio transmitters equipped with a microphone array in addition to a set of intelligent audio devices. In this case, the localization process may use a technique to determine the location and orientation of the smart audio device, the emitter location, and the location and orientation of the passive receiver based on the DOA estimates from the audio emitted by each smart audio device and each emitter and captured by each other smart audio device and each passive receiver.
In some such examples, the positioning process may proceed in a similar manner as described above. In some cases, the positioning process may be based on the same cost function described above, which is shown below for the convenience of the reader:
C_DOA(x, z) = Σ_{n=1}^{N} Σ_{m=1, m≠n}^{N} w_nm^DOA | Z*_nm z*_n (x_m − x_n) / |x_m − x_n| − 1 |²
However, if the positioning process involves passive audio receivers and/or audio emitters that are not also receivers, the variables of the above equation need to be interpreted in a slightly different manner. N now represents the total number of devices, including N_smart smart audio devices, N_rec passive audio receivers, and N_emit emitters, such that N = N_smart + N_rec + N_emit. In some examples, the weights w_nm^DOA may have a sparse structure to mask the data missing due to passive receivers or emitter-only devices (or other audio sources without a receiver, such as humans), so that w_nm^DOA = 0 for all m if device n is an audio emitter without a receiver, and w_nm^DOA = 0 for all n if device m is a passive audio receiver. For smart audio devices and passive receivers, both a position and an orientation angle may be determined, whereas for an audio emitter only a position can be determined. The total number of unknowns is 3N_smart + 3N_rec + 2N_emit − 4.
Positioning of combined time of arrival and direction of arrival
In the following discussion, the differences between the DOA-based positioning process described above and the combined DOA and TOA positioning of this section will be emphasized. Those details not explicitly given may be assumed to be the same as those in the DOA-based positioning process described above.
FIG. 46 is a flowchart outlining one example of a method for automatically estimating device location and orientation based on DOA data and TOA data. For example, method 4600 may be performed by implementing a positioning algorithm via a control system of an apparatus such as that shown in fig. 1B. As with other methods described herein, the blocks of method 4600 are not necessarily performed in the order indicated. Furthermore, such methods may include more or fewer blocks than those shown and/or described.
According to this example, DOA data is obtained in blocks 4605-4620. According to some embodiments, blocks 4605-4620 may involve obtaining acoustic DOA data from a plurality of intelligent audio devices, e.g., as described above with reference to blocks 4405-4420 of FIG. 44. In some alternative implementations, blocks 4605-4620 may involve obtaining DOA data corresponding to electromagnetic waves transmitted and received by each of a plurality of devices in an environment.
However, in this example, block 4605 is also directed to obtaining TOA data. According to this example, the TOA data includes a measured TOA of audio emitted and received by each intelligent audio device in the audio environment (e.g., each pair of intelligent audio devices in the audio environment). In some embodiments involving the emission of a structured source signal, the audio used to extract TOA data may be the same as the audio used to extract DOA data. In other embodiments, the audio used to extract TOA data may be different from the audio used to extract DOA data.
According to this example, block 4616 involves detecting TOA candidates in the audio data, and block 4618 involves selecting a single TOA for each intelligent audio device pair from among the TOA candidates. Some examples are described below.
TOA data may be obtained using various techniques. One approach is to use room calibration audio sequences such as sweeps (e.g., logarithmic sine sweeps) or maximum length sequences (MLS). Alternatively, any of the foregoing sequences may be band-limited to the near-ultrasonic frequency range (e.g., 18 kHz to 24 kHz). In this frequency range, most standard audio equipment is capable of emitting and recording sound, but such a signal is not perceptible to humans because it lies above the normal range of human hearing. Some alternative implementations may involve recovering TOA elements from a signal hidden in a primary audio signal, such as a direct sequence spread spectrum signal.
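A sketch of one way such a calibration stimulus and a raw arrival-time estimate might be produced: a logarithmic sine sweep confined to the 18-24 kHz band, with the arrival time taken from the peak of the cross-correlation against the known stimulus. This is illustrative only; any device recording and playback delays still enter the measurement, which is why such delays appear as unknowns in the cost function discussed below.

```python
import numpy as np

def near_ultrasonic_sweep(fs=48000, duration_s=1.0, f0=18000.0, f1=24000.0):
    """Logarithmic sine sweep band-limited to the near-ultrasonic range.
    Note that f1 should stay at or below fs / 2."""
    t = np.arange(int(fs * duration_s)) / fs
    k = np.log(f1 / f0)
    phase = 2.0 * np.pi * f0 * duration_s / k * (np.exp(t / duration_s * k) - 1.0)
    return np.sin(phase)

def toa_from_recording(recording, stimulus, fs):
    """Arrival time of the known stimulus within a recording, taken as the lag
    of the strongest cross-correlation peak (relative to the recording start)."""
    corr = np.correlate(recording, stimulus, mode="full")
    lag = np.argmax(np.abs(corr)) - (len(stimulus) - 1)
    return lag / fs
```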
Given a set of DOA data from each intelligent audio device to each other intelligent audio device and a set of TOA data for each pair of intelligent audio devices, the localization method 4625 of FIG. 46 may be based on minimizing some cost function, possibly subject to some constraints. In this example, the localization method 4625 of fig. 46 receives the aforementioned DOA and TOA values as input data and outputs estimated location data and orientation data 4630 corresponding to the smart audio devices. In some examples, the localization method 4625 may also output the playback and recording delays of the smart audio devices, up to certain global symmetries that cannot be determined from the minimization problem. Some examples are described below.
FIG. 47 is a flowchart outlining another example of a method for automatically estimating device location and orientation based on DOA data and TOA data. For example, method 4700 may be performed by a control system implementing a positioning algorithm, such as the control system of the apparatus shown in fig. 1B. As with other methods described herein, the blocks of method 4700 are not necessarily performed in the order indicated. Furthermore, such methods may include more or fewer blocks than those shown and/or described.
Except as described below, in some examples, blocks 4705, 4710, 4715, 4720, 4725, 4730, 4735, 4740, 4745, and 4750 may be as described above with reference to blocks 4505, 4510, 4515, 4520, 4525, 4530, 4535, 4540, 4545, and 4550 of fig. 45. However, in this example, the cost function 4720 and the nonlinear optimization method 4735 are modified relative to the cost function 4520 and the nonlinear optimization method 4535 of fig. 45 to operate on the DOA data and the TOA data. In some examples, the TOA data of block 4708 may be obtained as described above with reference to fig. 46. Another difference compared to the process of fig. 45 is that in this example, the nonlinear optimization method 4735 also outputs recording and playback delay data 4747 corresponding to a smart audio device, for example, as described below. Thus, in some embodiments, the result evaluation block 4750 may be involved in evaluating DOA data and/or TOA data. In some such examples, the operations of block 4750 may include feedback processing involving both DOA data and/or TOA data. For example, some such examples may implement a feedback process that involves comparing the residual of a given TOA/DOA candidate combination with another TOA/DOA candidate combination, e.g., as explained in the following TOA/DOA robustness measurement discussion.
In some examples, the result evaluation block 4750 relates to calculating residuals of the cost function at the result location and orientation. A relatively lower residual error generally indicates a relatively more accurate device location value. According to some embodiments, the result evaluation block 4750 may involve a feedback process. For example, some such examples may implement a feedback process that involves comparing the residual of a given TOA/DOA candidate combination with another TOA/DOA candidate combination, e.g., as explained in the following TOA and DOA robustness measurement discussion.
Thus, fig. 46 includes dashed lines from block 4630 (which may involve a result evaluation in some examples) to the DOA candidate selection block 4620 and to the TOA candidate selection block 4618 to represent the flow of an optional feedback process. In some implementations, block 4705 may involve obtaining acoustic DOA data as described above with reference to blocks 4605-4620 of FIG. 46 that involve determining DOA candidates and selecting DOA candidates. In some examples, block 4708 may involve obtaining acoustic TOA data as described above with reference to blocks 4605-4618 of fig. 46 that involve determining TOA candidates and selecting TOA candidates. Although not shown in fig. 47, some optional feedback processing may involve returning from the result evaluation block 4750 to block 4705 and/or block 4708.
According to this example, the positioning algorithm proceeds by minimizing a cost function, possibly subject to some constraints, and may be described as follows. In this example, the positioning algorithm receives as input DOA data 4705 and TOA data 4708, as well as configuration parameters 4710 and possibly some optional constraints 4725 specified for the listening environment. In this example, the cost function takes into account the difference between the measured and estimated DOAs, and the difference between the measured and estimated TOAs. In some embodiments, constraints 4725 impose constraints on possible device locations, orientations, and/or delays, such as imposing conditions of minimum distance of audio devices from each other and/or imposing conditions that some device delays should be zero.
In some embodiments, the cost function may be expressed as follows:
C(x, z, l, k) = W_DOA C_DOA(x, z) + W_TOA C_TOA(x, l, k)

In the above equation, l = (l_1, …, l_N) and k = (k_1, …, k_N) represent the vectors of playback delays and recording delays, respectively, for each device, and W_DOA and W_TOA represent global weights (also referred to as prefactors) for the DOA and TOA terms, respectively, reflecting the relative importance of each of the two terms. In some such examples, the TOA cost function may be expressed as:
C_TOA(x, l, k) = Σ_{n=1}^{N} Σ_{m=1, m≠n}^{N} w_nm^TOA ( |x_m − x_n| / c + l_m − k_n − TOA_nm )²

wherein:

·TOA_nm represents the measured arrival time at smart device n of the signal from smart device m;
·w_nm^TOA represents the weight applied to the TOA_nm measurement; and
·c represents the speed of sound.
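A sketch of the TOA term, matching the reconstructed expression above: the modelled arrival time is the propagation time plus the emitter's playback delay minus the receiver's recording offset (this sign convention is an assumption, chosen to be consistent with the global delay symmetry discussed below). It would be combined with the DOA cost sketched earlier via the global weights W_DOA and W_TOA.

```python
import numpy as np

def toa_cost(x, l, k, TOA, W_toa, c=343.0):
    """TOA term of the combined cost.

    x: (N,) complex device positions; l, k: (N,) playback and recording delays
    in seconds; TOA[n, m]: measured arrival time at device n of the signal from
    device m; W_toa: per-measurement weights (zero for missing data).
    """
    N = len(x)
    cost = 0.0
    for n in range(N):
        for m in range(N):
            if n == m or W_toa[n, m] == 0.0:
                continue
            modelled = abs(x[m] - x[n]) / c + l[m] - k[n]
            cost += W_toa[n, m] * (modelled - TOA[n, m]) ** 2
    return cost

# Combined cost, per the expression above:
# C = W_DOA * doa_cost(...) + W_TOA * toa_cost(...)
```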
There are up to 5 real unknowns per intelligent audio device: the device location x_n (2 real unknowns per device), the device orientation α_n (1 real unknown per device), and the playback and recording delays l_n and k_n (2 additional unknowns per device). Of these, only the device locations and delays are relevant to the TOA portion of the cost function. If constraints or associations between the delays are known a priori, the number of effective unknowns may be reduced in some embodiments.
In some examples, there may be additional a priori information, e.g., about the availability or reliability of each TOA measurement. In some of these examples, the weight w_nm^TOA may be 0 or 1, for example, 0 for those measurements that are not available (or deemed unreliable) and 1 for reliable measurements. In this way, the device locations may be estimated using only a subset of all possible DOA and/or TOA elements. In some other embodiments, the weights may take a continuous value from 0 to 1, for example, as a function of the reliability of the TOA measurement. In some examples where no a priori reliability information is available, the weights may simply be set to 1.
According to some embodiments, one or more additional constraints may be placed on possible values of the delay and/or on the relationship between different delays.
In some examples, the location of the audio device may be measured in standard length units (such as meters) and the delay and arrival time may be indicated in standard time units (such as seconds). However, in general, the nonlinear optimization method works better when the scales of variation of the different variables used in the minimization process are on the same order of magnitude. Thus, some embodiments may involve readjusting the position measurement such that the smart device position ranges from-1 to 1, and readjusting the delay and arrival time such that these values also range from-1 to 1.
The minimization of the cost function described above does not fully determine the absolute positions, orientations, or delays of the intelligent audio devices. The TOA information provides an absolute distance scale, meaning that the cost function is no longer invariant under a scale transformation, but it remains invariant under global rotation and translation. Furthermore, the delays are subject to an additional global symmetry: if the same global quantity is added to all playback and recording delays simultaneously, the cost function remains unchanged. These global transformations cannot be determined by minimizing the cost function. As before, the configuration parameters should provide criteria that allow a unique device layout to be defined as the representative of the entire equivalence class.
In some examples, symmetry disambiguation criteria may include the following: a reference location, a fixed global translational symmetry (e.g., the smart device 1 should be at the origin of coordinates); with reference to orientation, two-dimensional rotational symmetry is fixed (e.g., the smart device 1 should be oriented forward); and a reference delay (e.g. the recording delay of device 1 should be zero). In general, there are 4 parameters in this example that cannot be determined from the minimization problem and should be provided as external inputs. Thus, there are 5N-4 unknowns that can be determined from the minimization problem.
In some implementations, in addition to a set of intelligent audio devices, there may be one or more passive audio receivers and/or one or more audio emitters that are not equipped with an operational microphone array. The inclusion of the time delays as minimization variables allows some of the disclosed methods to locate receivers and emitters whose transmit and receive times are not precisely known. In some such embodiments, the TOA cost function described above may be used. For the convenience of the reader, this cost function is shown again below:

C_TOA(x, l, k) = Σ_{n=1}^{N} Σ_{m=1, m≠n}^{N} w_nm^TOA ( |x_m − x_n| / c + l_m − k_n − TOA_nm )²

As described above with reference to the DOA cost function, if this cost function is used for position estimation involving passive receivers and/or emitters, its variables need to be interpreted in a slightly different manner. N now represents the total number of devices, including N_smart smart audio devices, N_rec passive audio receivers, and N_emit emitters, such that N = N_smart + N_rec + N_emit. The weights w_nm^TOA may have a sparse structure to mask the data missing due to receiver-only or emitter-only devices, e.g., so that w_nm^TOA = 0 for all m if device n is an audio emitter only, and w_nm^TOA = 0 for all n if device m is an audio receiver only. According to some embodiments, for smart audio devices the position, orientation, and recording and playback delays must be determined; for passive receivers the position, orientation, and recording delay must be determined; and for audio emitters the position and playback delay must be determined. According to some such examples, the total number of unknowns is thus 5N_smart + 4N_rec + 3N_emit − 4.
Disambiguation of global translation and rotation
The solutions to both the DOA-only problem and the combined TOA-and-DOA problem are subject to global translational and rotational ambiguity. In some examples, the translational ambiguity can be resolved by treating an emitter-only source as the listener and translating all devices such that the listener is located at the origin.
The rotational ambiguity can be resolved by imposing additional constraints on the solution. For example, some multi-loudspeaker environments may include television (TV) loudspeakers and a sofa positioned for viewing the TV. After locating the loudspeakers in the environment, some methods may involve finding the vector connecting the listener to the TV, which defines the viewing direction. Some such methods may then involve having the TV emit sound from its loudspeaker(s) and/or prompting the user to walk up to the TV, and locating the user's voice. Some implementations may involve rendering an audio object that pans around the environment. The user may provide user input (e.g., saying "stop") to indicate when the audio object is at one or more predetermined locations within the environment, such as the front of the environment, the TV location of the environment, etc. Some embodiments involve an application on a cellular phone equipped with an inertial measurement unit that prompts the user to point the phone in two defined directions: first towards a particular device, e.g., a device with an illuminated LED, and second towards the user's desired viewing direction, such as the front of the environment, the TV location of the environment, etc. Some detailed disambiguation examples will now be described with reference to figs. 48A-48D.
Fig. 48A illustrates another example of an audio environment. According to some examples, the audio device location data output by one of the disclosed localization methods may include an estimate of the audio device location for each of the audio devices 1-5, referencing the audio device coordinate system 4807. In this embodiment, the audio device coordinate system 4807 is a cartesian coordinate system having the location of the microphone of the audio device 2 as its origin. Here, the x-axis of the audio device coordinate system 4807 corresponds to a line 4803 between the location of the microphone of the audio device 2 and the location of the microphone of the audio device 1.
In this example, the listener location is determined by prompting a listener 4805, shown sitting on a sofa 4833, to make one or more utterances 4827 (e.g., via audio cues from one or more loudspeakers in environment 4800 a) and estimating the listener location from time of arrival (TOA) data. The TOA data corresponds to microphone data obtained by a plurality of microphones in the environment. In this example, the microphone data corresponds to the detection of one or more utterances 4827 by microphones of at least some (e.g., 3, 4, or all 5) of the audio devices 1-5.
Alternatively or additionally, the listener location may be estimated from the DOA data provided by the microphones of at least some (e.g., 2, 3, 4, or all 5) of the audio devices 1-5. According to some such examples, the listener location may be determined from the intersection of lines 4809a, 4809b, etc., corresponding to the DOA data.
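A sketch of estimating the listener location from the DOA data alone, by intersecting the rays from each device toward the talker in a least-squares sense (the lines 4809a, 4809b, etc.). It assumes the device positions and orientations have already been estimated and that at least two non-parallel rays are available; the function and argument names are illustrative.

```python
import numpy as np

def listener_position_from_doas(device_xy, device_alpha, doa_local):
    """Least-squares intersection of DOA rays from each device to the talker.

    device_xy: (N, 2) device positions; device_alpha: (N,) device orientations
    (radians); doa_local: (N,) DOA of the utterance in each device's local frame.
    Minimises the summed squared perpendicular distance to the N rays.
    """
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, alpha, theta in zip(np.asarray(device_xy, dtype=float),
                               device_alpha, doa_local):
        g = alpha + theta                      # bearing in the global frame
        d = np.array([np.cos(g), np.sin(g)])
        P = np.eye(2) - np.outer(d, d)         # projector onto the ray's normal
        A += P
        b += P @ p
    return np.linalg.solve(A, b)
```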
According to this example, the listener position corresponds to the origin of the listener coordinate system 4820. In this example, the listener angular orientation data is indicated by the y' axis of the listener coordinate system 4820, which corresponds to the line 4813a between the listener's head 4810 (and/or the listener's nose 4825) and the sound bar 4830 of the television 4801. In the example shown in fig. 48A, line 4813a is parallel to the y' axis. Thus, the angle θ represents the angle between the y axis and the y' axis. Although the origin of the audio device coordinate system 4807 is shown as corresponding to audio device 2 in fig. 48A, some embodiments involve co-locating the origins of the audio device coordinate system 4807 and the listener coordinate system 4820 before the audio device coordinates are rotated by the angle θ about the origin of the listener coordinate system 4820. Such co-location may be performed by a coordinate transformation from the audio device coordinate system 4807 to the listener coordinate system 4820.
In some examples, the location of the sound bar 4830 and/or the television 4801 may be determined by having the sound bar emit sound and estimating the location of the sound bar from the DOA and/or TOA data (which may correspond to detecting sounds emitted by at least some (e.g., 3, 4, or all 5) microphones in the audio devices 1-5). Alternatively or additionally, the location of the sound bar 4830 and/or the television 4801 may be determined by prompting the user to walk to the TV and locate the user's voice via the DOA and/or TOA data (which may correspond to detecting sounds emitted by at least some (e.g., 3, 4, or all 5) of the microphones of the audio devices 1-5). Some such methods may involve applying a cost function, e.g., as described above. Some such methods may involve triangulation. Such an example may be beneficial in situations where the sound bar 4830 and/or the television 4801 does not have an associated microphone.
In some other examples where the sound bar 4830 and/or the television 4801 does have an associated microphone, the location of the sound bar 4830 and/or the television 4801 may be determined according to TOA and/or DOA methods (such as the methods disclosed herein). According to some such methods, the microphone may be co-located with the sound bar 4830.
According to some embodiments, the sound bar 4830 and/or the television 4801 may have an associated camera 4811. The control system may be configured to capture an image of the listener's head 4810 (and/or the listener's nose 4825). In some such examples, the control system may be configured to determine the line 4813a between the listener's head 4810 (and/or the listener's nose 4825) and the camera 4811. The listener angular orientation data may correspond to line 4813a. Alternatively or additionally, the control system may be configured to determine the angle θ between line 4813a and the y axis of the audio device coordinate system.
Fig. 48B shows an additional example of determining listener angular orientation data. According to this example, a listener location has been determined. Here, the control system is controlling the loudspeakers of the environment 4800b to render the audio object 4835 at various locations within the environment 4800b. In some such examples, the control system may cause the loudspeakers to render audio object 4835 such that audio object 4835 appears to rotate around listener 4805, for example by rendering audio object 4835 such that it appears to rotate about the origin of the listener coordinate system 4820. In this example, the curved arrow 4840 shows a portion of the trajectory of audio object 4835 as it rotates around listener 4805.
According to some such examples, listener 4805 can provide user input (e.g., say "stop") indicating when audio object 4835 is in the direction that listener 4805 is facing. In some such examples, the control system may be configured to determine a line 4813b between the listening location and the location of the audio object 4835. In this example, line 4813b corresponds to the y 'axis of the listener's coordinate system, which indicates the direction in which listener 4805 is facing. In alternative implementations, the listener 4805 can provide user input indicating when the audio object 4835 is in front of the environment, at a TV site of the environment, at an audio device site, and the like.
Fig. 48C shows an additional example of determining listener angular orientation data. According to this example, a listener location has been determined. Here, listener 4805 is using handheld device 4845 to provide input regarding the viewing direction of listener 4805 by pointing handheld device 4845 at television 4801 or sound bar 4830. In this example, the dashed outline of handheld device 4845 and listener arm indicates: sometime before listener 4805 points handheld device 4845 at television 4801 or sound bar 4830, listener 4805 is pointing handheld device 4845 at audio device 2. In other examples, listener 4805 may have pointed handheld device 4845 at another audio device, such as audio device 1. According to this example, handheld device 4845 is configured to determine an angle α between audio device 2 and television 4801 or soundbar 4830 that approximates the angle between audio device 2 and the viewing direction of listener 4805.
In some examples, handheld device 4845 may be a cellular telephone that includes an inertial sensor system and a wireless interface configured to communicate with a control system that controls the audio devices of environment 4800c. In some examples, handheld device 4845 may run an application or "app" configured to control handheld device 4845 to perform the necessary functions, for example by providing user prompts (e.g., via a graphical user interface), by receiving input indicating that handheld device 4845 is pointing in a desired direction, by storing and/or transmitting corresponding inertial sensor data to the control system that controls the audio devices of environment 4800c, and/or the like.
According to this example, the control system (which may be the control system of the handheld device 4845, the control system of a smart audio device of the environment 4800c, or a control system that controls the audio devices of the environment 4800c) is configured to determine the orientations of lines 4813c and 4850 from inertial sensor data (e.g., from gyroscope data). In this example, line 4813c is parallel to the y' axis and may be used to determine the angular orientation of the listener. According to some examples, the control system may determine an appropriate rotation of the audio device coordinates about the origin of listener coordinate system 4820 from the angle α between audio device 2 and the viewing direction of listener 4805.
Fig. 48D shows one example of determining an appropriate rotation of the audio device coordinates according to the method described with reference to fig. 48C. In this example, the origin of audio device coordinate system 4807 is co-located with the origin of listener coordinate system 4820. After determining the listener position, it is possible to co-locate the origins of audio device coordinate system 4807 and listener coordinate system 4820. Co-locating the origins of audio device coordinate system 4807 and listener coordinate system 4820 may involve transforming the audio device locations from audio device coordinate system 4807 to listener coordinate system 4820. The angle α has been determined as described above with reference to fig. 48C. Thus, angle α corresponds to the desired orientation of audio device 2 in listener coordinate system 4820. In this example, the angle β corresponds to the orientation of the audio device 2 in the audio device coordinate system 4807. The angle θ, which in this example is β - α, indicates the rotation necessary to align the y-axis of audio device coordinate system 4807 with the y' axis of listener coordinate system 4820.
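By way of illustration only, the following Python sketch shows one way the rotation just described might be implemented. The function name, the two-dimensional coordinates, and the example values are assumptions for illustration rather than part of any particular embodiment.

```python
import numpy as np

def rotate_to_listener_frame(device_xy, listener_origin, alpha, beta):
    """Rotate audio device coordinates about the listener origin.

    device_xy:       (N, 2) array of device positions in the audio device
                     coordinate system.
    listener_origin: (2,) listener position in the same coordinate system.
    alpha:           orientation of a reference device in the listener
                     coordinate system (radians).
    beta:            orientation of that reference device in the audio
                     device coordinate system (radians).
    """
    theta = beta - alpha  # rotation that aligns the y axis with the y' axis
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s],
                    [s,  c]])
    # Translate so the origins coincide, then rotate about the listener origin.
    return (device_xy - listener_origin) @ rot.T

# Example: two devices, listener at (1.0, 0.5), reference device seen at
# beta = 60 degrees in the device frame and alpha = 20 degrees in the
# listener frame.
devices = np.array([[0.0, 0.0], [2.0, 1.0]])
print(rotate_to_listener_frame(devices, np.array([1.0, 0.5]),
                               np.deg2rad(20.0), np.deg2rad(60.0)))
```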
DOA robustness measurement
As described above with reference to fig. 44, in some examples using a "blind" approach (including steered response power (SRP), beamforming, or other similar approaches) applied to arbitrary signals, robustness measures may be added to improve accuracy and stability. Some embodiments include time integration of the beamformer's steered response to filter out transients, detect only sustained peaks, and average out random errors and fluctuations in those sustained DOAs. Other examples may use only a limited frequency band as input, which can be tuned to the room or signal type for better performance.
For example, when using a "supervised" approach that involves generating impulse responses from structured source signals and deconvolution methods, pre-processing measures can be implemented to improve the accuracy and prominence of DOA peaks. In some examples, such preprocessing may include truncation with an amplitude window of a certain time width starting at the onset of the impulse response on each microphone channel. Such examples may include an impulse response onset detector, so that each channel's onset can be found independently.
In some examples based on the "blind" or "supervised" approach described above, further processing may be added to improve DOA accuracy. It is important to note that DOA selection based on peak detection (e.g., during steered response power (SRP) or impulse response analysis) is sensitive to the acoustics of the environment: reflections and device occlusions, which suppress the received and transmitted energy, may result in capturing non-primary-path signals. These events can reduce the accuracy of the device-to-device DOAs and introduce errors into the optimizer's positioning solution. It is therefore prudent to consider all peaks within a predetermined threshold as candidates for the ground-truth DOA. One example of a predetermined threshold is a requirement that a peak be greater than the average steered response power (SRP). Thresholding, and removing candidates below the average signal level, has proven to be a simple and effective initial filtering technique for all detected peaks. In addition to thresholding based on power alone, peak prominence may be used; as used herein, "prominence" is a measure of how large a local peak is compared to its neighboring local minima. One example of a prominence threshold is a requirement that the power difference between a peak and its adjacent local minima be at or above a threshold. Retaining viable candidates increases the chance that a device pair's set will contain a usable DOA (within acceptable error margins of the ground truth), although it is possible that no usable DOA will be present if the signal is corrupted by strong reflection or occlusion. In some examples, a selection algorithm may be implemented to do one of the following: 1) select the best available DOA candidate for each device pair; 2) determine that none of the candidates is usable, thereby nullifying that pair's contribution to the optimization via the cost function weighting matrix; or 3) select the best inferred candidate, but apply a non-binary weighting to that DOA's contribution when it is difficult to disambiguate the amount of error carried by the best candidate.
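The following Python sketch illustrates, under stated assumptions, one way such power and prominence thresholding of SRP peaks might be implemented. The function name, the decibel-domain prominence threshold, and the use of scipy.signal.find_peaks are illustrative choices, not requirements of any embodiment.

```python
import numpy as np
from scipy.signal import find_peaks

def doa_candidates(srp, azimuths, prominence_db=3.0):
    """Return DOA candidates from a steered response power (SRP) curve.

    srp:      1-D array of SRP values, one per steering azimuth.
    azimuths: 1-D array of steering azimuths in degrees.
    Peaks must exceed the mean SRP and have a prominence of at least
    `prominence_db` relative to their neighboring local minima.
    """
    srp_db = 10.0 * np.log10(np.maximum(srp, 1e-12))
    mean_db = 10.0 * np.log10(np.maximum(np.mean(srp), 1e-12))
    peaks, props = find_peaks(srp_db, height=mean_db, prominence=prominence_db)
    # Rank candidates by peak power, strongest first.
    order = np.argsort(props["peak_heights"])[::-1]
    return [(azimuths[p], srp_db[p]) for p in peaks[order]]
```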
After an initial optimization using the best inferred candidates, in some examples, the positioning solution may be used to calculate the residual cost contribution of each DOA. Outlier analysis of the residual costs may provide evidence of which DOA pairs most influence the positioning solution, with extreme outliers marking those DOAs as potentially incorrect or suboptimal. Candidate processing may then involve recursively re-running the optimization, with the outlying DOA pairs replaced by remaining candidates and/or with weights applied to those device pairs' contributions, according to one of the three options described above. This is an example of a feedback process, such as described above with reference to figs. 44-47. According to some embodiments, the optimization and processing decisions may be repeated until all detected candidates have been evaluated and the residual cost contributions of the selected DOAs are balanced.
A disadvantage of candidate selection based on optimizer evaluation is that it is computationally intensive and sensitive to the candidate traversal order. An alternative technique with a lower computational cost involves determining all permutations of the candidates in the set and running a triangle alignment method to locate the devices. Related triangle alignment methods are disclosed in U.S. provisional patent application No.62/992,068, entitled "Audio Device Auto-Location," filed on 3/19/2020, which is hereby incorporated by reference for all purposes. The positioning results may then be evaluated by calculating the resulting total and residual costs for the DOA candidates used in the triangulation. Decision logic that analyzes these metrics may be used to determine the best candidates and their respective weights to provide to the nonlinear optimization problem. In cases where the candidate lists are large, resulting in a high permutation count, filtering and intelligent traversal of the permutation list may be applied.
TOA robustness measurement
As described above with reference to fig. 46, using multiple candidate TOA solutions increases robustness, compared to systems using a single or minimum TOA value, and ensures that errors have minimal impact on finding the optimal speaker layout. After the impulse responses of the system are obtained, in some examples, each of the TOA matrix elements may be recovered by searching for the peak corresponding to the direct sound. Under ideal conditions (e.g., no noise, no obstructions in the direct path between the source and the receiver, and the speaker pointing directly at the microphone), this peak can easily be identified as the largest peak in the impulse response. However, in the presence of noise, obstructions, or misalignment of the speaker and microphone, the peak corresponding to the direct sound does not necessarily correspond to the maximum value. Furthermore, in such cases, the peak corresponding to the direct sound may be difficult to isolate from other reflections and/or noise. In some cases, direct sound identification can be a challenging process. Incorrect identification of the direct sound may degrade (and in some cases may completely break) the automatic localization process. Therefore, in cases where there may be errors in the direct sound identification processing, it can be effective to consider multiple candidates for the direct sound. In some such cases, the peak selection process may include two parts: (1) a direct sound search algorithm that finds suitable peak candidates, and (2) a peak candidate evaluation process to increase the probability of choosing the correct TOA matrix element.
In some implementations, the process of searching for direct sound candidate peaks may include a method of identifying relevant candidates for the direct sound. Some such methods may be based on the following steps: (1) identifying a first reference peak (e.g., the maximum of the absolute value of the impulse response (IR)), the "first peak"; (2) evaluating the noise level around (before and after) the first peak; (3) searching for alternative peaks above the noise level before (and in some cases after) the first peak; (4) ranking the peaks found according to their probability of corresponding to the correct TOA; and, optionally, (5) grouping peaks that are close together (to reduce the number of candidates). A sketch of such a search is shown below.
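The sketch below is a minimal, assumption-laden illustration of steps (1)-(5). The noise estimate (a median over a window before the first peak), the 4x-noise search threshold, and the grouping window are arbitrary illustrative choices rather than values prescribed by any embodiment.

```python
import numpy as np

def direct_sound_candidates(ir, fs, search_ms=20.0, group_ms=0.5):
    """Find candidate direct-sound peaks in an impulse response (IR).

    Steps: find the largest absolute peak, estimate the surrounding noise
    level, search for alternative peaks above that level before the first
    peak, rank them, and group peaks that are very close together.
    Returns a list of (time_seconds, amplitude_relative_to_first_peak).
    """
    mag = np.abs(np.asarray(ir, dtype=float))
    first = int(np.argmax(mag))                       # step 1: reference peak
    lo = max(0, first - int(search_ms * 1e-3 * fs))
    # Step 2: local noise level estimated before the first peak.
    noise = float(np.median(mag[lo:first])) if first > lo else 0.0
    # Step 3: earlier samples rising well above the noise floor.
    idx = lo + np.nonzero(mag[lo:first + 1] > 4.0 * noise + 1e-12)[0]
    # Step 4: rank by amplitude, strongest first.
    ranked = sorted(idx, key=lambda i: mag[i], reverse=True)
    # Step 5: group candidates closer together than `group_ms` milliseconds.
    grouped, min_gap = [], int(group_ms * 1e-3 * fs)
    for i in ranked:
        if all(abs(i - j) > min_gap for j in grouped):
            grouped.append(i)
    return [(i / fs, mag[i] / mag[first]) for i in grouped]
```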
Once the direct sound candidate peaks have been identified, some embodiments may involve a multimodal evaluation step. As a result of the direct sound candidate peak search, in some examples, each TOA matrix element will have one or more candidate values, ranked according to their estimated probabilities. Multiple TOA matrices may be formed by selecting among the different candidate values. To evaluate the likelihood of a given TOA matrix, a minimization process (such as the minimization process described above) may be implemented. This process can generate a minimized residual that is a good estimate of the internal coherence of the TOA and DOA matrices. A perfect, noiseless TOA matrix will result in a residual of zero, whereas a TOA matrix with incorrect matrix elements will result in a large residual. In some embodiments, the method will find the set of candidate TOA matrix elements that creates the TOA matrix with the smallest residual. This is one example of the evaluation process described above with reference to figs. 46 and 47, which may involve a result evaluation block 4750. In one example, the evaluation process may involve the following steps: (1) selecting an initial TOA matrix; (2) evaluating the initial matrix using the residual of the minimization; (3) changing one matrix element of the TOA matrix using the TOA candidate list; (4) re-evaluating the changed matrix using the residual of the minimization process; (5) accepting the change if the residual is smaller, and rejecting it otherwise; and (6) iterating steps 3 to 5. In some examples, the evaluation process may stop when all TOA candidates have been evaluated or a predefined maximum number of iterations has been reached. The sketch below illustrates one such greedy evaluation loop.
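A minimal Python sketch of the greedy evaluation loop (steps 1-6) follows. It assumes a caller-supplied localization_residual function that runs the minimization described above and returns its residual; that function, and the dictionary-based representation of the TOA matrix, are illustrative assumptions.

```python
def select_toa_matrix(candidates, localization_residual, max_iter=100):
    """Greedy selection of TOA matrix elements from ranked candidate lists.

    candidates:            dict mapping (i, j) -> list of candidate TOAs,
                           ordered from most to least probable.
    localization_residual: callable that runs the localization minimization
                           for a TOA matrix (given as a dict of selected
                           values) and returns its residual.
    """
    # Step 1: start from the most probable candidate for every element.
    selection = {key: vals[0] for key, vals in candidates.items()}
    best = localization_residual(selection)           # step 2
    for _ in range(max_iter):                          # step 6: iterate
        improved = False
        for key, vals in candidates.items():
            for v in vals[1:]:
                trial = dict(selection)
                trial[key] = v                         # step 3: change one element
                r = localization_residual(trial)       # step 4: re-evaluate
                if r < best:                           # step 5: keep if better
                    selection, best, improved = trial, r, True
        if not improved:
            break
    return selection, best
```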
Positioning method example
Fig. 49 is a flowchart outlining another example of a positioning method. As with other methods described herein, the blocks of method 4900 are not necessarily performed in the order indicated. Furthermore, such methods may include more or fewer blocks than shown and/or described. In this embodiment, the method 4900 involves estimating the locations and orientations of audio devices in an environment. The blocks of method 4900 may be performed by one or more devices, which may be (or may include) the apparatus 150 shown in fig. 1B.
In this example, block 4905 involves obtaining, by a control system, direction of arrival (DOA) data corresponding to sound emitted by at least a first intelligent audio device of an audio environment. The control system may be, for example, control system 160 described above with reference to fig. 1B. According to this example, the first smart audio device includes a first audio transmitter and a first audio receiver, and the DOA data corresponds to sound received by at least a second smart audio device of the audio environment. Here, the second smart audio device includes a second audio transmitter and a second audio receiver. In this example, the DOA data also corresponds to sound emitted by at least the second smart audio device and received by at least the first smart audio device. In some examples, the first and second smart audio devices may be two of the audio devices 4105a-4105d shown in fig. 41.
The DOA data may be obtained in a variety of ways, depending on the particular implementation. In some cases, determining the DOA data may involve one or more of the DOA-related methods described above with reference to fig. 44 and/or in the "DOA robustness measurement" section. Some embodiments may involve obtaining, by a control system, one or more elements of the DOA data using a beamforming method, a steered response power method, a time difference of arrival method, and/or a structured signal method.
According to this example, block 4910 relates to receiving, by the control system, configuration parameters. In this embodiment, the configuration parameters correspond to the audio environment itself, to one or more audio devices of the audio environment, or to both the audio environment and the one or more audio devices of the audio environment. According to some examples, the configuration parameters may indicate a number of audio devices in the audio environment, one or more dimensions of the audio environment, one or more constraints on audio device location or orientation, and/or disambiguation data for at least one of rotation, translation, or scaling. In some examples, the configuration parameters may include playback delay data, recording delay data, and/or data for disambiguating delay symmetry.
In this example, block 4915 relates to minimizing, by the control system, a cost function based at least in part on the DOA data and the configuration parameters to estimate a location and an orientation of at least the first intelligent audio device and the second intelligent audio device.
According to some examples, the DOA data may also correspond to sounds made by third through nth intelligent audio devices of the audio environment, where N corresponds to a total number of intelligent audio devices of the audio environment. In such examples, the DOA data may also correspond to sound received by each of the first through nth intelligent audio devices from all other intelligent audio devices of the audio environment. In this case, minimizing the cost function may involve estimating the position and orientation of the third through nth intelligent audio devices.
In some examples, the DOA data may also correspond to sound received by one or more passive audio receivers of the audio environment. Each of the one or more passive audio receivers may include a microphone array, but may be devoid of an audio transmitter. Minimizing the cost function may also provide an estimated location and orientation of each of the one or more passive audio receivers. According to some examples, the DOA data may also correspond to sounds emitted by one or more audio emitters of the audio environment. Each of the one or more audio transmitters may include at least one sound emitting transducer but may be devoid of a microphone array. Minimizing the cost function may also provide an estimated location of each of the one or more audio transmitters.
In some examples, method 4900 may involve receiving, by the control system, a seed layout of the cost function. For example, the seed layout may specify the correct number of audio transmitters and receivers in the audio environment and any location and orientation of each audio transmitter and receiver in the audio environment.
According to some examples, method 4900 may involve receiving, by a control system, a weight factor associated with one or more elements of the DOA data. The weighting factors may, for example, indicate availability and/or reliability of one or more elements of the DOA data.
In some examples, the method 4900 may involve receiving, by the control system, time of arrival (TOA) data corresponding to sound emitted by at least one audio device of the audio environment and received by at least one other audio device of the audio environment. In some such examples, the cost function may be based at least in part on TOA data. Some such embodiments may involve estimating at least one playback delay and/or at least one recording delay. According to some such examples, the cost function may operate with a rescaled location, a rescaled latency, and/or a rescaled arrival time.
In some examples, the cost function may include a first term that depends only on the DOA data and a second term that depends only on the TOA data. In some such examples, the first term may include a first weight factor and the second term may include a second weight factor. According to some such examples, one or more TOA elements of the second term may have TOA element weight factors indicating the availability or reliability of each of the one or more TOA elements.
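As an illustration of how such a two-term cost function might be assembled, the following Python sketch combines a DOA term and a TOA term with scalar weights and minimizes over two-dimensional device positions. It is a simplified sketch: device orientations, playback/recording delays, and per-element weight factors are omitted, and the DOA observations are assumed to be expressed in a common reference frame.

```python
import numpy as np
from scipy.optimize import minimize

def combined_cost(x, doa_obs, toa_obs, w_doa=1.0, w_toa=1.0, c=343.0):
    """Weighted cost with a DOA-only term and a TOA-only term.

    x:       flat array [x0, y0, x1, y1, ...] of device positions.
    doa_obs: dict (i, j) -> observed azimuth of device j as seen from i (rad).
    toa_obs: dict (i, j) -> observed time of arrival from j to i (s).
    """
    p = x.reshape(-1, 2)
    cost = 0.0
    for (i, j), az in doa_obs.items():
        pred = np.arctan2(p[j, 1] - p[i, 1], p[j, 0] - p[i, 0])
        err = np.angle(np.exp(1j * (pred - az)))   # wrap to (-pi, pi]
        cost += w_doa * err ** 2
    for (i, j), t in toa_obs.items():
        pred = np.linalg.norm(p[j] - p[i]) / c     # propagation time at speed c
        cost += w_toa * (pred - t) ** 2
    return cost

# Usage sketch: seed with a rough layout and refine.
# x0 = np.zeros(2 * n_devices)
# res = minimize(combined_cost, x0, args=(doa_obs, toa_obs), method="BFGS")
```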
Fig. 50 is a flowchart outlining another example of a positioning method. As with other methods described herein, the blocks of method 5000 are not necessarily performed in the order indicated. Furthermore, such methods may include more or fewer blocks than shown and/or described. In this embodiment, method 5000 involves estimating the locations and orientations of devices in an environment. The blocks of method 5000 may be performed by one or more devices, which may be (or may include) the apparatus 150 shown in fig. 1B.
In this example, block 5005 involves obtaining, by a control system, direction of arrival (DOA) data corresponding to transmissions of a first transceiver of at least a first device of an environment. The control system may be, for example, control system 160 described above with reference to fig. 1B. According to this example, the first transceiver includes a first transmitter and a first receiver, and the DOA data corresponds to transmissions received by a second transceiver of at least a second device of the environment, the second transceiver also including a second transmitter and a second receiver. In this example, the DOA data also corresponds to transmissions received by at least the first transceiver from at least the second transceiver. According to some examples, the first transceiver and the second transceiver may be configured to transmit and receive electromagnetic waves. In some examples, the first and second devices may be two of the audio devices 4105a-4105d shown in fig. 41.
The DOA data may be obtained in a variety of ways, depending on the particular implementation. In some cases, determining the DOA data may involve one or more of the DOA-related methods described above with reference to fig. 44 and/or in the "DOA robustness measurement" section. Some embodiments may involve obtaining, by the control system, one or more elements of the DOA data using a beamforming method, a steered response power method, a time difference of arrival method, and/or a structured signal method. According to some examples, determining the DOA data may involve using acoustic calibration signals, e.g., according to one or more methods disclosed herein. As disclosed in more detail elsewhere herein, some such methods may involve orchestrating acoustic calibration signals played back by multiple audio devices in an audio environment.
According to this example, block 5010 relates to receiving, by the control system, configuration parameters. In this embodiment, the configuration parameters correspond to the environment itself, to one or more devices of the environment, or to both the environment and the one or more devices of the environment. According to some examples, the configuration parameters may indicate a number of audio devices in the environment, one or more dimensions of the environment, one or more constraints on device location or orientation, and/or disambiguation data for at least one of rotation, translation, or scaling. In some examples, the configuration parameters may include playback delay data, recording delay data, and/or data for disambiguating delay symmetry.
In this example, block 5015 relates to minimizing, by the control system, a cost function based at least in part on the DOA data and the configuration parameters to estimate a location and an orientation of at least the first device and the second device.
According to some embodiments, the DOA data may also correspond to transmissions transmitted by third through nth transceivers of third through nth devices of the environment, where N corresponds to a total number of transceivers of the environment and where the DOA data also corresponds to transmissions received by each of the first through nth transceivers from all other transceivers of the environment. In some such embodiments, minimizing the cost function may also involve estimating the position and orientation of the third through nth transceivers.
In some examples, the first device and the second device may be smart audio devices and the environment may be an audio environment. In some such examples, the first transmitter and the second transmitter may be audio transmitters. In some such examples, the first receiver and the second receiver may be audio receivers. According to some such examples, the DOA data may also correspond to sounds made by third through nth intelligent audio devices of the audio environment, where N corresponds to a total number of intelligent audio devices of the audio environment. In such examples, the DOA data may also correspond to sound received by each of the first through nth intelligent audio devices from all other intelligent audio devices of the audio environment. In this case, minimizing the cost function may involve estimating the position and orientation of the third through nth intelligent audio devices. Alternatively or additionally, in some examples, the DOA data may correspond to electromagnetic waves transmitted and received by devices in the environment.
In some examples, the DOA data may also correspond to sound received by one or more passive receivers of the environment. Each of the one or more passive receivers may include an array of receivers, but may have no transmitter. Minimizing the cost function may also provide an estimated location and orientation of each of the one or more passive receivers. According to some examples, the DOA data may also correspond to transmissions from one or more transmitters of the environment. In some such examples, each of the one or more transmitters may be devoid of an array of receivers. Minimizing the cost function may also provide an estimated location for each of the one or more transmitters.
In some examples, method 5000 may involve receiving, by the control system, a seed layout of the cost function. For example, the seed layout may specify the correct number of transmitters and receivers in the audio environment and any location and orientation of each transmitter and receiver in the audio environment.
According to some examples, method 5000 may involve receiving, by a control system, weight factors associated with one or more elements of the DOA data. The weighting factors may, for example, indicate availability and/or reliability of one or more elements of the DOA data.
In some examples, the method 5000 may involve receiving, by the control system, time of arrival (TOA) data corresponding to sound emitted by at least one audio device of the audio environment and received by at least one other audio device of the audio environment. In some such examples, the cost function may be based at least in part on TOA data. According to some examples, determining TOA data may involve using acoustic calibration signals, for example, according to one or more methods disclosed herein. As disclosed in more detail elsewhere herein, some such methods may involve orchestrating acoustic calibration signals played back by multiple audio devices in an audio environment. Some such embodiments may involve estimating at least one playback delay and/or at least one recording delay. According to some such examples, the cost function may operate with a rescaled location, a rescaled latency, and/or a rescaled arrival time.
In some examples, the cost function may include a first term that depends only on the DOA data and a second term that depends only on the TOA data. In some such examples, the first term may include a first weight factor and the second term may include a second weight factor. According to some such examples, one or more TOA elements of the second term may have TOA element weight factors indicating the availability or reliability of each of the one or more TOA elements.
Fig. 51 depicts a plan view of another listening environment, which in this example is a living space. As with the other figures provided herein, the types, amounts, and arrangements of elements shown in fig. 51 are provided by way of example only. Other embodiments may include more, fewer, and/or different types, numbers, and/or arrangements of elements. In other examples, the audio environment may be another type of environment, such as an office environment, a vehicle environment, a park or other outdoor environment, or the like. Some detailed examples relating to the vehicle environment are described below.
According to this example, the audio environment 5100 includes a living room 5110 at the upper left, a kitchen 5115 at the lower center, and a bedroom 5122 at the lower right. In the example of fig. 51, the boxes and circles distributed throughout the living space represent a set of loudspeakers 5105a, 5105b, 5105c, 5105d, 5105e, 5105f, 5105g, and 5105h, at least some of which may be intelligent speakers in some embodiments. In this example, the loudspeakers 5105a-5105h have been placed in locations that are convenient for the living space, but the loudspeakers 5105a-5105h are not located in positions corresponding to any standard "canonical" loudspeaker layout (such as Dolby 5.1, Dolby 7.1, etc.). In some examples, the loudspeakers 5105a-5105h may be used to implement one or more of the disclosed embodiments.
Flexible rendering is a technique that renders spatial audio over any number of arbitrarily placed loudspeakers, such as the loudspeakers represented in fig. 51. With the widespread deployment of intelligent audio devices (e.g., intelligent speakers) in the home, as well as other audio devices that are not positioned according to any standard "canonical" loudspeaker layout, it may be advantageous to achieve flexible rendering of audio data and playback of audio data so rendered.
Several techniques have been developed to enable flexible rendering, including Center of Mass Amplitude Panning (CMAP) and Flexible Virtualization (FV). Both techniques treat the rendering problem as one of minimizing a cost function, where the cost function includes at least a first term that models the desired spatial impression the renderer is attempting to achieve and a second term that assigns a cost to activating speakers. A detailed example of CMAP, FV and combinations thereof is described in International Publication No. WO 2021/021707 A1, entitled "MANAGING PLAYBACK OF MULTIPLE STREAMS OF AUDIO OVER MULTIPLE SPEAKERS," published on 2/4/2021, at page 25, line 8 to page 31, line 27, which is hereby incorporated by reference.
However, the methods disclosed herein that relate to flexible rendering are not limited to CMAP- and/or FV-based flexible rendering. Such methods may be implemented with any suitable type of flexible rendering, such as vector base amplitude panning (VBAP). A relevant VBAP method is disclosed in Pulkki, Ville, "Virtual Sound Source Positioning Using Vector Base Amplitude Panning," J. Audio Eng. Soc., Vol. 45, No. 6 (June 1997), which is hereby incorporated by reference. Other suitable types of flexible rendering include, but are not limited to, double-balanced panning and Ambisonics-based flexible rendering methods, such as the methods described in D. Arteaga, "An Ambisonics Decoder for Irregular 3-D Loudspeaker Arrays," AES Convention Paper 8918 (May 2013), which is hereby incorporated by reference.
In some cases, flexible rendering may be performed with respect to a coordinate system, such as audio environment coordinate system 5117 shown in fig. 51. According to this example, the audio environment coordinate system 5117 is a two-dimensional cartesian coordinate system. In this example, the origin of the audio environment coordinate system 5117 is within the loudspeaker 5105a and the x-axis corresponds to the long axis of the loudspeaker 5105 a. In other implementations, the audio environment coordinate system 5117 may be a three-dimensional coordinate system, which may or may not be a cartesian coordinate system.
Furthermore, the origin of the coordinate system is not necessarily associated with a loudspeaker or a loudspeaker system. In some implementations, the origin of the coordinate system can be at another location of the audio environment 5100. An alternative location of the audio environment coordinate system 5117' provides one such example. In this example, the origin of the alternate audio environment coordinate system 5117' has been selected such that the values of x and y are positive for all locations within the audio environment 5100. In some cases, the origin and orientation of the coordinate system may be selected to correspond to the location and orientation of the head of a person within the audio environment 5100. In some such embodiments, the viewing direction of the person may be along an axis of the coordinate system (e.g., along a positive y-axis).
In some implementations, the control system may control the flexible rendering process based at least in part on the location (and, in some examples, the orientation) of each participating loudspeaker (e.g., each active loudspeaker and/or each loudspeaker for which audio data is to be rendered) in the audio environment. According to some such embodiments, the control system may have predetermined the location (and in some examples, the orientation) of each participating loudspeaker according to a coordinate system, such as the audio environment coordinate system 5117, and may have stored corresponding loudspeaker position data in the data structure. Methods for determining the location of an audio device are disclosed herein.
According to some such embodiments, a control system of an orchestration device (which, in some cases, may be one of the loudspeakers 5105a-5105h) may render audio data such that a particular element or area of the audio environment 5100 (such as the television 5130) represents the front and center of the audio environment. Such an implementation may be advantageous for some use cases, such as audio playback of a movie, television program, or other content being displayed on the television 5130.
However, for other use cases, such as playback of music not associated with content displayed on television 5130, such a rendering method may not be optimal. In such an alternative, it may be desirable to render the audio data for playback such that the front and center of the rendered sound field corresponds to the position and orientation of the person within the audio environment 5100.
For example, referring to person 5120a, it may be desirable to render audio data for playback such that the front and center of the rendered sound field corresponds to the viewing direction of person 5120a indicated by the direction of arrow 5123a from the location of person 5120 a. In this example, the location of person 5120a is indicated by a point 5121a at the center of the head of person 5120 a. In some examples, a "sweet spot" of audio data rendered for playback to the person 5120a may correspond to the spot 5121a. Some methods for determining the position and orientation of a person in an audio environment are described below. In some such examples, the position and orientation of the person may be determined based on the position and orientation of a piece of furniture, such as the position and orientation of chair 5125.
According to this example, the locations of persons 5120b and 5120c are represented by points 5121b and 5121c, respectively. Here, the front faces of the persons 5120b and 5120c are represented by arrows 5123b and 5123c, respectively. The locations of the points 5121a, 5121b, and 5121c and the orientations of the arrows 5123a, 5123b, and 5123c may be determined relative to a coordinate system (such as the audio environment coordinate system 5117). As described above, in some examples, the origin and orientation of the coordinate system may be selected to correspond to the location and orientation of the head of a person within the audio environment 5100.
In some examples, the "sweet spot" of audio data rendered for playback to the person 5120b may correspond to point 5121b. Similarly, the "sweet spot" of audio data rendered for playback to the person 5120c may correspond to point 5121c. It can be observed that if the "sweet spot" of audio data rendered for playback to the person 5120a corresponds to point 5121a, that sweet spot will correspond to neither point 5121b nor point 5121c.
Furthermore, the front and center areas of the sound field rendered for person 5120b should ideally correspond to the direction of arrow 5123 b. Likewise, the front and center regions of the sound field rendered for person 5120c should ideally correspond to the direction of arrow 5123 c. It can be observed that the front and center areas are different with respect to the persons 5120a, 5120b, and 5120 c. Thus, audio data rendered via the previously disclosed method and according to the position and orientation of any of these persons will not be optimal for the position and orientation of the other two persons.
However, the various disclosed embodiments are capable of satisfactorily rendering audio data for multiple sweet spots and, in some cases, for multiple orientations. Some such methods involve creating two or more different spatial renderings of the same audio content for different listening configurations, over a common set of loudspeakers, and combining the different spatial renderings by multiplexing the renderings across frequency. In some such examples, the frequency spectrum corresponding to the range of human hearing (e.g., 20 Hz to 20,000 Hz) may be divided into multiple frequency bands. According to some such examples, each of the different spatial renderings is played back via a different set of frequency bands. In some such examples, the rendered audio data corresponding to each set of frequency bands may be combined into a single set of output loudspeaker feed signals. The result may provide spatial audio for each of multiple listening locations and, in some cases, for each of multiple orientations.
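The following Python sketch illustrates, purely as an example, frequency-domain multiplexing of two renderings of the same content: two M-channel loudspeaker feeds, rendered for two different listening configurations, are combined by taking one rendering in a chosen set of frequency bands and the other rendering elsewhere. A practical system would more likely use a perceptually motivated filter bank; the single-FFT band mask here is an illustrative simplification.

```python
import numpy as np

def combine_renderings_fft(feeds_a, feeds_b, fs, bands_for_b):
    """Multiplex two M-channel speaker feeds across frequency bands.

    feeds_a, feeds_b: arrays of shape (n_samples, n_speakers) containing the
                      same content rendered for two listening configurations.
    bands_for_b:      list of (f_lo, f_hi) tuples in Hz; within these bands
                      rendering B is used, elsewhere rendering A.
    """
    n = feeds_a.shape[0]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    use_b = np.zeros_like(freqs, dtype=bool)
    for f_lo, f_hi in bands_for_b:
        use_b |= (freqs >= f_lo) & (freqs < f_hi)
    spec_a = np.fft.rfft(feeds_a, axis=0)
    spec_b = np.fft.rfft(feeds_b, axis=0)
    spec = np.where(use_b[:, None], spec_b, spec_a)   # per-band selection
    return np.fft.irfft(spec, n=n, axis=0)
```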
In some implementations, the number of listeners and their locations (and, in some cases, their orientations) can be determined from data from one or more cameras in an audio environment (such as the audio environment 5100 of fig. 51). In this example, the audio environment 5100 includes cameras 5111a-5111e distributed throughout the environment. In some implementations, one or more intelligent audio devices in the audio environment 5100 may also include one or more cameras. The one or more intelligent audio devices may be single-purpose audio devices or virtual assistants. In some such examples, one or more cameras of the optional sensor system 180 (see fig. 1B) may reside in or on the television 5130, in a mobile phone, or in a smart speaker, such as one or more of the loudspeakers 5105b, 5105d, 5105e, or 5105h. Although the cameras 5111a-5111e are not shown in every depiction of an audio environment presented in this disclosure, each audio environment may nevertheless include one or more cameras in some implementations.
One of the practical considerations in achieving flexible rendering (according to some embodiments) is complexity. In some cases, it may not be feasible to perform accurate rendering for each band of each audio object in real time, given the processing capabilities of a particular device. One challenge is that the positions of at least some of the audio objects to be rendered (which may be indicated by metadata in some cases) may change multiple times per second. For some disclosed embodiments, the complexity may increase further because rendering may be performed for each of a plurality of listening configurations.
Another approach to reducing complexity at the cost of memory is to use one or more look-up tables (or other such data structures) that include samples of all possible object locations in three-dimensional space (e.g., samples of speaker activation). Depending on the particular implementation, the samples may or may not be the same in all dimensions. In some such examples, one such data structure may be created for each of a plurality of listening configurations. Alternatively or additionally, a single data structure may be created by a sum of multiple data structures, each of which may correspond to a different one of the multiple listening configurations.
Fig. 52 is a diagram indicating points of speaker activation in an example embodiment. In this example, the x and y dimensions are sampled using 15 points and the z dimension is sampled using 5 points. According to this example, each point represents M speaker activations, one speaker for each of the M speakers in the audio environment. In some examples, the speaker activation may be a gain or complex value for each of the N frequency bands associated with the filter bank analysis. In some examples, one such data structure may be created for a single listening configuration. According to some such examples, one such data structure may be created for each of a plurality of listening configurations. In some such examples, a single data structure may be created by multiplexing data structures associated with multiple listening configurations across multiple frequency bands (such as the N frequency bands mentioned above). In other words, for each band of the data structure, an activation from one of a plurality of listening configurations may be selected. Once this single, multiplexed data structure is created, it can be associated with a single instance of a renderer to achieve equivalent functions to a multi-renderer implementation, such as those described below with reference to fig. 54 and 55. According to some examples, the points shown in fig. 52 may correspond to speaker activation values for a single data structure created by multiplexing multiple data structures, each data structure corresponding to a different listening configuration.
Other embodiments may include more or fewer samples. For example, in some embodiments, the spatial sampling of speaker activations may be non-uniform. Some embodiments may involve more or fewer speaker activation samples in the x, y plane than shown in fig. 52. Some such embodiments may determine speaker activation samples in only one x, y plane. According to this example, each point represents M speaker activations computed by CMAP, FV, VBAP, or another flexible rendering method. In some implementations, a set of speaker activations such as that shown in fig. 52 may be stored in a data structure, which may be referred to herein as a "table" (or a "Cartesian table," as indicated in fig. 52).
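The sketch below illustrates one possible way to build such a multiplexed table: per-band speaker activations computed separately for several listening configurations are combined by selecting, for each frequency band, the activations of one configuration. The array shapes and the band-assignment scheme are assumptions for illustration.

```python
import numpy as np

def multiplex_band_gains(gains_per_config, band_assignment):
    """Combine per-band speaker gains from several listening configurations.

    gains_per_config: array of shape (n_configs, n_bands, n_speakers),
                      speaker activations computed separately for each
                      listening configuration.
    band_assignment:  length-n_bands array; band_assignment[b] is the index
                      of the configuration whose rendering is used in band b.
    Returns an (n_bands, n_speakers) array of multiplexed activations that
    can drive a single renderer instance.
    """
    n_configs, n_bands, n_speakers = gains_per_config.shape
    out = np.empty((n_bands, n_speakers))
    for b in range(n_bands):
        out[b] = gains_per_config[band_assignment[b], b]
    return out

# Example: alternate odd and even bands between two listening configurations.
# assignment = np.arange(n_bands) % 2
```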
The desired rendering location does not necessarily correspond to the location where the speaker activation has been calculated. In operation, some form of interpolation may be implemented in order to determine the actual activation of each speaker. In some such examples, tri-linear interpolation between speaker activations of 8 points closest to the desired rendering location may be used.
Fig. 53 is a diagram indicating tri-linear interpolation between points of speaker activation according to one example. According to this example, the solid circles 5303 at or near the vertices of the rectangular prism shown in fig. 53 correspond to the locations of the 8 points closest to the desired rendering location for which speaker activations have been calculated. In this case, the desired rendering location is a point within the rectangular prism shown in fig. 53. In this example, the process of successive linear interpolations includes interpolating between points in the top plane to determine first and second interpolation points 5305a and 5305b, interpolating between points in the bottom plane to determine third and fourth interpolation points 5310a and 5310b, interpolating between the first and second interpolation points 5305a and 5305b to determine a fifth interpolation point 5315 in the top plane, interpolating between the third and fourth interpolation points 5310a and 5310b to determine a sixth interpolation point 5320 in the bottom plane, and interpolating between the fifth and sixth interpolation points 5315 and 5320 to determine a seventh interpolation point 5325 between the top plane and the bottom plane.
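A minimal Python sketch of this tri-linear interpolation over a table of precomputed speaker activations follows; the table layout and the helper for locating the bracketing grid points are illustrative assumptions.

```python
import numpy as np

def trilinear_activations(table, grid_x, grid_y, grid_z, pos):
    """Trilinearly interpolate speaker activations at a rendering position.

    table:  array of shape (len(grid_x), len(grid_y), len(grid_z), M)
            holding precomputed activations for M speakers.
    grid_*: 1-D arrays of monotonically increasing sample coordinates.
    pos:    (x, y, z) desired rendering position.
    """
    def bracket(grid, v):
        # Index of the lower bracketing grid point and the fractional offset.
        i = int(np.clip(np.searchsorted(grid, v) - 1, 0, len(grid) - 2))
        t = (v - grid[i]) / (grid[i + 1] - grid[i])
        return i, float(np.clip(t, 0.0, 1.0))

    ix, tx = bracket(grid_x, pos[0])
    iy, ty = bracket(grid_y, pos[1])
    iz, tz = bracket(grid_z, pos[2])
    c = table[ix:ix + 2, iy:iy + 2, iz:iz + 2]        # the 8 surrounding points
    # Interpolate along x, then y, then z.
    c = c[0] * (1 - tx) + c[1] * tx
    c = c[0] * (1 - ty) + c[1] * ty
    return c[0] * (1 - tz) + c[1] * tz
```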
While tri-linear interpolation is one effective interpolation method, those skilled in the art will recognize that tri-linear interpolation is but one possible interpolation method that may be used to implement aspects of the present disclosure, and other examples may include other interpolation methods. For example, some embodiments may involve interpolation in more or fewer x, y planes than shown in fig. 52. Some such implementations may involve interpolation in only one x, y plane. In some implementations, the speaker activation of the desired rendering location will simply be set to the speaker activation of the location closest to the desired rendering location for which the speaker activation has been calculated.
Fig. 54 is a block diagram of a minimal version of another embodiment. N program streams (N ≥ 2) are depicted, the first of which is explicitly labeled as spatial; each program stream's corresponding set of audio signals is fed through a corresponding renderer, each renderer being individually configured for playback of its corresponding program stream over a common set of M arbitrarily spaced loudspeakers (M ≥ 2). A renderer may also be referred to herein as a "rendering module." The rendering modules and mixer 5430a may be implemented via software, hardware, firmware, or some combination thereof. In this example, the rendering modules and mixer 5430a are implemented via control system 160a, which is an example of the control system 160 described above with reference to fig. 1B. Each of the N renderers outputs a set of M loudspeaker feeds, which are summed over all N renderers for simultaneous playback over the M loudspeakers. According to this embodiment, information about the layout of the M loudspeakers within the listening environment is provided to all of the renderers, as indicated by the dashed lines fed back from the loudspeaker block, so that the renderers can be suitably configured for playback through the loudspeakers. The layout information may or may not be sent from one or more of the speakers themselves, depending on the particular implementation. According to some examples, the layout information may be provided by one or more intelligent speakers configured to determine the relative position of each of the M loudspeakers in the listening environment. Some such automatic positioning methods may be based on, for example, the direction of arrival (DOA) methods and/or time of arrival (TOA) methods disclosed herein. In other examples, the layout information may be determined by another device and/or entered by a user. In some examples, loudspeaker specification information regarding the capabilities of at least some of the M loudspeakers within the listening environment may be provided to all of the renderers. Such loudspeaker specification information may include impedance, frequency response, sensitivity, power rating, number and location of individual drivers, and the like. According to this example, information from the rendering of the one or more additional program streams is fed into the renderer of the primary spatial stream so that its rendering can be dynamically modified as a function of this information. This information flow is represented by the dashed lines from renderer blocks 2 through N back to renderer block 1.
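As a trivial illustration of the mixing stage, the sketch below sums the M-channel feeds produced by N renderers into a single set of loudspeaker feeds; the peak-normalization safeguard is an illustrative choice, not a feature of any particular embodiment.

```python
import numpy as np

def mix_renderers(rendered_feeds):
    """Sum the M-channel speaker feeds produced by N renderers.

    rendered_feeds: iterable of arrays, each of shape (n_samples, M),
                    one per program stream / renderer.
    Returns a single (n_samples, M) array for simultaneous playback.
    """
    feeds = list(rendered_feeds)
    mix = np.sum(feeds, axis=0)
    # Optional safety: avoid clipping when several streams are active at once.
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix
```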
Fig. 55 depicts another (more capable) embodiment with additional features. In this example, the rendering modules and mixer 5430b are implemented via control system 160b, which is an example of the control system 160 described above with reference to fig. 1B. In this version, the dashed lines running up and down between all N renderers represent the idea that any one of the N renderers may contribute to the dynamic modification of any of the remaining N-1 renderers. In other words, the rendering of any one of the N program streams may be dynamically modified as a function of a combination of one or more renderings of any of the remaining N-1 program streams. In addition, any one or more of the program streams may be a spatial mix, and the rendering of any program stream, whether spatial or not, may be dynamically modified as a function of any other program stream. The loudspeaker layout information may be provided to the N renderers, e.g., as described above. In some examples, loudspeaker specification information may also be provided to the N renderers. In some implementations, microphone system 5511 may include a set of K microphones (K ≥ 1) within the listening environment. In some examples, the microphone(s) may be attached to, or associated with, one or more of the loudspeakers. These microphones may feed the audio signals they capture (represented by solid lines) and additional configuration information, e.g., their locations (represented by dashed lines), back to the set of N renderers. Any of the N renderers may then be dynamically modified as a function of these additional microphone inputs. Various examples are provided in PCT application US20/43696, filed on 7/27/2020, which is hereby incorporated by reference.
Examples of information derived from the microphone inputs and subsequently used to dynamically modify any of the N renderers include, but are not limited to:
Detection of a particular word or phrase spoken by a user of the system.
An estimate of the location of one or more users of the system.
An estimate of the loudness of any combination of N program streams at a particular location in the listening space.
An estimate of the loudness of other ambient sounds in the listening environment, such as background noise.
Fig. 56 is a flowchart outlining another example of the disclosed methods. As with other methods described herein, the blocks of method 5600 are not necessarily performed in the order indicated. Furthermore, such methods may include more or fewer blocks than shown and/or described. The method 5600 may be performed by an apparatus or system, such as the apparatus 150 shown in fig. 1B and described above. In some examples, the method 5600 may be performed by one of the orchestrated audio devices 2720a-2720n described above with reference to fig. 27A.
In this example, block 5605 relates to receiving, by a control system, a first content stream comprising a first audio signal. The content stream and the first audio signal may vary according to particular implementations. In some cases, the content stream may correspond to a television program, movie, music, podcast, or the like.
According to this example, block 5610 involves rendering, by the control system, the first audio signal to produce a first audio playback signal. The first audio playback signal may be or may comprise a loudspeaker feed signal for a loudspeaker system of the audio device.
In this example, block 5615 relates to generating, by the control system, a first calibration signal. According to this example, the first calibration signal corresponds to a signal referred to herein as an acoustic calibration signal. In some cases, the first calibration signal may be generated by one or more calibration signal generator modules, such as calibration signal generator 2725 described above with reference to fig. 27A.
According to this example, block 5620 involves inserting, by the control system, the first calibration signal into the first audio playback signal to generate the first modified audio playback signal. In some examples, block 5620 may be performed by calibration signal injector 2723 described above with reference to fig. 27A.
In this example, block 5625 relates to causing, by the control system, the loudspeaker system to play back the first modified audio playback signal to generate a first audio device playback sound. In some examples, block 5625 may involve the control system controlling the loudspeaker system 2731 of fig. 27A to play back the first modified audio playback signal to generate the first audio device playback sound.
In some implementations, the method 5600 may involve receiving, by the control system, from the microphone system, microphone signals corresponding to at least the first audio device playback sound and the second audio device playback sound. The second audio device playback sound may correspond to a second modified audio playback signal played back by the second audio device. In some examples, the second modified audio playback signal may include a second calibration signal generated by a second audio device. In some such examples, the method 5600 may involve extracting, by the control system, at least a second calibration signal from the microphone signal.
According to some implementations, the method 5600 may involve receiving, by the control system, from the microphone system, microphone signals corresponding to at least the first audio device playback sound and the second through nth audio device playback sounds. The second through nth audio device playback sounds may correspond to second through nth modified audio playback signals played back by the second through nth audio devices. In some cases, the second through nth modified audio playback signals may include second through nth calibration signals. In some such examples, the method 5600 may involve extracting, by the control system, at least second through nth calibration signals from the microphone signal.
In some implementations, the method 5600 can involve estimating, by the control system, at least one acoustic scene metric based at least in part on the second through nth calibration signals. In some examples, the acoustic scene metric(s) may be or may include time of flight, time of arrival, range, audio device audibility, audio device impulse response, angle between audio devices, audio device location, audio environmental noise, and/or signal-to-noise ratio.
According to some examples, the method 5600 may involve controlling one or more aspects of audio device playback (and/or having one or more aspects of audio device playback controlled) based at least in part on at least one acoustic scene metric and/or at least one audio device characteristic. In some such examples, the orchestration device may control one or more aspects of audio device playback of one or more orchestrated devices based at least in part on the at least one acoustic scene metric and/or the at least one audio device characteristic. In some implementations, the control system of the orchestrated device may be configured to provide at least one acoustic scene metric to the orchestration device. In some such implementations, the control system of the orchestrated device may be configured to receive instructions from the orchestrated device for controlling one or more aspects of playback of the audio device based at least in part on the at least one acoustic scene metric.
According to some examples, the first content stream component of the first audio device playback sound may perceptually mask the first calibration signal component of the first audio device playback sound. In some such examples, the first calibration signal component may be inaudible to humans.
In some examples, method 5600 may involve receiving, by a control system of an orchestrated audio device, one or more calibration signal parameters from the orchestration device. The one or more calibration signal parameters may be used by the control system of the orchestrated audio device to generate the calibration signal.
In some embodiments, the one or more calibration signal parameters may include parameters for scheduling a time slot for playback of the modified audio playback signal. In some such examples, the first time slot of the first audio device may be different from the second time slot of the second audio device.
According to some examples, the one or more calibration signal parameters may include parameters for determining a frequency band for playback of a modified audio playback signal comprising the calibration signal. In some such examples, the first frequency band of the first audio device may be different from the second frequency band of the second audio device.
In some cases, the one or more calibration signal parameters may include a spreading code for generating the calibration signal. In some such examples, the first spreading code of the first audio device may be different from the second spreading code of the second audio device.
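The following Python sketch illustrates one way a spreading-code-based calibration signal might be generated, using a maximal-length sequence modulated onto a carrier. The code length, chip rate, carrier frequency, and sample rate are arbitrary illustrative values, and the choice of an m-sequence is an assumption; orchestrated devices could equally be assigned other low-cross-correlation codes.

```python
import numpy as np
from scipy.signal import max_len_seq

def make_calibration_signal(nbits=10, chip_rate=4000.0, carrier_hz=8000.0,
                            fs=48000.0):
    """Generate a spread-spectrum calibration signal from a spreading code.

    A maximal-length sequence is used here purely as an illustration of a
    spreading code; each device would be assigned a different code (or a
    different phase of the code) so that its signal can be separated later.
    """
    chips, _ = max_len_seq(nbits)                 # 0/1 chips
    symbols = 2.0 * chips - 1.0                   # map to +/-1
    sps = int(round(fs / chip_rate))              # samples per chip
    baseband = np.repeat(symbols, sps)
    t = np.arange(baseband.size) / fs
    return baseband * np.cos(2.0 * np.pi * carrier_hz * t)
```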
In some examples, the method 5600 may involve processing the received microphone signal to produce a preprocessed microphone signal. Some such examples may involve extracting a calibration signal from a preprocessed microphone signal. Processing the received microphone signal may, for example, involve beamforming, applying a band pass filter, and/or echo cancellation.
According to some embodiments, extracting at least the second through nth calibration signals from the microphone signal may involve applying a matched filter to the microphone signal or to a pre-processed version of the microphone signal to produce second through nth delay waveforms. The second through nth delay waveforms may, for example, correspond to each of the second through nth calibration signals. Some such examples may involve applying a low pass filter to each of the second through nth delay waveforms.
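A minimal sketch of such matched filtering follows: the (optionally pre-processed) microphone signal is correlated with a known reference calibration waveform, and the squared output is smoothed with a short moving average standing in for the low-pass filter. The smoothing window length is an illustrative assumption.

```python
import numpy as np
from scipy.signal import fftconvolve

def delay_waveform(mic_signal, reference, fs, smooth_ms=2.0):
    """Matched-filter a microphone signal against a known calibration signal.

    Correlating the microphone signal with the time-reversed reference yields
    a delay waveform whose peaks correspond to propagation delays from the
    emitting device.
    """
    matched = fftconvolve(mic_signal, reference[::-1], mode="valid")
    power = matched ** 2
    win = max(1, int(smooth_ms * 1e-3 * fs))      # moving-average low-pass
    kernel = np.ones(win) / win
    return np.convolve(power, kernel, mode="same")
```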
In some examples, the method 5600 may involve implementing a demodulator via a control system. Some such examples may involve applying a matched filter as part of the demodulation process performed by the demodulator. In some such examples, the output of the demodulation process may be a demodulated coherent baseband signal. Some examples may involve estimating a bulk delay via a control system and providing the bulk delay estimate to a demodulator.
In some examples, the method 5600 may involve implementing, via a control system, a baseband processor configured to perform baseband processing on the demodulated coherent baseband signal. In some such examples, the baseband processor may be configured to output at least one estimated acoustic scene metric. In some examples, baseband processing may involve generating a non-coherent integration delay waveform based on the demodulated coherent baseband signal received during a non-coherent integration period. In some such examples, generating the non-coherent integration delay waveform may involve squaring the demodulated coherent baseband signal received during the non-coherent integration period to generate a squared demodulated baseband signal and integrating the squared demodulated baseband signal. In some examples, the baseband processing may involve applying one or more of a leading edge estimation process, a steered response power estimation process, or a signal-to-noise ratio estimation process to the non-coherent integration delay waveform. Some examples may involve estimating a bulk delay via a control system and providing the bulk delay estimate to the baseband processor.
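The sketch below illustrates non-coherent integration and a simple leading-edge estimate under stated assumptions: the demodulated coherent baseband delay waveforms are supplied as rows of an array, and the leading edge is taken as the first lag exceeding a fixed fraction of the peak, which is only one of many possible estimators.

```python
import numpy as np

def noncoherent_integration(coherent_blocks):
    """Square and sum demodulated coherent baseband blocks.

    coherent_blocks: array of shape (n_blocks, n_lags) containing the
                     demodulated coherent delay waveform for each coherent
                     integration period within the non-coherent period.
    Returns the non-coherent integrated delay waveform (length n_lags).
    """
    blocks = np.asarray(coherent_blocks)
    return np.sum(np.abs(blocks) ** 2, axis=0)

def leading_edge(delay_wave, fs, threshold_ratio=0.3):
    """Simple leading-edge estimate: first lag above a fraction of the peak."""
    w = np.asarray(delay_wave)
    thresh = threshold_ratio * np.max(w)
    idx = int(np.argmax(w >= thresh))
    return idx / fs
```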
According to some examples, the method 5600 may involve estimating, by the control system, second through nth noise power levels at the second through nth audio device sites based on the second through nth delay waveforms. Some such examples may involve generating a distributed noise estimate for the audio environment based at least in part on the second through nth noise power levels.
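A heavily simplified version of such an estimate treats the off-peak region of each device's delay waveform as a noise-floor observation; the guard width around the peak and the collection of the results into a per-device map are assumptions of this sketch, not the method prescribed above.

```python
# Simplified noise-floor sketch; guard width is an assumption.
import numpy as np

def noise_floor_db(delay_waveform, guard_lags=500):
    """Estimate a noise power level from one device's delay waveform by
    excluding the region around its strongest (direct-path) peak."""
    peak = int(np.argmax(delay_waveform))
    mask = np.ones(delay_waveform.size, dtype=bool)
    mask[max(0, peak - guard_lags):peak + guard_lags] = False
    return float(10 * np.log10(np.mean(delay_waveform[mask]) + 1e-12))

# Collecting one such level per remote device gives a simple distributed
# noise picture, e.g. {"device_2": -52.1, "device_3": -47.8, ...}.
```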
In some examples, the method 5600 may involve receiving a gap instruction from the orchestration device and inserting a first gap into a first frequency range of the first audio playback signal or the first modified audio playback signal during a first time interval of the first content stream according to the first gap instruction. The first gap may be an attenuation of the first audio playback signal in the first frequency range. In some examples, the first modified audio playback signal and the first audio device playback sound include a first gap.
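One way to realise such a gap, sketched below under the assumption of a subtract-the-band approach with zero-phase filtering, is to isolate the target band over the target interval and subtract most of its energy; the band edges, attenuation depth, and function names are illustrative only.

```python
# Forced-gap insertion sketch; parameters are assumptions.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def insert_gap(playback, fs, t_start, t_stop, f_lo, f_hi, atten_db=-30.0):
    """Attenuate the [f_lo, f_hi] band of the playback signal by roughly
    atten_db during the interval [t_start, t_stop] seconds."""
    out = playback.astype(float)
    a, b = int(t_start * fs), int(t_stop * fs)
    sos = butter(4, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
    in_band = sosfiltfilt(sos, out[a:b])                 # zero-phase band isolation
    out[a:b] = out[a:b] - (1.0 - 10 ** (atten_db / 20.0)) * in_band
    return out
```

A practical implementation would also ramp the attenuation in and out so that the edges of the gap are not audible.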
According to some examples, the gap instructions may include instructions for controlling the gap insertion and the calibration signal generation such that the calibration signal corresponds to neither the gap time interval nor the gap frequency range. In some examples, the gap instructions may include instructions to extract target device audio data and/or audio ambient noise data from the received microphone data.
According to some examples, the method 5600 may involve estimating, by the control system, at least one acoustic scene metric based at least in part on data extracted from the received microphone data, while playback sounds produced by one or more audio devices of the audio environment include one or more gaps. In some such examples, the acoustic scene metric(s) include one or more of time of flight, time of arrival, range, audio device audibility, audio device impulse response, angle between audio devices, audio device location, audio ambient noise, and/or signal-to-noise ratio.
According to some embodiments, the control system may be configured to implement a wake word detector. In some such examples, the method 5600 may involve detecting wake words in the received microphone signal. According to some examples, the method 5600 may involve determining one or more acoustic scene metrics based on wake word detection data received from the wake word detector.
In some such examples, the method 5600 may involve implementing a noise compensation function. According to some such examples, the noise compensation function may be implemented in response to ambient noise detected by "listening" through forced gaps that have been inserted into the playback audio data.
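As a minimal sketch of such a policy, assuming a per-band target signal-to-noise ratio and a bounded boost (both values are assumptions), the compensation gain might be computed as follows; a real system would typically operate per frequency band and smooth gain changes over time.

```python
# Minimal noise-compensation policy sketch; targets are assumptions.
def compensation_gain_db(playback_level_db, noise_level_db,
                         target_snr_db=12.0, max_boost_db=10.0):
    """Extra playback gain needed so that the estimated playback-to-noise
    ratio reaches target_snr_db, capped at max_boost_db."""
    shortfall = target_snr_db - (playback_level_db - noise_level_db)
    return float(min(max(shortfall, 0.0), max_boost_db))
```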
According to some examples, rendering may be performed by a rendering module implemented by a control system. In some such examples, the rendering module may be configured to perform rendering based at least in part on the rendering instructions received from the orchestration device. According to some such examples, the rendering instructions may include instructions from a rendering configuration generator, a user region classifier, and/or an orchestration module of an orchestration device.
Various features and aspects will be understood from the following enumerated example embodiments ("EEEs"):
EEE1. An apparatus comprising:
an interface system; and
a control system configured to implement an orchestration module configured to:
causing a first programmed audio device of the audio environment to generate a first calibration signal;
causing the first orchestrated audio device to insert the first calibration signal into a first audio playback signal corresponding to the first content stream to generate a first modified audio playback signal for the first orchestrated audio device;
causing the first programmed audio device to play back the first modified audio playback signal to generate a first programmed audio device playback sound;
causing a second programmed audio device of the audio environment to generate a second calibration signal;
causing the second orchestrated audio device to insert a second calibration signal into the second content stream to generate a second modified audio playback signal for the second orchestrated audio device;
causing the second programmed audio device to play back the second modified audio playback signal to generate a second programmed audio device playback sound;
causing at least one microphone of at least one orchestrated audio device in the audio environment to detect at least a first orchestrated audio device playback sound and a second orchestrated audio device playback sound and to generate microphone signals corresponding to at least the first orchestrated audio device playback sound and the second orchestrated audio device playback sound;
causing at least one programmed audio device to extract a first calibration signal and a second calibration signal from the microphone signal; and
causing the at least one orchestrated audio device to estimate at least one acoustic scene metric based at least in part on the first calibration signal and the second calibration signal.
EEE2. The apparatus of EEE1 wherein the first calibration signal corresponds to a first sub-audio component of the playback sound of the first programmed audio device, and wherein the second calibration signal corresponds to a second sub-audio component of the playback sound of the second programmed audio device.
EEE3. The apparatus of EEE1 or EEE2, wherein the first calibration signal comprises a first DSSS signal and wherein the second calibration signal comprises a second DSSS signal.
EEE4. The apparatus of any of EEEs 1-3, wherein the orchestration module is further configured to:
causing the first programmed audio device to insert a first gap into a first frequency range of the first audio playback signal or the first modified audio playback signal during a first time interval of the first content stream, the first gap comprising attenuation of the first audio playback signal within the first frequency range, the first modified audio playback signal and the first programmed audio device playback sound comprising the first gap;
causing the second programmed audio device to insert the first gap into the first frequency range of the second audio playback signal or the second modified audio playback signal during the first time interval, the second modified audio playback signal and the second programmed audio device playback sound comprising the first gap;
causing extraction of audio data from the microphone signal in at least a first frequency range to produce extracted audio data; and
at least one acoustic scene metric is caused to be determined based at least in part on the extracted audio data.
EEE5. The apparatus of EEE4, wherein the orchestration module is further configured to control gap insertion and calibration signal generation such that the calibration signal corresponds to neither the gap time interval nor the gap frequency range.
EEE6. The apparatus of EEE4 or EEE5 wherein the orchestration module is further configured to control gap insertion and calibration signal generation based at least in part on time since noise was estimated in at least one frequency band.
EEE7. The apparatus of any of EEEs 4-6, wherein the orchestration module is further configured to control gap insertion and calibration signal generation based at least in part on a signal-to-noise ratio of a calibration signal of the at least one orchestrated audio device in the at least one frequency band.
EEE8. The apparatus of any of EEEs 4-7, wherein the orchestration module is further configured to:
causing the targeted audio device to play back an unmodified audio playback signal of the targeted device content stream to generate a targeted audio device playback sound; and
causing the at least one orchestrated audio device to estimate at least one of target orchestrated audio device audibility or target orchestrated audio device location based at least in part on the extracted audio data, wherein:
the unmodified audio playback signal does not include the first gap; and
the microphone signal also corresponds to the targeted audio device playback sound.
EEE9. The apparatus of EEE8, wherein the unmodified audio playback signal does not include gaps inserted into any frequency range.
EEE10. The apparatus of any one of EEEs 1-9, wherein the at least one acoustic scene metric comprises one or more of time of flight, time of arrival, direction of arrival, range, audio device audibility, audio device impulse response, angle between audio devices, audio device location, audio environmental noise, or signal-to-noise ratio.
EEE11. The apparatus of any of EEEs 1-10, further comprising an acoustic scene metric aggregator, wherein the orchestration module is further configured to cause a plurality of orchestrated audio devices in the audio environment to transmit at least one acoustic scene metric to the apparatus, and wherein the acoustic scene metric aggregator is configured to aggregate acoustic scene metrics received from the plurality of orchestrated audio devices.
EEE12. The apparatus of EEE11 wherein the orchestration module is further configured to implement an acoustic scene metric processor configured to receive the aggregated acoustic scene metrics from the acoustic scene metric aggregator.
EEE13. The apparatus of EEE12 wherein the orchestration module is further configured to control one or more aspects of the audio device orchestration based at least in part on input from the acoustic scene metric processor.
EEE14. The apparatus of any of EEEs 11-13, wherein the control system is further configured to implement a user zone classifier configured to receive one or more acoustic scene metrics and estimate a zone of an audio environment in which the person is currently located based at least in part on the one or more received acoustic scene metrics.
EEE15. The apparatus of any of EEEs 11-14, wherein the control system is further configured to implement a noise estimator configured to receive one or more acoustic scene metrics and estimate noise in the audio environment based at least in part on the one or more received acoustic scene metrics.
EEE16. The apparatus of any of EEEs 11-15, wherein the control system is further configured to implement an acoustic proximity estimator configured to receive one or more acoustic scene metrics and estimate an acoustic proximity of one or more sound sources in the audio environment based at least in part on the one or more received acoustic scene metrics.
EEE17. The apparatus of any of EEEs 11-16, wherein the control system is further configured to implement a geometric proximity estimator configured to receive one or more acoustic scene metrics and estimate a geometric proximity of one or more sound sources in the audio environment based at least in part on the one or more received acoustic scene metrics.
EEE18. The apparatus of EEE16 or EEE17, wherein the control system is further configured to implement a rendering configuration module configured to determine a rendering configuration of the orchestrated audio device based at least in part on an estimated geometric proximity or an estimated acoustic proximity of one or more sound sources in the audio environment.
EEE19. The apparatus of any one of EEEs 1-18, wherein the first content stream component of the first programmed audio device playback sound results in a perceptual masking of the first calibration signal component of the first programmed audio device playback sound, and wherein the second content stream component of the second programmed audio device playback sound results in a perceptual masking of the second calibration signal component of the second programmed audio device playback sound.
EEE20. The apparatus of any one of EEEs 1-19, wherein the orchestration module is further configured to:
causing third through nth programmed audio devices of the audio environment to generate third through nth calibration signals;
causing the third through nth orchestrated audio devices to insert third through nth calibration signals into the third through nth content streams to generate third through nth modified audio playback signals for the third through nth orchestrated audio devices; and
causing the third through nth programmed audio devices to play back corresponding instances of the third through nth modified audio playback signals to generate third through nth instances of audio device playback sound.
EEE21. The apparatus of EEE20, wherein the orchestration module is further configured to:
causing at least one microphone of each of the first through nth programmed audio devices to detect first through nth instances of audio device playback sound and generate microphone signals corresponding to the first through nth instances of audio device playback sound, the first through nth instances of audio device playback sound including the first programmed audio device playback sound, the second programmed audio device playback sound, and third through nth instances of audio device playback sound; and
causing first through nth calibration signals to be extracted from the microphone signal, wherein at least one acoustic scene metric is estimated based at least in part on the first through nth calibration signals.
EEE22. The apparatus of any one of EEEs 1-21, wherein the orchestration module is further configured to:
determining one or more calibration signal parameters for a plurality of orchestrated audio devices in an audio environment, the one or more calibration signal parameters being usable to generate a calibration signal; and
providing the one or more calibration signal parameters to each of the plurality of orchestrated audio devices.
EEE23. The apparatus of EEE22 wherein determining the one or more calibration signal parameters involves scheduling a time slot for playback of the modified audio playback signal for each of the plurality of programmed audio devices, wherein a first time slot of a first programmed audio device is different from a second time slot of a second programmed audio device.
EEE24. The apparatus of EEE22 or EEE23 wherein determining the one or more calibration signal parameters comprises determining a frequency band for each of the plurality of programmed audio devices to play back a modified audio playback signal.
EEE25. The apparatus of EEE24 wherein the first frequency band of the first programmed audio device is different from the second frequency band of the second programmed audio device.
EEE26. The apparatus of any of EEEs 22-25, wherein determining the one or more calibration signal parameters involves determining a DSSS spreading code for each of the plurality of programmed audio devices.
EEE27. The apparatus of EEE26 wherein the first spreading code of the first programmed audio device is different from the second spreading code of the second programmed audio device.
EEE28. The apparatus of EEE26 or EEE27, wherein the orchestration module is further configured to determine at least one spreading code length based at least in part on audibility of the corresponding programmed audio device.
EEE29. The apparatus of any of EEEs 22-28, wherein determining the one or more calibration signal parameters involves applying an acoustic model based at least in part on a mutual audibility of each of a plurality of programmed audio devices in an audio environment.
EEE30. The apparatus of any of EEEs 22-29, wherein the orchestration module is further configured to: determine that the calibration signal parameters of the programmed audio device are at a maximum level of robustness; determine that the calibration signal from the programmed audio device cannot be successfully extracted from the microphone signal; and cause all other programmed audio devices to mute at least a portion of their corresponding programmed audio device playback sounds.
EEE31. The apparatus of EEE30 wherein the portion comprises a calibration signal component.
EEE32. The apparatus of any of EEEs 1-31, wherein the orchestration module is further configured to cause each of a plurality of orchestrated audio devices in the audio environment to play back the modified audio playback signal simultaneously.
EEE33. The apparatus of any one of EEEs 1-32, wherein at least a portion of the first audio playback signal, at least a portion of the second audio playback signal, or at least a portion of each of the first audio playback signal and the second audio playback signal corresponds to silence.
EEE34. An apparatus comprising:
a loudspeaker system comprising at least one loudspeaker;
a microphone system comprising at least one microphone; and
a control system configured to:
receiving a first content stream, the first content stream comprising a first audio signal;
rendering the first audio signal to generate a first audio playback signal;
generating a first calibration signal;
inserting the first calibration signal into the first audio playback signal to generate a first modified audio playback signal; and
the loudspeaker system is caused to play back the first modified audio playback signal to generate a first audio device playback sound.
EEE35. The apparatus of EEE34 wherein the control system comprises:
a calibration signal generator configured to generate a calibration signal;
a calibration signal modulator configured to modulate the calibration signal generated by the calibration signal generator to generate a first calibration signal; and
a calibration signal injector configured to insert a first calibration signal into the first audio playback signal to generate a first modified audio playback signal.
EEE36. The apparatus of EEE34 or EEE35 wherein the control system is further configured to:
receiving from the microphone system at least microphone signals corresponding to first audio device playback sounds and second audio device playback sounds, the second audio device playback sounds corresponding to second modified audio playback signals played back by the second audio device, the second modified audio playback signals including a second calibration signal; and
at least a second calibration signal is extracted from the microphone signal.
EEE37. The apparatus of EEE34 or EEE35 wherein the control system is further configured to:
receiving from the microphone system at least microphone signals corresponding to first audio device playback sounds and second to nth audio device playback sounds, the second to nth audio device playback sounds corresponding to second to nth modified audio playback signals played back by the second to nth audio devices, the second to nth modified audio playback signals including second to nth calibration signals; and
at least second through nth calibration signals are extracted from the microphone signal.
EEE38. The apparatus of EEE37, wherein the control system is further configured to estimate at least one acoustic scene metric based at least in part on the second through nth calibration signals.
EEE39. The apparatus of EEE38, wherein the at least one acoustic scene metric comprises one or more of time of flight, time of arrival, range, audio device audibility, audio device impulse response, angle between audio devices, audio device location, audio environmental noise or signal-to-noise ratio.
EEE40. The apparatus of EEE38 or EEE39, wherein the control system is further configured to provide at least one acoustic scene metric to the orchestration device and receive instructions from the orchestration device for controlling one or more aspects of playback of the audio device based at least in part on the at least one acoustic scene metric.
EEE41. The apparatus of any one of EEEs 34-40, wherein the first content stream component of the first audio device playback sound results in a perceptual masking of the first calibration signal component of the first audio device playback sound.
EEE42. The apparatus of any one of EEEs 34-41, wherein the control system is configured to receive one or more calibration signal parameters from the orchestration device, the one or more calibration signal parameters being usable to generate the calibration signal.
EEE43. The apparatus of EEE42, wherein the one or more calibration signal parameters comprise a parameter for scheduling a time slot for playback of the modified audio playback signal, wherein a first time slot of the first audio device is different from a second time slot of the second audio device.
EEE44. The apparatus of EEE42 wherein the one or more calibration signal parameters comprise parameters for determining a frequency band of the calibration signal.
EEE45. The apparatus of EEE44 wherein the first frequency band of the first audio device is different from the second frequency band of the second audio device.
EEE46. The apparatus of any one of EEEs 42-45, wherein the one or more calibration signal parameters comprise a spreading code for generating a calibration signal.
EEE47. The apparatus of EEE46 wherein the first spreading code of the first audio device is different from the second spreading code of the second audio device.
EEE48. The apparatus of any one of EEEs 35-47, wherein the control system is further configured to process the received microphone signal to produce a pre-processed microphone signal, wherein the control system is configured to extract the calibration signal from the pre-processed microphone signal.
EEE49. The apparatus of EEE48, wherein processing the received microphone signals involves one or more of beamforming, applying a bandpass filter, or echo cancellation.
EEE50. The apparatus of any one of EEEs 37-49, wherein extracting at least second through nth calibration signals from the microphone signals involves applying a matched filter to the microphone signals or a pre-processed version of the microphone signals to produce second through nth delay waveforms, the second through nth delay waveforms corresponding to each of the second through nth calibration signals.
EEE51. The apparatus of EEE50 wherein the control system is further configured to apply a low pass filter to each of the second through nth delay waveforms.
EEE52. The apparatus of EEE50 or EEE51, wherein:
the control system is configured to implement a demodulator;
applying a matched filter is part of the demodulation process performed by the demodulator; and
the output of the demodulation process is a demodulated coherent baseband signal.
EEE53. The apparatus of EEE52 wherein the control system is further configured to estimate the bulk delay and provide the bulk delay estimate to the demodulator.
EEE54. The apparatus of EEE52 or EEE53 wherein the control system is further configured to implement a baseband processor configured to baseband process the demodulated coherent baseband signal, and wherein the baseband processor is configured to output at least one estimated acoustic scene metric.
EEE55. The apparatus of EEE54 wherein the baseband processing involves generating a non-coherent integration delay waveform based on a demodulated coherent baseband signal received during a non-coherent integration period.
EEE56. The apparatus of EEE55, wherein generating the non-coherent integration delay waveform involves squaring a demodulated coherent baseband signal received during the non-coherent integration period to generate a squared demodulated baseband signal and integrating the squared demodulated baseband signal.
EEE57. The apparatus of EEE55 or EEE56, wherein the baseband processing involves applying one or more of a leading edge estimation process, a steered response power estimation process, or a signal-to-noise ratio estimation process to the non-coherent integration delay waveform.
EEE58. The apparatus of any one of EEEs 54-57, wherein the control system is further configured to estimate a bulk delay and provide the bulk delay estimate to the baseband processor.
EEE59. The apparatus of any of EEEs 50-58, wherein the control system is further configured to estimate second through nth noise power levels at second through nth audio device sites based on the second through nth delay waveforms.
EEE60. The apparatus of EEE59, wherein the control system is further configured to generate a distributed noise estimate for the audio environment based at least in part on the second through nth noise power levels.
EEE61. The apparatus of any of EEEs 34-60, wherein the control system is further configured to receive a gap instruction from the orchestration device and insert a first gap into a first frequency range of the first audio playback signal or the first modified audio playback signal during a first time interval of the first content stream according to the first gap instruction, the first gap comprising attenuation of the first audio playback signal within the first frequency range, the first modified audio playback signal and the first audio device playback sound comprising the first gap.
EEE62. The apparatus of EEE61, wherein the gap instructions comprise instructions for controlling gap insertion and calibration signal generation such that the calibration signal corresponds to neither the gap time interval nor the gap frequency range.
EEE63. The apparatus of EEE61 or EEE62, wherein the gap instructions comprise instructions for extracting at least one of target device audio data or audio ambient noise data from the received microphone data.
EEE64. The apparatus of any of EEEs 61-63, wherein the control system is further configured to estimate at least one acoustic scene metric based at least in part on data extracted from the received microphone data, while playback sounds produced by one or more audio devices of the audio environment include one or more gaps.
EEE65. The apparatus of EEE64, wherein the at least one acoustic scene metric comprises one or more of time of flight, time of arrival, range, audio device audibility, audio device impulse response, angle between audio devices, audio device location, audio environmental noise, or signal-to-noise ratio.
EEE66. The apparatus of EEE64 or EEE65, wherein the control system is further configured to provide at least one acoustic scene metric to the orchestration device and receive instructions from the orchestration device for controlling one or more aspects of playback of the audio device based at least in part on the at least one acoustic scene metric.
EEE67. The apparatus of any one of EEEs 34-66, wherein the control system is further configured to implement a wake word detector configured to detect wake words in the received microphone signal.
EEE68. The apparatus of any of EEEs 34-67, wherein the control system is further configured to determine one or more acoustic scene metrics based on wake word detection data received from the wake word detector.
EEE69. The apparatus of any one of EEEs 34-68, wherein the control system is further configured to implement a noise compensation function.
EEE70. The apparatus of any of EEEs 34-69, wherein rendering is performed by a rendering module implemented by the control system, and wherein the rendering module is further configured to perform rendering based at least in part on rendering instructions received from the orchestration device.
EEE71. The apparatus of EEE70, wherein the rendering instructions comprise instructions from at least one of a rendering configuration generator, a user zone classifier, or an orchestration module.
Some aspects of the disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, as well as a tangible computer-readable medium (e.g., a disk) storing code for implementing one or more examples of the disclosed methods or steps thereof. For example, some disclosed systems may be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware, and/or otherwise configured to perform any of a variety of operations on data, including embodiments of the disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, memory, and a processing subsystem programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.
Some embodiments may be implemented as a configurable (e.g., programmable) Digital Signal Processor (DSP) that is configured (e.g., programmed or otherwise configured) to perform the required processing on the audio signal(s), including the execution of one or more examples of the disclosed methods. In the alternative, embodiments of the disclosed system (or elements thereof) may be implemented as a general-purpose processor (e.g., a Personal Computer (PC) or other computer system or microprocessor, which may include an input device and memory) programmed and/or otherwise configured with software or firmware to perform any of a variety of operations, including one or more examples of the disclosed methods. Alternatively, elements of some embodiments of the inventive system may be implemented as a general-purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system may also include other elements (e.g., one or more speakers and/or one or more microphones). A general purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or keyboard), memory, and a display device.
Another aspect of the disclosure is a computer-readable medium (e.g., a disk or other tangible storage medium) storing code for performing (e.g., an encoder executable to perform) one or more examples of the disclosed method or steps thereof.
While specific embodiments of the present disclosure and applications of the present disclosure have been described herein, it will be apparent to those skilled in the art that many changes can be made to the embodiments and applications described herein and claimed herein without departing from the scope of the described disclosure. It is to be understood that while certain forms of the disclosure have been illustrated and described, the disclosure is not to be limited to the specific embodiments described and illustrated or the specific methods described.

Claims (31)

1. An audio processing method, comprising:
causing, by the control system, a first audio device of the audio environment to generate a first calibration signal;
causing, by the control system, the first calibration signal to be inserted into a first audio playback signal corresponding to the first content stream to generate a first modified audio playback signal for the first audio device;
causing, by the control system, the first audio device to play back the first modified audio playback signal to generate a first audio device playback sound;
causing, by the control system, a second audio device of the audio environment to generate a second calibration signal;
causing, by the control system, a second calibration signal to be inserted into the second content stream to generate a second modified audio playback signal for the second audio device;
causing, by the control system, the second audio device to play back the second modified audio playback signal to generate a second audio device playback sound;
causing, by the control system, at least one microphone of the audio environment to detect at least a first audio device playback sound and a second audio device playback sound, and generating microphone signals corresponding to at least the first audio device playback sound and the second audio device playback sound;
causing, by the control system, extraction of a first calibration signal and a second calibration signal from the microphone signal; and
at least one acoustic scene metric is caused to be estimated by the control system based at least in part on the first calibration signal and the second calibration signal.
2. The audio processing method of claim 1, wherein the first calibration signal corresponds to a first sub-audio component of the first audio device playback sound, and wherein the second calibration signal corresponds to a second sub-audio component of the second audio device playback sound.
3. The audio processing method of claim 1 or claim 2, wherein the first calibration signal comprises a first DSSS signal, and wherein the second calibration signal comprises a second DSSS signal.
4. The audio processing method of any one of claims 1 to 3, further comprising:
causing, by the control system, a first gap to be inserted into a first frequency range of the first audio playback signal or the first modified audio playback signal during a first time interval of the first content stream, the first gap comprising attenuation of the first audio playback signal within the first frequency range, the first modified audio playback signal and the first audio device playback sound comprising the first gap;
causing, by the control system, the first gap to be inserted into a first frequency range of the second audio playback signal or the second modified audio playback signal during a first time interval, the second modified audio playback signal and the second audio device playback sound comprising the first gap;
extracting, by the control system, audio data from the microphone signal in at least a first frequency range to produce extracted audio data; and
at least one acoustic scene metric is estimated by the control system based at least in part on the extracted audio data.
5. The audio processing method of claim 4, further comprising controlling the gap insertion and the calibration signal generation such that the calibration signal corresponds to neither the gap time interval nor the gap frequency range.
6. The audio processing method of claim 4 or claim 5, further comprising controlling gap insertion and calibration signal generation based at least in part on time since noise was estimated in at least one frequency band.
7. The audio processing method of any of claims 4-6, further comprising controlling gap insertion and calibration signal generation based at least in part on a signal-to-noise ratio of a calibration signal of at least one audio device in at least one frequency band.
8. The audio processing method according to any one of claims 4 to 7, further comprising:
causing the target audio device to play back an unmodified audio playback signal of the target device content stream to generate target audio device playback sound; and
estimating, by the control system, at least one of target audio device audibility or target audio device location based at least in part on the extracted audio data, wherein:
the unmodified audio playback signal does not include the first gap; and is also provided with
The microphone signal also corresponds to the target audio device playback sound.
9. The audio processing method of claim 8, wherein the unmodified audio playback signal does not include gaps inserted into any frequency range.
10. The audio processing method of any of claims 1-9, wherein the at least one acoustic scene metric comprises one or more of time of flight, time of arrival, direction of arrival, range, audio device audibility, audio device impulse response, angle between audio devices, audio device location, audio environmental noise, or signal to noise ratio.
11. The audio processing method of any of claims 1-10, wherein causing the at least one acoustic scene metric to be estimated involves either estimating the at least one acoustic scene metric or causing another device to estimate at least one acoustic scene metric.
12. The audio processing method of any of claims 1-11, further comprising controlling one or more aspects of audio device playback based at least in part on the at least one acoustic scene metric.
13. The audio processing method of any of claims 1-12, wherein the first content stream component of the first audio device playback sound results in a perceptual masking of the first calibration signal component of the first audio device playback sound, and wherein the second content stream component of the second audio device playback sound results in a perceptual masking of the second calibration signal component of the second audio device playback sound.
14. The audio processing method according to any one of claims 1 to 13, wherein the control system is an orchestration device control system.
15. The audio processing method of any one of claims 1 to 14, further comprising:
causing, by the control system, third through nth calibration signals to be generated by third through nth audio devices of the audio environment;
causing, by the control system, third through nth calibration signals to be inserted into the third through nth content streams to generate third through nth modified audio playback signals for the third through nth audio devices; and
the third through nth audio devices are caused by the control system to play back corresponding instances of the third through nth modified audio playback signals to generate third through nth instances of audio device playback sounds.
16. The audio processing method of claim 15, further comprising:
causing, by the control system, at least one microphone of each of the first through nth audio devices to detect first through nth instances of audio device playback sound and to generate microphone signals corresponding to the first through nth instances of audio device playback sound, the first through nth instances of audio device playback sound including the first audio device playback sound, the second audio device playback sound, and third through nth instances of audio device playback sound; and
causing, by the control system, extraction of first through nth calibration signals from the microphone signal, wherein the at least one acoustic scene metric is estimated based at least in part on the first through nth calibration signals.
17. The audio processing method of any one of claims 1 to 16, further comprising:
determining one or more calibration signal parameters for a plurality of audio devices in an audio environment, the one or more calibration signal parameters being usable to generate a calibration signal; and
providing the one or more calibration signal parameters to each of the plurality of audio devices.
18. The audio processing method of claim 17, wherein determining the one or more calibration signal parameters involves scheduling a time slot for playback of a modified audio playback signal for each of the plurality of audio devices, wherein a first time slot of a first audio device is different from a second time slot of a second audio device.
19. The audio processing method of claim 17, wherein determining the one or more calibration signal parameters involves determining a frequency band for playback of a modified audio playback signal for each of the plurality of audio devices.
20. The audio processing method of claim 19, wherein the first frequency band of the first audio device is different from the second frequency band of the second audio device.
21. The audio processing method of any of claims 17-20, wherein determining the one or more calibration signal parameters involves determining a DSSS spreading code for each of the plurality of audio devices.
22. The audio processing method of claim 21, wherein the first spreading code of the first audio device is different from the second spreading code of the second audio device.
23. The audio processing method of claim 21 or claim 22, further comprising determining at least one spreading code length based at least in part on audibility of the corresponding audio device.
24. The audio processing method of any of claims 17-23, wherein determining the one or more calibration signal parameters involves applying an acoustic model based at least in part on mutual audibility of each of a plurality of audio devices in an audio environment.
25. The audio processing method of any of claims 17-24, further comprising:
determining that a calibration signal parameter of the audio device is at a maximum robustness level;
determining that a calibration signal from the audio device cannot be successfully extracted from the microphone signal; and
causing all other audio devices to mute at least a portion of their respective audio device playback sounds.
26. The audio processing method of claim 25, wherein the portion comprises a calibration signal component.
27. The audio processing method of any of claims 1-26, further comprising causing each of a plurality of audio devices in an audio environment to play back the modified audio playback signal simultaneously.
28. The audio processing method of any of claims 1-27, wherein at least a portion of the first audio playback signal, at least a portion of the second audio playback signal, or at least a portion of each of the first audio playback signal and the second audio playback signal corresponds to silence.
29. An apparatus configured to perform the method of any of claims 1-28.
30. A system configured to perform the method of any of claims 1-28.
31. One or more non-transitory media having software stored thereon, the software comprising instructions for controlling one or more devices to perform the method of any of claims 1-28.
CN202180089790.3A 2020-12-03 2021-12-02 Pervasive acoustic mapping Pending CN116830599A (en)

Applications Claiming Priority (18)

Application Number Priority Date Filing Date Title
US63/120,887 2020-12-03
ESP202031212 2020-12-03
US63/121,085 2020-12-03
US63/121,007 2020-12-03
US63/120,963 2020-12-03
US63/155,369 2021-03-02
US63/201,561 2021-05-04
ESP202130458 2021-05-20
US63/203,403 2021-07-21
US63/224,778 2021-07-22
ESP202130724 2021-07-26
US63/260,528 2021-08-24
US63/260,529 2021-08-24
US63/260,953 2021-09-07
US63/260,954 2021-09-07
US202163261769P 2021-09-28 2021-09-28
US63/261,769 2021-09-28
PCT/IB2021/000788 WO2022118072A1 (en) 2020-12-03 2021-12-02 Pervasive acoustic mapping

Publications (1)

Publication Number Publication Date
CN116830599A true CN116830599A (en) 2023-09-29

Family

ID=88080846

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202180089842.7A Pending CN116806431A (en) 2020-12-03 2021-12-02 Audibility at user location through mutual device audibility
CN202180089790.3A Pending CN116830599A (en) 2020-12-03 2021-12-02 Pervasive acoustic mapping

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202180089842.7A Pending CN116806431A (en) 2020-12-03 2021-12-02 Audibility at user location through mutual device audibility

Country Status (1)

Country Link
CN (2) CN116806431A (en)

Also Published As

Publication number Publication date
CN116806431A (en) 2023-09-26

Similar Documents

Publication Publication Date Title
US10972835B2 (en) Conference system with a microphone array system and a method of speech acquisition in a conference system
KR101619578B1 (en) Apparatus and method for geometry-based spatial audio coding
US9484038B2 (en) Apparatus and method for merging geometry-based spatial audio coding streams
JP2024501426A (en) pervasive acoustic mapping
US20220272454A1 (en) Managing playback of multiple streams of audio over multiple speakers
US11750997B2 (en) System and method for providing a spatialized soundfield
US10728662B2 (en) Audio mixing for distributed audio sensors
US20220337969A1 (en) Adaptable spatial audio playback
US20230319190A1 (en) Acoustic echo cancellation control for distributed audio devices
US20230026347A1 (en) Methods for reducing error in environmental noise compensation systems
Kindt et al. 2d acoustic source localisation using decentralised deep neural networks on distributed microphone arrays
CN116830599A (en) Pervasive acoustic mapping
CN107113499B (en) Directional audio capturing
Pasha et al. A survey on ad hoc signal processing: Applications, challenges and state-of-the-art techniques
US20240048931A1 (en) Orchestration of acoustic direct sequence spread spectrum signals for estimation of acoustic scene metrics
US20240056757A1 (en) Orchestration of acoustic direct sequence spread spectrum signals for estimation of acoustic scene metrics
CN116584112A (en) Estimating acoustic scene metrics using acoustic direct sequence spread spectrum signals
CN116569567A (en) Orchestrating acoustic direct sequence spread spectrum signals to estimate acoustic scene metrics
WO2022119990A1 (en) Audibility at user location through mutual device audibility
Wilson et al. Audiovisual arrays for untethered spoken interfaces
Huang Spatial auditory processing for a hearing robot
US20240107255A1 (en) Frequency domain multiplexing of spatial audio for multiple listener sweet spots
Pasha Analysis and Enhancement of Spatial Sound Scenes Recorded using Ad-Hoc Microphone Arrays
WO2023086303A1 (en) Rendering based on loudspeaker orientation
WO2023086273A1 (en) Distributed audio device ducking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination