CN112640475B - System and method for associating playback devices with voice assistant services - Google Patents

System and method for associating playback devices with voice assistant services

Info

Publication number
CN112640475B
CN112640475B CN201980056604.9A CN201980056604A CN112640475B CN 112640475 B CN112640475 B CN 112640475B CN 201980056604 A CN201980056604 A CN 201980056604A CN 112640475 B CN112640475 B CN 112640475B
Authority
CN
China
Prior art keywords
playback
playback device
user
devices
vas
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201980056604.9A
Other languages
Chinese (zh)
Other versions
CN112640475A
Inventor
盛·吴
约翰·G·托洛美
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sonos Inc
Original Assignee
Sonos Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 16/022,662 (US10681460B2)
Application filed by Sonos Inc
Priority to CN202311238129.1A (CN117316150A)
Publication of CN112640475A
Application granted
Publication of CN112640475B
Legal status: Active

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41Structure of client; Structure of client peripherals
    • H04N21/422Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42203Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] sound input device, e.g. microphone
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/165Management of the audio stream, e.g. setting of volume, audio stream path
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00Data switching networks
    • H04L12/28Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L12/2803Home automation networks
    • H04L12/2816Controlling appliance services of a home automation network by calling their functionalities
    • H04L12/282Controlling appliance services of a home automation network by calling their functionalities based on user interaction within the home
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47202End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for requesting content on demand, e.g. video on demand
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47217End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for controlling playback functions for recorded or on-demand content, e.g. using progress bars, mode or play-point indicators or bookmarks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00Data switching networks
    • H04L12/28Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L12/2803Home automation networks
    • H04L2012/2847Home automation networks characterised by the type of home appliance used
    • H04L2012/2849Audio/video appliances

Abstract

A system and method for media playback via a media playback system includes: detecting a first wake word via a first network microphone device of a first playback device; detecting a second wake word via a second network microphone device of a second playback device; and forming a bonded zone including the first playback device and the second playback device. In response to detecting the first wake word, a first voice utterance following the first wake word is sent to a first voice assistant service. In response to detecting the second wake word, a second voice utterance following the second wake word is sent to a second voice assistant service. Requested media content received from the first voice assistant service and/or the second voice assistant service is played back via the first playback device and the second playback device in synchrony with each other.

Description

System and method for associating playback devices with voice assistant services
Cross Reference to Related Applications
The present application claims the benefit of priority to U.S. Provisional Patent Application No. 62/691,587, filed June 28, 2018, and U.S. Patent Application No. 16/022,662, filed June 28, 2018, the contents of which are incorporated herein by reference in their entirety.
Technical Field
The present technology relates to consumer products, and more particularly, to methods, systems, products, features, services, and other elements directed to associating playback devices with voice assistant services or some aspect thereof.
Background
Options for accessing and listening to digital audio in an out-loud setting were limited until SONOS, Inc. filed one of its first patent applications in 2003, entitled "Method for Synchronizing Audio Playback between Multiple Networked Devices," and began offering a media playback system for sale in 2005. The Sonos Wireless HiFi System enables people to experience music from many sources via one or more networked playback devices. Through a software control application installed on a smartphone, tablet, or computer, a person can play the content he or she desires in any room that has a networked playback device. Additionally, using a controller, for example, different songs can be streamed to each room with a playback device, rooms can be grouped together for synchronous playback, or the same song can be heard in all rooms simultaneously.
Given the ever-growing interest in digital media, there remains a continuing need to develop consumer-accessible technologies that further enhance the listening experience.
Drawings
The features, aspects, and advantages of the presently disclosed technology may be better understood with reference to the following description, appended claims, and accompanying drawings in which:
FIG. 1A is a partial cross-sectional view of an environment having a media playback system configured in accordance with aspects of the disclosed technology;
FIG. 1B is a schematic diagram of the media playback system and one or more networks of FIG. 1A;
FIG. 2A is a functional block diagram of an example playback device;
fig. 2B is an isometric view of an example playback device including a network microphone device;
FIGS. 3A-3D are diagrams illustrating example zones and zone groups in accordance with aspects of the present disclosure;
fig. 3E and 3F are diagrams illustrating example speech inputs for calibrating a bonded stereo pair of a playback device according to aspects of the present disclosure;
FIG. 4A is a functional block diagram of an example controller device according to aspects of the present disclosure;
FIGS. 4B and 4C are controller interfaces according to aspects of the present disclosure;
fig. 5A is a functional block diagram of an example network microphone device in accordance with aspects of the present disclosure;
FIG. 5B is a diagram of an example speech input according to aspects of the present disclosure;
FIG. 6 is a functional block diagram of an example remote computing device according to aspects of the present disclosure;
FIG. 7 is a schematic diagram of an example network system in accordance with aspects of the present disclosure;
Fig. 8 (including fig. 8A-8H) is an example process flow for associating a voice assistant service with one or more playback devices of a media playback system in accordance with aspects of the present disclosure;
FIG. 9A illustrates a bundled pair of playback devices, each playback device associated with a different voice assistant service, in accordance with aspects of the disclosure;
FIG. 9B illustrates a binding locale of four playback devices, each playback device associated with a different voice assistant service, in accordance with aspects of the present disclosure;
FIGS. 9C-9F illustrate example user interfaces for managing VASs associated with particular playback devices of a bonded zone in accordance with aspects of the present disclosure;
FIG. 10 illustrates an example method of interacting with two different voice assistant services via a bundled playback device in accordance with aspects of the disclosure;
FIG. 11 is a flowchart of an example process flow for associating a stereo pair of a playback device with a voice assistant service, in accordance with aspects of the present disclosure; and
fig. 12A-12C are tables with example voice input commands and related information in accordance with aspects of the present disclosure.
The drawings are for purposes of illustrating example embodiments, it being understood, however, that the invention is not limited to the arrangements and instrumentality shown in the drawings. In the drawings, like reference numerals identify at least substantially similar elements. To facilitate discussion of any particular element, one or more most significant digits of any reference number refer to the figure in which that element is first introduced. For example, element 103a is first introduced and discussed with reference to FIG. 1A.
Detailed Description
II. Overview
Voice control may be beneficial for a "smart" home having smart appliances and related devices, such as wireless lighting devices, home automation devices (e.g., thermostats, door locks, etc.), and audio playback devices. In some implementations, a networked microphone device may be used to control smart home devices. A network microphone device will typically include a microphone for receiving voice input. The network microphone device may forward the voice input to a voice assistant service (VAS), such as AMAZON's ALEXA, APPLE's SIRI, MICROSOFT's CORTANA, GOOGLE's Assistant, etc. The VAS may be a remote service implemented by cloud servers to process voice inputs. The VAS may process the voice input to determine the intent of the voice input. Based on a response from the VAS, the network microphone device may cause one or more smart devices to perform an action. For example, the network microphone device may instruct a lighting device to turn on or off based on the response to an instruction from the VAS.
The voice input detected by the network microphone device will typically include a wake word followed by an utterance containing a user request. The wake word is typically a predetermined word or phrase used to "wake up" and invoke the VAS to interpret the intent of the voice input. For example, when querying AMAZON's ALEXA, the user may speak the wake word "Alexa". Other examples include "Ok, Google" for invoking GOOGLE's Assistant, "Hey, Siri" for invoking APPLE's SIRI, and "Hey, Sonos" for a VAS offered by SONOS. In various embodiments, a wake word may also be referred to, for example, as an activation word, a trigger word, or a wake-up word or phrase, and may take the form of any suitable word, a combination of words (such as a phrase), and/or an audio cue indicating that the network microphone device and/or an associated VAS is to invoke an action.
The network microphone device listens for a user request or command in the voice input that accompanies the wake word. In some cases, the user request may include a command to control a third-party device, such as a thermostat (e.g., a NEST thermostat), a lighting device (e.g., a PHILIPS HUE lighting device), or a media playback device (e.g., a SONOS playback device). For example, the user may speak the wake word "Alexa" followed by the utterance "set the thermostat to 68 degrees" to set the temperature in the home using AMAZON's ALEXA VAS. The user may speak the same wake word followed by the utterance "turn on the living room lights" to turn on lighting devices in the living room area of the home. The user may similarly speak a wake word and then request that a particular song, album, or playlist of music be played on a playback device in the home.
The VAS can employ Natural Language Understanding (NLU) systems to process speech input. NLU systems typically require multiple remote servers programmed to detect the potential intent of a given voice input. For example, the server may maintain language dictionaries, parsers, grammatical and semantic rules, and related processing algorithms to determine the intent of the user.
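By way of illustration only, the following Python sketch shows one simplified way a voice-input pipeline of this general kind might route a detected wake word and the utterance that follows it to the corresponding VAS and act on the returned intent. The function names, the wake-word table, and the canned "NLU" responses are hypothetical stand-ins for the remote processing described above, not the patented implementation.

```python
# Hypothetical sketch of an NMD voice-input pipeline: wake-word detection,
# utterance extraction, VAS intent resolution, and local action dispatch.
# Names and structures are illustrative assumptions, not the claimed design.

WAKE_WORD_TO_VAS = {
    "alexa": "AMAZON_VAS",
    "ok google": "GOOGLE_VAS",
    "hey siri": "APPLE_VAS",
    "hey sonos": "SONOS_VAS",
}

def detect_wake_word(audio_text: str):
    """Return (vas_id, remaining utterance) if the input begins with a known wake word."""
    lowered = audio_text.lower()
    for wake_word, vas_id in WAKE_WORD_TO_VAS.items():
        if lowered.startswith(wake_word):
            return vas_id, audio_text[len(wake_word):].strip(" ,")
    return None, None

def send_to_vas(vas_id: str, utterance: str) -> dict:
    """Stand-in for a cloud NLU call; a real VAS would return a structured intent."""
    if "thermostat" in utterance:
        return {"intent": "set_temperature", "value": 68}
    if "play" in utterance:
        return {"intent": "play_media", "query": utterance.replace("play", "").strip()}
    return {"intent": "unknown"}

def handle_voice_input(audio_text: str):
    vas_id, utterance = detect_wake_word(audio_text)
    if vas_id is None:
        return  # no wake word detected, ignore the input
    intent = send_to_vas(vas_id, utterance)
    print(f"{vas_id} resolved {utterance!r} -> {intent}")

handle_voice_input("Alexa, set the thermostat to 68 degrees")
handle_voice_input("Ok Google, play Hey Jude by The Beatles")
```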
Managing associations between various playback devices and one or more corresponding VASs can be difficult. For example, while a user may wish to use multiple VASs within their home, it may not be possible or preferable to associate a single playback device with more than one VAS. This may be due to limitations in the processing power and memory required to run multiple wake-word detection algorithms on a single device, as well as limitations imposed by one or more of the VASs. As a result, for any particular playback device, the user may be required to select a single VAS to the exclusion of any other VAS.
In some embodiments, a playback device may be purchased with a pre-associated VAS. In this case, the user may wish to replace the pre-associated VAS with another VAS of the user's choosing. For example, if the user purchases a playback device pre-associated with AMAZON's ALEXA, the user may wish to instead associate the playback device with GOOGLE's Assistant and disable ALEXA on that playback device. In addition, certain voice-enabled playback devices may be sold without any pre-associated VAS, in which case the user may wish to manage the selection of a particular VAS and its association with the playback device.
The systems and methods detailed herein address the above-described challenges of managing associations between one or more playback devices and one or more VASs. In particular, systems and methods are provided for allowing a user to select a VAS from a plurality of VASs to associate with one or more playback devices of a media playback system.
As described in more detail below, in some cases two or more playback devices, each associated with a different VAS, may be bonded together to form a bonded zone. For example, a first playback device and a second playback device may be bonded to form a stereo pair. In this case, the bonded pair of devices may be presented to the media playback system as a single user interface (UI) entity. When displayed to a user via a user interface (e.g., a UI displayed on a screen of a controller device), the bonded pair may be displayed as a single "device" to be controlled. Individual playback devices of the bonded zone may be associated with different VASs. For example, the first playback device may be associated with AMAZON's ALEXA, while the second playback device of the bonded zone is associated with GOOGLE's Assistant. As a result, the single "device" or UI entity presented to the media playback system may effectively be associated with two different VASs. This allows a user to interact with a single UI entity (i.e., the bonded zone, which appears as a single device via the media playback system), which in turn can interact with two different VASs. For example, the user may use a first wake word (e.g., "Alexa") to interact with AMAZON's ALEXA via voice input, and alternately use a second wake word (e.g., "Ok Google") to interact with GOOGLE's Assistant via voice input. Thus, even if an individual playback device cannot be associated with multiple VASs, a user can access multiple VASs through a single UI entity via the bonded zone. This advantageously allows the user to realize the benefits of multiple VASs (where each VAS may be advantageous in different respects), rather than requiring the user to limit their interactions to a single VAS to the exclusion of any other VAS.
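As a purely illustrative sketch (all class names, field names, and VAS/wake-word pairings below are assumptions, not the claimed implementation), a bonded zone of this kind can be thought of as a single UI entity whose member devices each carry at most one VAS association, with a spoken wake word routed to whichever member is bound to the matching service:

```python
# Hypothetical model of a bonded zone whose member playback devices are each
# associated with a single, different VAS. The zone is exposed as one UI
# entity, but wake words are routed to the member bound to the matching VAS.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PlaybackDevice:
    name: str
    vas: Optional[str] = None        # at most one VAS per individual device
    wake_word: Optional[str] = None  # wake word that invokes that VAS

@dataclass
class BondedZone:
    ui_name: str                               # presented as a single "device"
    members: List[PlaybackDevice] = field(default_factory=list)

    def route_wake_word(self, spoken: str) -> Optional[PlaybackDevice]:
        """Return the member device whose associated VAS owns this wake word."""
        for device in self.members:
            if device.wake_word == spoken.lower():
                return device
        return None

left = PlaybackDevice("Left", vas="AMAZON ALEXA", wake_word="alexa")
right = PlaybackDevice("Right", vas="GOOGLE Assistant", wake_word="ok google")
stereo = BondedZone("Living Room (Stereo Pair)", members=[left, right])

for phrase in ("Alexa", "Ok Google"):
    target = stereo.route_wake_word(phrase)
    print(f"'{phrase}' handled by {target.name} via {target.vas}")
```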
In some embodiments, the bonded zone may include devices associated with three or more voice assistants. For example, in the context of a home theater in which five devices are bonded into a single zone, a left-channel playback device may be associated with AMAZON's ALEXA, a right-channel device may be associated with MICROSOFT's CORTANA, and a center-channel playback device may be associated with GOOGLE's Assistant. In another example, the left- and right-channel devices may be associated with a first VAS (e.g., AMAZON's ALEXA) and the center channel may be associated with a second VAS (e.g., GOOGLE's Assistant).
Although some embodiments described herein may refer to functionality performed by a given participant, such as a "user" and/or other entity, it should be understood that this description is for illustrative purposes only. The claims should not be interpreted as requiring any such example participant-implemented actions unless expressly required by the claim's own language.
Example Operating Environment
FIGS. 1A and 1B illustrate example configurations of a media playback system 100 (or "MPS 100") in which one or more embodiments disclosed herein may be implemented. Referring first to FIG. 1A, MPS 100 is shown associated with an example home environment having multiple rooms and spaces, which may be collectively referred to as a "home environment" or "environment 101". The environment 101 comprises a household having several rooms, spaces, and/or playback zones, including a primary bathroom 101a, a primary bedroom 101b (referred to herein as "Nick's Room"), a secondary bedroom 101c, a family room or study 101d, an office 101e, a living room 101f, a dining room 101g, a kitchen 101h, and an outdoor patio 101i. Although certain embodiments and examples are described below in the context of a home environment, the techniques described herein may be implemented in other types of environments. For example, in some embodiments MPS 100 may be implemented in one or more commercial settings (e.g., a restaurant, shopping center, airport, hotel, retail store, or other store), one or more vehicles (e.g., sport utility vehicles, buses, cars, boats, yachts, aircraft), multiple environments (e.g., a combination of home and vehicle environments), and/or other suitable environments in which multi-zone audio may be desired.
Within these rooms and spaces, MPS 100 includes one or more computing devices. Referring collectively to fig. 1A and 1B, such computing devices may include playback devices 102 (each identified as playback devices 102a-102 n), network microphone devices 103 (each identified as "NMD"103a-103 i), and controller devices 104a and 104B (collectively referred to as "controller devices 104"). The home environment may include additional and/or other computing devices, including local network devices, such as one or more intelligent lighting devices 108 (fig. 1B), intelligent thermostats 110, and local computing devices 105 (fig. 1A).
Referring to FIG. 1B, the various playback devices, network microphone devices, and controller devices 102-104 and/or other network devices of MPS 100 may be coupled to one another via point-to-point connections and/or via other connections, which may be wired and/or wireless, over a LAN 111 including a network router 109. For example, the playback device 102j (which may be designated as "Left") in the study 101d (FIG. 1A) may have a point-to-point connection with the playback device 102a (which may be designated as "Right") in the study 101d. In one embodiment, the Left playback device 102j may communicate with the Right playback device 102a via that point-to-point connection. In a related embodiment, the Left playback device 102j may communicate with other network devices via the point-to-point connection and/or via other connections of the LAN 111.
As further shown in fig. 1B, in some embodiments MPS 100 is coupled to one or more remote computing devices 106, which remote computing devices 106 may include different groups of remote computing devices 106a-106c associated with various services, including a voice assistant service ("VAS"), a media content service ("MCS"), and/or a service for supporting the operation of MPS 100 via Wide Area Network (WAN) 107. In some embodiments, the remote computing device may be a cloud server. The remote computing device 106 may be configured to interact with the computing devices in the environment 101 in a variety of ways. For example, the remote computing device 106 may be configured to facilitate streaming and controlling playback of media content, such as audio, in a home environment. In one aspect of the technology described in greater detail below, the various playback devices, network microphone devices, and/or controller devices 102-104 are coupled to at least one remote computing device associated with a VAS and at least one remote computing device associated with a media content service. Moreover, as described in more detail below, in some embodiments, various playback devices, network microphone devices, and/or controller devices 102-104 may be coupled to several remote computing devices each associated with a different VAS and/or to multiple remote computing devices associated with multiple different media content services.
In some embodiments, the one or more playback devices 102 may include an on-board (e.g., integrated) network microphone device. For example, the playback devices 102a-e include corresponding NMDs 103a-e, respectively. Playback devices that include a network microphone device may be interchangeably referred to herein as playback devices or network microphone devices unless otherwise indicated in the specification.
In some embodiments, one or more NMDs 103 may be stand-alone devices. For example, NMDs 103f and 103g may be stand-alone network microphone devices. The separate network microphone device may omit components typically included in playback devices, such as speakers or related electronic devices. In this case, the stand-alone network microphone device may not produce audio output or may produce limited audio output (e.g., relatively low quality audio output).
In use, a network microphone device may receive and process voice input from a user in its vicinity. For example, the network microphone device may capture voice input upon detecting spoken input from a user. In the example shown, the NMD 103d of the playback device 102d in the living room can capture voice input of a user in its vicinity. In some cases, other network microphone devices in the vicinity of the voice input source (e.g., NMDs 103f and 103i) may also detect the voice input. In such cases, the network microphone devices may arbitrate among one another to determine which device(s) should capture and/or process the detected voice input. Examples of methods for selecting and arbitrating between network microphone devices may be found, for example, in U.S. Application No. 15/438,749, entitled "Voice Control of a Media Playback System," filed on February 21, 2017, the contents of which are incorporated herein by reference in their entirety.
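For illustration only, the snippet below sketches one simple arbitration rule: the NMD reporting the strongest detection metric (here a hypothetical signal-to-noise ratio) wins the right to capture and process the input. The actual selection and arbitration methods are those described in U.S. Application No. 15/438,749; this is merely a conceptual example with assumed names and values.

```python
# Hypothetical arbitration between nearby NMDs that all detected the same
# voice input: the device with the best detection metric captures the input.

def arbitrate(detections):
    """detections: list of (nmd_name, snr_db); return the winning NMD's name."""
    if not detections:
        return None
    return max(detections, key=lambda d: d[1])[0]

reports = [("NMD 103d (Living Room)", 21.5),
           ("NMD 103f (Island)", 14.2),
           ("NMD 103i", 9.8)]
print("Capturing device:", arbitrate(reports))  # -> NMD 103d (Living Room)
```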
In some embodiments, the network microphone device may be assigned to a playback device that may not include a network microphone device. For example, the NMD 103f may be assigned to playback devices 102i and/or 102l in its vicinity. In a related example, the network microphone device may output audio through a playback device assigned thereto. Additional details regarding the association of a network microphone device and a playback device as designated devices or default devices can be found, for example, in previously referenced U.S. patent application No.15/438,749.
In use, the network microphone devices 103 are configured to interact with a voice assistant service (VAS), such as a first VAS 160 hosted by one or more remote computing devices 106a. For example, as shown in FIG. 1B, the NMD 103f is configured to receive voice input 121 from a user 123. The NMD 103f sends data associated with the received voice input 121 to the remote computing devices 106a of the first VAS 160, which are configured to (i) process the received voice input data and (ii) send a corresponding command to MPS 100. For example, in some aspects, the remote computing devices 106a include modules and/or servers of a VAS (e.g., a VAS operated by one or more of AMAZON, GOOGLE, APPLE, MICROSOFT, or SONOS). The remote computing devices 106a can receive voice input data from the NMD 103f, for example, via the LAN 111 and the router 109. In response to receiving the voice input data, the remote computing devices 106a process the voice input data (i.e., "Play Hey Jude by The Beatles") and may determine that the processed voice input includes a command to play a song (e.g., "Hey Jude"). In response, one of the computing devices 106a of the first VAS 160 sends the command to one or more remote computing devices associated with MPS 100 (e.g., remote computing device 106d). In this example, the first VAS 160 may send MPS 100 a command to play "Hey Jude" by The Beatles. As described below, MPS 100 may in turn query a plurality of suitable media content services ("MCSs") 167 for the media content, for example, by sending a request to a first MCS hosted by one or more first remote computing devices 106b and a second MCS hosted by one or more second remote computing devices 106c. For example, in some aspects, the remote computing devices 106b and 106c include modules and/or servers of the respective MCSs (e.g., MCSs operated by streaming services such as SPOTIFY or AMAZON MUSIC).
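As an illustrative walk-through only (the function names, catalog, and URI format below are hypothetical), the following sketch mirrors that flow: the VAS resolves the voice input into a playback command, and the media playback system then queries several MCSs in turn for matching content.

```python
# Simplified, hypothetical walk-through of the flow described above:
# (1) the VAS returns a playback command for the forwarded voice input, and
# (2) the MPS queries multiple media content services (MCSs) for the content.

def vas_process(voice_input: str) -> dict:
    """Stand-in for the remote computing devices 106a of the first VAS 160."""
    # A real VAS would run speech recognition and NLU; here the intent is canned.
    return {"command": "play", "track": "Hey Jude", "artist": "The Beatles"}

def query_mcs(mcs_name: str, track: str, artist: str):
    """Stand-in for a content lookup against one media content service."""
    catalog = {"MCS-1": {("Hey Jude", "The Beatles"): "mcs1://track/123"}}
    return catalog.get(mcs_name, {}).get((track, artist))

def handle_play_request(voice_input: str):
    cmd = vas_process(voice_input)                       # VAS returns a command
    for mcs in ("MCS-1", "MCS-2"):                       # query several MCSs in turn
        uri = query_mcs(mcs, cmd["track"], cmd["artist"])
        if uri:
            print(f"Playing '{cmd['track']}' from {mcs}: {uri}")
            return
    print("Requested content not found on any configured MCS")

handle_play_request("Play Hey Jude by The Beatles")
```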
Other aspects related to the different components of the example MPS 100 and how the different components interact to provide a media experience to a user may be found in the following sections. Although the discussion herein may generally refer to the example MPS 100, the techniques described herein are not limited to application in a home environment as shown in fig. 1A. For example, the techniques described herein may be useful in other home environment configurations including more or fewer devices of any of the playback devices, network microphone devices, and/or controller devices 102-104. For example, the techniques herein may be used in an environment containing a single playback device 102 and/or a single network microphone device 103. In this case, LAN 111 may be eliminated and the single playback device 102 and/or single network microphone device 103 may communicate directly with remote computing devices 106 a-d. In some embodiments, a telecommunications network (e.g., an LTE network, a 5G network) may communicate with various playback devices, network microphone devices, and/or controller devices 102-104 independent of a LAN.
g. Example Playback Devices and Network Microphone Devices
Fig. 2A is a functional block diagram illustrating certain aspects of a selected one of the playback devices 102 shown in fig. 1A. As shown, such playback devices may include a processor 212, a software component 214, a memory 216, an audio processing component 218, an audio amplifier 220, a speaker 222, and a network interface 230 including a wireless interface 232 and a wired interface 234. In some embodiments, the playback device may not include a speaker 222, but rather a speaker interface for connecting the playback device to an external speaker. In some embodiments, the playback device may include neither the speaker 222 nor the audio amplifier 220, but rather an audio interface for connecting the playback device to an external audio amplifier or audiovisual receiver.
The playback device may also include a user interface 236. The user interface 236 may facilitate user interaction independent of or in conjunction with one or more controller devices 104. In various embodiments, the user interface 236 includes one or more of physical buttons and/or graphical interfaces disposed on a touch-sensitive screen and/or surface, among other things, for a user to provide input directly. The user interface 236 may also include one or more of lights and speakers to provide visual and/or audio feedback to the user.
In some embodiments, the processor 212 may be a clock-driven computing component configured to process input data according to instructions stored in the memory 216. Memory 216 may be a tangible computer-readable medium configured to store instructions executable by processor 212. For example, the memory 216 may be a data storage device that may be loaded with one or more software components 214 executable by the processor 212 to perform certain functions. In one example, the functionality may involve the playback device retrieving audio data from an audio source or another playback device. In another example, the functionality may involve the playback device sending audio data to another device over a network. In yet another example, the functionality may involve pairing a playback device with one or more other playback devices to create a multi-channel audio environment.
Some functions may involve a playback device synchronizing playback of audio content with one or more other playback devices. During synchronized playback, the listener may not be aware of the delay difference between playback of the audio content by the synchronized playback device. Some examples of audio playback synchronization between playback devices are provided in more detail in U.S. patent No.8,234,395, entitled "System and method for synchronizing operations among a plurality of independently clocked digital data processing devices," filed 4/2004, the contents of which are incorporated herein by reference in their entirety.
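For illustration only, the sketch below shows the general idea of scheduling synchronized playback against a common time reference so that grouped devices start the same audio at the same moment; the names and the use of a local monotonic clock are assumptions, and the actual synchronization techniques are those detailed in U.S. Patent No. 8,234,395.

```python
# Conceptual sketch of scheduling synchronized group playback: every member
# of the group is given the same absolute start time against a shared clock.

import time

def schedule_group_playback(members, lead_time_s=0.5):
    """Return the same absolute start time for every member of the group."""
    # time.monotonic() stands in for a clock reference shared by the group;
    # real devices would exchange timing information over the network.
    start_at = time.monotonic() + lead_time_s
    return {member: start_at for member in members}

schedule = schedule_group_playback(["Bed1 (102f)", "Bed2 (102g)"])
for device, start in schedule.items():
    print(f"{device} begins playback at t={start:.3f}s")
```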
The audio processing component 218 can include one or more digital-to-analog converters (DACs), audio preprocessing components, audio enhancement components, Digital Signal Processors (DSPs), or the like. In some embodiments, one or more of the audio processing components 218 may be a subcomponent of the processor 212. In one example, the audio content may be processed and/or intentionally altered by the audio processing component 218 to produce an audio signal. The resulting audio signal may then be provided to the audio amplifier 220 for amplification and playback through the speaker 222. In particular, the audio amplifier 220 may include a device configured to amplify the audio signal to a level for driving the one or more speakers 222. The speaker 222 may include an individual transducer (e.g., a "driver") or an integral speaker system incorporating one or more drivers. Particular drivers of the speaker 222 may include, for example, a subwoofer (e.g., for low frequencies), a mid-range driver (e.g., for middle frequencies), and/or a tweeter (e.g., for high frequencies). In some cases, each transducer of the one or more speakers 222 may be driven by a respective individual audio amplifier of the audio amplifier 220. In addition to generating analog signals for playback, the audio processing component 218 may be configured to process audio content to send it to one or more other playback devices for playback.
Audio content to be processed and/or played back by the playback device may be received from an external source, for example, via an audio line-in input connection (e.g., an auto-detect 3.5mm audio line-in connection) or network interface 230.
The network interface 230 may be configured to facilitate data flow between the playback device and one or more other devices on the data network. As such, the playback device may be configured to receive audio content over the data network from one or more other playback devices in communication with the playback device, a network device within a local area network, or an audio content source over a wide area network such as the internet. In one example, audio content and other signals transmitted and received by a playback device may be transmitted in the form of digital packet data containing an Internet Protocol (IP) based source address and an IP based destination address. In this case, the network interface 230 may be configured to parse the digital packet data so that the data destined for the playback device is properly received and processed by the playback device.
As shown, the network interface 230 may include a wireless interface 232 and a wired interface 234. The wireless interface 232 may provide network interface functionality for the playback device to wirelessly communicate with other devices (e.g., other playback devices, speakers, receivers, network devices, or control devices within a data network with which the playback device is associated) in accordance with a communication protocol (e.g., any wireless standard, including IEEE 802.11a, 802.11b, 802.11g, 802.11n, 802.11ac, 802.15, 4G mobile communication standards, etc.). The wired interface 234 may provide network interface functionality for the playback device to communicate with other devices over a wired connection in accordance with a communication protocol (e.g., IEEE 802.3). Although the network interface 230 shown in FIG. 2A includes both a wireless interface 232 and a wired interface 234, in some embodiments the network interface 230 may include only a wireless interface or only a wired interface.
As described above, the playback device may include a network microphone device, such as one of the NMDs 103 shown in fig. 1A. The network microphone devices may share some or all of the components of the playback device, such as the processor 212, memory 216, microphone 224, and the like. In other examples, the network microphone device includes components specific to operational aspects of the network microphone device. For example, the network microphone device may include far-field microphone and/or voice processing components, and in some cases, the playback device may not include these components. In another example, the network microphone device may include a touch sensitive button for enabling/disabling the microphone. In yet another example, as described above, the network microphone device may be a stand-alone device. Fig. 2B is an isometric view illustrating an example playback device 202 incorporating a network microphone device. The playback device 202 has a control area 237 at the top of the device for enabling/disabling the microphone. The control region 237 is adjacent to another region 239 at the top of the device for controlling playback.
For example, SONOS, Inc. currently offers (or has offered) for sale certain playback devices, including "PLAY:1," "PLAY:3," "PLAY:5," "PLAYBAR," "CONNECT:AMP," "CONNECT," and "SUB." The playback devices of the example embodiments disclosed herein may additionally or alternatively be implemented using any other past, present, and/or future playback devices. In addition, it should be understood that a playback device is not limited to the examples shown in FIG. 2A or to the SONOS product offerings. For example, a playback device may include wired or wireless headphones. In another example, a playback device may include or interact with a docking station for a personal mobile media playback device. In yet another example, a playback device may be integrated into another device or component (e.g., a television, a lighting fixture, or some other device for indoor or outdoor use).
h. Example Playback Device Configurations
Fig. 3A-3D illustrate example configurations of playback devices in a region and region group. Referring first to fig. 3D, in one example, a single playback device may belong to a region. For example, playback device 102c on a deck may belong to region a. In some implementations described below, multiple playback devices may be "bundled" to form a "bundled pair," which together form a single zone. For example, the playback device 102f (FIG. 1A) named "Bed1" in FIG. 3D may be bound to the playback device 102g (FIG. 1A) named "Bed2" in FIG. 3D to form zone B. The bound playback devices may have different playback responsibilities (e.g., channel responsibilities).
Each zone in MPS 100 may be provided as a single User Interface (UI) entity for control. For example, region a may be provided as a single entity named "terrace". Region C may be provided as a single entity named "living room". Region B may be provided as a single entity named "stereo".
In various embodiments, a zone may employ the name of one of the playback devices belonging to the zone. For example, region C may take the name of living room device 102m (as shown). In another example, region C may take the name of the book case device 102 d. In yet another example, region C may employ a name that is some combination of the bookcase device 102d and the living room device 102 m. The selected name may be selected by the user. In some embodiments, a region may be given a different name than the devices belonging to the region. For example, region B is named "stereo", but all devices in region B do not have this name.
The bound playback devices may have different playback responsibilities, such as responsibility for a particular audio channel. For example, as shown in FIG. 3A, the Bed1 device 102f and the Bed2 device 102g may be bound to produce or enhance a stereo effect of the audio content. In this example, the Bed1 playback device 102f may be configured to play back the left channel audio component, while the Bed2 playback device 102g may be configured to play back the right channel audio component. In some implementations, such stereo binding may be referred to as "pairing."
In addition, the bound playback devices may have additional and/or different respective speaker drivers. As shown in FIG. 3B, a playback device 102b named "Front" may be bound to a playback device 102k named "SUB". The Front device 102b may render a mid-to-high frequency range and the SUB device 102k may render low frequencies (e.g., as a subwoofer). When unbound, the Front device 102b may render the entire frequency range. As another example, FIG. 3C shows the Front device 102b and SUB device 102k further bound to a right playback device 102a and a left playback device 102j, respectively. In some implementations, the right device 102a and the left device 102j may form the surround or "satellite" channels of a home theater system. The bound playback devices 102a, 102b, 102j, and 102k may form a single zone D (FIG. 3D).
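For illustration only, the following sketch maps bound devices to channel responsibilities for the two example configurations just described; the role names and the helper function are hypothetical and not part of the claimed system.

```python
# Hypothetical assignment of playback responsibilities within bonded zones:
# a stereo pair splits left/right channels, while a home-theater bond adds
# front, subwoofer, and surround ("satellite") roles.

def assign_channels(zone_type, devices):
    """Map bonded devices to channel responsibilities for a given zone type."""
    roles = {
        "stereo_pair": ["LEFT", "RIGHT"],
        "home_theater": ["FRONT", "SUB", "SURROUND_RIGHT", "SURROUND_LEFT"],
    }[zone_type]
    if len(devices) != len(roles):
        raise ValueError("device count does not match the bonded configuration")
    return dict(zip(devices, roles))

print(assign_channels("stereo_pair", ["Bed1 (102f)", "Bed2 (102g)"]))
print(assign_channels("home_theater",
                      ["Front (102b)", "SUB (102k)", "Right (102a)", "Left (102j)"]))
```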
In some embodiments, playback devices in a bound zone may be calibrated together and simultaneously, rather than separately. For example, the bound zone may be calibrated together as a single entity using calibration software such as SONOS's Trueplay. This is in contrast to playback devices that are merely grouped together, which may be calibrated before or after the group is formed. In a related embodiment, as shown in FIGS. 3E and 3F, binding playback devices together may cause MPS 100 and/or the VAS 160 to initiate a multi-turn command or other command to calibrate the playback devices. In one example, MPS 100 or the VAS 160 may initiate calibration after the Bed1 playback device 102f and the Bed2 playback device 102g are bound to form a stereo pair. For example, as shown in FIG. 3F, the VAS 160 may prompt calibration using software such as Trueplay.
In some embodiments, the standalone network microphone device may itself be in a region. For example, NMD 103h in fig. 1A is named "closet" and forms region E. The network microphone device may also be tied or merged with another device to form a locale. For example, an NMD device 103f named "island" may be bound with a playback device 102i "kitchen" that together form a zone G (also referred to as a "kitchen"). Additional details regarding the association of a network microphone device and a playback device as designated or default devices can be found, for example, in previously referenced U.S. patent application No.15/438,749. In some embodiments, the standalone network microphone device may not be associated with a region.
The locales of individual, bound and/or merged devices may be grouped to form a locality group. For example, referring to fig. 3D, region a may be grouped with region B to form a region group comprising two regions. As another example, region a may be grouped with one or more other regions C-I. The locales a-I may be grouped and ungrouped in a number of ways. For example, three, four, five, or more (e.g., all) of regions A-I may be grouped. As described in the previously cited U.S. patent No.8,234,395, the regions of individual and/or bound playback devices, when grouped, can play back audio in synchronization with each other. Playback devices may be dynamically grouped and ungrouped to form new groups or different groups that synchronously play back audio content.
In various embodiments, a zone group may be given a name that is a default name of a zone within the group or a combination of the names of zones within the group, such as "Dining Room + Kitchen," as shown in FIG. 3D. In some embodiments, a zone group may be given a unique name selected by the user, such as "Nick's Room," as also shown in FIG. 3D.
Referring again to fig. 2A, certain data may be stored in memory 216 as one or more state variables that are updated periodically and used to describe the state of the playback zone, playback device, and/or group of zones associated therewith. Memory 216 may also include data associated with the status of other devices of the media system and that is shared between the devices from time to time, such that one or more devices have up-to-date data associated with the system.
In some embodiments, the memory may store instances of various variable types associated with the state. The variable instance may be stored with an identifier (e.g., tag) corresponding to the type. For example, the specific identifier may be a first type "a1" for identifying playback devices of a zone, a second type "b1" for identifying playback devices that may be bound in the zone, and a third type "c1" for identifying a zone group to which the zone may belong. As a related example, in fig. 1A and 1B, an identifier associated with a deck may indicate that the deck is the only playback device for a particular zone and is not in a zone group. The identifier associated with the living room may indicate that the living room is not grouped with other locales, but includes the bound playback devices 102a, 102b, 102j, and 102k. The identifier associated with the dining room may indicate that the dining room is part of a dining room + kitchen group and that devices 103f and 102i are bound. Since the kitchen is part of a dining room + kitchen land group, the identifier associated with the kitchen may indicate the same or similar information. Other example locale variables and identifiers are described below.
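As a purely illustrative encoding (the dictionary layout and the specific device groupings are assumptions mirroring the examples above), state-variable instances tagged with those type identifiers might be represented as follows:

```python
# Hypothetical encoding of the state variables described above: "a1" identifies
# the playback device(s) of a zone, "b1" identifies devices bonded within the
# zone, and "c1" identifies the zone group (if any) the zone belongs to.

zone_state = [
    {"type": "a1", "zone": "Terrace", "playback_devices": ["102c"]},
    {"type": "c1", "zone": "Terrace", "zone_group": None},          # not grouped
    {"type": "a1", "zone": "Living Room",
     "playback_devices": ["102a", "102b", "102j", "102k"]},
    {"type": "b1", "zone": "Living Room",
     "bonded_devices": ["102a", "102b", "102j", "102k"]},
    {"type": "c1", "zone": "Living Room", "zone_group": None},
    {"type": "c1", "zone": "Dining Room", "zone_group": "Dining Room + Kitchen"},
    {"type": "c1", "zone": "Kitchen", "zone_group": "Dining Room + Kitchen"},
]

for record in zone_state:
    print(record)
```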
In yet another example, MPS 100 may include other associated variables or identifiers representing zones and zone groups, such as identifiers associated with areas, as shown in FIG. 3D. An area may involve a cluster of zone groups and/or zones that are not within a zone group. For example, FIG. 3D shows a first area named "First Area" and a second area named "Second Area". The first area includes the zones and zone groups of the terrace, study, dining room, kitchen, and bathroom. The second area includes the zones and zone groups of the bathroom, Nick's Room, bedroom, and living room. In one aspect, an area may be used to invoke a cluster of zone groups and/or zones that share one or more zones and/or zone groups with another cluster. In this respect, an area differs from a zone group, which does not share a zone with another zone group. Other examples of techniques for implementing areas may be found, for example, in U.S. Application No. 15/682,506, entitled "Room Association Based on Name," filed on August 21, 2017, and U.S. Patent No. 8,483,853, entitled "Controlling and manipulating groupings in a multi-zone media system," filed on September 11, 2007. The contents of each of these applications are incorporated herein by reference in their entirety. In some embodiments, MPS 100 may not implement areas, in which case the system may not store variables associated with areas.
The memory 216 may also be configured to store other data. Such data may pertain to an audio source accessible by the playback device, or to a playback queue with which the playback device (or some other playback device(s)) may be associated. In the embodiments described below, the memory 216 is configured to store a set of command data for selecting a particular VAS when processing voice input.
During operation, one or more playback zones in the environment of fig. 1A may each be playing back different audio content. For example, a user may be grilling in a terrace region and listening to hip-hop music played back by the playback device 102c, while another user may be preparing food in a kitchen region and listening to classical music played back by the playback device 102 i. In another example, one playback zone may play back the same audio content in synchronization with another playback zone. For example, the user may be in an office area where the playback device 102n is playing back the same hip-hop music as is being played back by the playback device 102c in a terrace area. In this case, the playback devices 102c and 102n can play back the hip-hop music synchronously, so that the user can enjoy the audio content played back aloud seamlessly (or at least substantially seamlessly) when moving between different playback regions. As described in the previously cited U.S. patent No.8,234,395, synchronization between playback zones may be achieved in a manner similar to synchronization between playback devices.
As described above, the regional configuration of MPS 100 can be dynamically modified. Thus, MPS 100 may support a variety of configurations. For example, if a user physically moves one or more playback devices into or out of a region, MPS 100 may be reconfigured to accommodate such changes. For example, if the user physically moves playback device 102c from the terrace region to the office region, the office region may now include both playback devices 102c and 102 n. In some cases, the user may pair or group the moved playback device 102c with the office area and/or rename the players in the office area, for example, using one of the controller device 104 and/or voice input. As another example, if one or more playback devices 102 are moved to a particular region in the home environment that is not yet a playback zone, the moved playback devices may be renamed or associated with the playback zone for that particular region.
Further, different playback zones of MPS 100 can be dynamically combined into zone groups or divided into separate playback zones. For example, the dining room zone and the kitchen zone may be combined into a zone group for a dinner party such that the playback devices 102i and 102l may render audio content in synchrony. As another example, the bound playback devices 102 in the study zone may be divided into (i) a television zone and (ii) a separate listening zone. The television zone may include the Front playback device 102b. The listening zone may include the right playback device 102a, the left playback device 102j, and the SUB playback device 102k, which may be grouped, paired, or merged as described above. Splitting the study zone in this manner may allow one user to listen to music in the listening zone in one region of the living room space and allow another user to watch television in another region of the living room space. In a related example, a user may use either of the NMDs 103a or 103b (FIG. 1B) to control the study zone before it is separated into the television zone and the listening zone. Once separated, the listening zone may be controlled, for example, by a user in the vicinity of the NMD 103a, and the television zone may be controlled, for example, by a user in the vicinity of the NMD 103b. However, as described above, any of the NMDs 103 can be configured to control the various playback devices and other devices of MPS 100.
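For illustration only, the sketch below models the dynamic grouping and splitting just described; the dictionaries, helper functions, and zone names are hypothetical stand-ins rather than the system's actual data model.

```python
# Hypothetical sketch of dynamically grouping zones into a zone group for
# synchronous playback, and splitting a bonded zone into separate zones.

zones = {
    "Dining Room": ["102l"],
    "Kitchen": ["102i"],
    "Study": ["102a", "102b", "102j", "102k"],   # bonded home-theater zone
}
zone_groups = {}

def group_zones(group_name, zone_names):
    """Form a zone group whose member zones render audio in synchrony."""
    zone_groups[group_name] = list(zone_names)

def split_zone(zone_name, new_zones):
    """Split one zone's devices into several new, separate playback zones."""
    devices = set(zones.pop(zone_name))
    for name, members in new_zones.items():
        assert set(members) <= devices, "can only reassign existing members"
        zones[name] = members

group_zones("Dining Room + Kitchen", ["Dining Room", "Kitchen"])
split_zone("Study", {"TV Zone": ["102b"],
                     "Listening Zone": ["102a", "102j", "102k"]})

print(zone_groups)   # {'Dining Room + Kitchen': ['Dining Room', 'Kitchen']}
print(zones)
```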
i. Example Controller Devices
FIG. 4A is a functional block diagram illustrating certain aspects of a selected one of the controller devices 104 of MPS 100 of FIG. 1A. Such a controller device may also be referred to as a controller. The controller device shown in FIG. 4A may include components substantially similar to certain components of the network devices described above, such as a processor 412, memory 416, a microphone 424, and a network interface 430. In one example, the controller device may be a dedicated controller of MPS 100. In another example, the controller device may be a network device on which media playback system controller application software may be installed, such as an iPhone™, an iPad™, or any other smart phone, tablet, or network device (e.g., a networked computer such as a PC or Mac™).
The memory 416 of the controller device may be configured to store controller application software and other data associated with MPS 100 and users of system 100. Memory 416 may be loaded with one or more software components 414 that are executable by processor 412 to perform certain functions (e.g., facilitating access, control, and configuration by a user of MPS 100). As described above, the controller device communicates with other network devices through a network interface 430, such as a wireless interface.
In one example, data and information (e.g., state variables) may be communicated between the controller device and other devices via the network interface 430. For example, playback zone and zone group configurations in MPS 100 may be received by the controller device from a playback device, a network microphone device, or another network device, or transmitted by the controller device to another playback device or network device via the network interface 430. In some cases, the other network device may be another controller device.
Playback device control commands such as volume control and audio playback control may also be communicated from the controller device to the playback device via the network interface 430. As described above, the change to the configuration of MPS100 may also be performed by a user using the controller device. Configuration changes may include adding/removing one or more playback devices to/from a zone, adding/removing one or more zones to/from a zone group, forming a bundled or consolidated player, separating one or more playback devices from a bundled or consolidated player, etc.
The user interface 440 of the controller device may be configured to facilitate user access and control of MPS 100 by providing a controller interface such as controller interfaces 440a and 440b shown in fig. 4B and 4C, respectively (which may be collectively referred to as controller interface 440). Referring collectively to fig. 4B and 4C, the controller interface 440 includes a playback control zone 442, a playback zone 443, a playback status zone 444, a playback queue zone 446, and a source zone 448. The illustrated user interface 400 is only one example of a user interface that may be disposed on a network device, such as the controller device shown in fig. 4A, and accessed by a user to control a media playback system, such as MPS 100. Alternatively, other user interfaces of different formats, styles, and interaction sequences may be implemented on one or more network devices to provide comparable control access to the media playback system.
Playback control zone 442 (fig. 4B) may include selectable (e.g., by touching or by using a cursor) icons to cause playback devices in a selected playback zone or zone group to play or pause, fast forward, fast reverse, skip to next, skip to previous, enter/exit a shuffle mode, enter/exit a repeat play mode, enter/exit a cross-fade mode. The playback control zone 442 may also include selectable icons or the like for modifying the equalization settings and playback volume.
The playback zone 443 (fig. 4C) can include a representation of the playback zone within MPS 100. The playback zone may also include a representation of a zone group, such as a dining room + kitchen zone group, as shown. In some embodiments, the graphical representation of the playback zone may be selectable to call up additional selectable icons to manage or configure the playback zone in the media playback system, such as creating a binding zone, creating a zone group, separating a zone group, renaming a zone group, and so forth.
For example, as shown, a "group" icon may be provided within each graphical representation of the playback zone. The "group" icon provided within the graphical representation of a particular zone may be selectable to call up an option for selecting one or more other zones in the media playback system to be grouped with that particular zone. Once grouped, playback devices in a zone that have been grouped with the particular zone will be configured to play audio content in synchronization with playback devices in the particular zone. Similarly, a "group" icon may be provided within the graphical representation of the regional group. In this case, the "group" icon may be selectable to call up an option to deselect one or more regions of the region group to remove from the region group. Other interactions and implementations for grouping and ungrouping regions via a user interface, such as user interface 400, are also possible. As the playback zone or zone group configuration is modified, the representation of the playback zone in the playback zone 443 (fig. 4C) may be dynamically updated.
Playback status region 444 (fig. 4B) may include a graphical representation of audio content in the selected playback zone or zone group that is currently being played, previously played, or scheduled for next play. The selected playback zone or zone group may be visually identified on the user interface, such as within the playback zone 443 and/or the playback status zone 444. The graphical representation may include track title, artist name, album year, track length, and other relevant information that may be useful to the user in controlling the media playback system via the user interface 440.
Playback queue area 446 may include a graphical representation of audio content in a playback queue associated with a selected playback zone or zone group. In some embodiments, each playback zone or zone group may be associated with a playback queue containing information corresponding to zero or more audio items to be played back by the playback zone or zone group. For example, each audio item in the playback queue may include a Uniform Resource Identifier (URI), a Uniform Resource Locator (URL), or some other identifier that may be used by a playback device in a playback zone or zone group to find and/or retrieve audio items from a local audio content source or a networked audio content source that are likely to be played back by the playback device.
In one example, a playlist may be added to the playback queue, in which case information corresponding to each audio item in the playlist may be added to the playback queue. In another example, the audio items in the playback queue may be saved as a playlist. In yet another example, the playback queue may be empty or filled but "unused" when the playback zone or zone group is continuously playing back streaming audio content (e.g., an internet broadcast that may continue to play until otherwise stopped) rather than discrete audio items having playback durations. In alternative embodiments, the playback queue may include internet broadcast and/or other streaming audio content items, and may be "in use" when the playback zone or zone group is playing back these items. Other examples are also possible.
When a playback zone or zone group is "grouped" or "ungrouped," the playback queue associated with the affected playback zone or zone group may be cleared or re-associated. For example, if a first playback zone including a first playback queue is grouped with a second playback zone including a second playback queue, the established zone group may have an associated playback queue that is initially empty, contains audio items from the first playback queue (e.g., if the second playback zone is added to the first playback zone), contains audio items from the second playback queue (e.g., if the first playback zone is added to the second playback zone), or contains a combination of audio items from both the first playback queue and the second playback queue. Subsequently, if the established zone group is ungrouped, the resulting first playback zone may be re-associated with the previous first playback queue, or with a new playback queue that is empty or contains audio items from the playback queue associated with the established zone group (prior to ungrouping the established zone group). Similarly, the resulting second playback zone may be re-associated with a previous second playback queue, or with a new playback queue that is empty or contains audio items from the playback queue associated with the established zone group (prior to the cancellation of the established zone group). Other examples are also possible.
Still referring to fig. 4B and 4C, the graphical representation of the audio content in the playback queue 446 (fig. 4C) may include track titles, artist names, track lengths, and other related information associated with the audio content in the playback queue. In one example, the graphical representation of the audio content may be selectable to call up additional selectable icons to manage and/or manipulate the playback queue and/or the audio content represented in the playback queue. For example, the represented audio content may be removed from the playback queue, moved to a different location in the playback queue, or selected for immediate play or play after any audio content currently being played, etc. The playback queues associated with a playback zone or zone group may be stored in memory on one or more playback devices in the playback zone or zone group, on playback devices not in the playback zone or zone group, and/or on some other designated device. Playback of such a playback queue may involve one or more playback devices playing back queued media items, possibly in sequential or random order.
The source zone 448 can include graphical representations of selectable audio content sources and selectable voice assistants associated with respective VASs. The VASs may be selectively assigned, as described in more detail below with respect to figs. 8-12C. In some examples, the same network microphone device may invoke multiple VASs, such as ALEXA of AMAZON, CORTANA of MICROSOFT, etc. In some embodiments, the user may assign a VAS exclusively to one or more network microphone devices. For example, a user may assign a first VAS to one or both of the NMDs 102a and 102b in the living room shown in figs. 1A and 1B, and a second VAS to the NMD 103f in the kitchen. Other examples are also possible.
j. Example Audio Content Sources
The audio sources in source zone 448 may be sources of audio content from which audio content may be retrieved and played by a selected playback zone or zone group. One or more playback devices in a zone or zone group may be configured to retrieve audio content from various available audio content sources (e.g., according to a respective URI or URL of the audio content) for playback. In one example, the playback device may retrieve audio content directly from a corresponding audio content source (e.g., a line-in connection). In another example, audio content may be provided to a playback device over a network via one or more other playback devices or network devices. As described in more detail below, in some embodiments, audio content may be provided by one or more media content services.
Example audio content sources may include memory of one or more playback devices in a media playback system such as MPS 100 of fig. 1A, a local music library on one or more network devices (e.g., a controller device, a network-enabled personal computer, or a network-attached storage device (NAS)), a streaming audio service that provides audio content via the internet (e.g., a cloud), or an audio source connected to a media playback system via a line-in input connection on a playback device or network device, etc.
In some embodiments, audio content sources may be added to or removed from a media playback system, such as MPS 100 of fig. 1A, periodically. In one example, indexing of audio items may be performed each time one or more audio content sources are added, removed, or updated. Indexing of audio items may involve scanning identifiable audio items in all folders/directories shared through a network accessible to playback devices in a media playback system, and generating or updating an audio content database containing metadata (e.g., title, artist, album, track length, etc.) and other relevant information (e.g., URI or URL of each identified audio item found). Other examples for managing and maintaining audio content sources are also possible.
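A simplified Python sketch of such an indexing pass is shown below. It walks shared folders, treats file names as placeholder metadata, and records a URI for each identified audio item; a real indexer would read embedded tags and handle updates incrementally. The extension list and field names are assumptions made for this sketch.

```python
import os
from typing import Dict, List

AUDIO_EXTENSIONS = {".mp3", ".flac", ".m4a", ".wav"}  # assumed set of identifiable item types

def index_audio_items(shared_roots: List[str]) -> List[Dict[str, str]]:
    """Scan shared folders for identifiable audio items and build a simple
    audio-content database of per-item metadata placeholders and URIs."""
    database: List[Dict[str, str]] = []
    for root in shared_roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if os.path.splitext(name)[1].lower() in AUDIO_EXTENSIONS:
                    path = os.path.join(dirpath, name)
                    database.append({
                        "title": os.path.splitext(name)[0],  # placeholder: a real system reads tags
                        "uri": "file://" + os.path.abspath(path),
                    })
    return database

if __name__ == "__main__":
    print(len(index_audio_items(["."])))  # number of audio items found under the current folder
```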
k. Example Network Microphone Devices
Fig. 5A is a functional block diagram illustrating additional features of one or more NMDs 103 in accordance with aspects of the present disclosure. The network microphone device shown in fig. 5A may include substantially similar components as certain components of the network microphone device described above, such as the processor 212 (fig. 2A), the network interface 230 (fig. 2A), the microphone 224, and the memory 216. Although not shown for clarity, the network microphone device may include other components, such as speakers, amplifiers, signal processors, as described above.
The microphone 224 may be a plurality of microphones arranged to detect sound in the environment of the network microphone device. In one example, microphone 224 may be arranged to detect audio from one or more directions relative to the network microphone device. Microphone 224 may be sensitive to a portion of the frequency range. In one example, a first subset of microphones 224 may be sensitive to a first frequency range and a second subset of microphones 224 may be sensitive to a second frequency range. Microphone 224 may also be arranged to capture positional information of an audio source (e.g., speech, audible sound) and/or to assist in filtering background noise. Notably, in some embodiments, the network microphone device may have a single microphone 224 rather than a plurality of microphones.
The network microphone device may also include a beamformer component 551, an Acoustic Echo Cancellation (AEC) component 552, a voice activity detector component 553, and/or a wake word detector component 554. In various embodiments, one or more of the components 551-556 may be sub-components of the processor 512.
The beamforming component 551 and AEC component 552 are configured to detect an audio signal and determine aspects of speech input in the detected audio, such as direction, amplitude, frequency spectrum, and the like. For example, the beamforming component 551 and AEC component 552 can be utilized in determining an approximate distance between a network microphone device and a user speaking into the network microphone device. In another example, a network microphone device may detect a relative proximity of a user to another network microphone device in a media playback system.
The voice activity detector component 553 is configured to cooperate closely with the beamforming component 551 and the AEC component 552 to capture sound from the direction in which voice activity is detected. Potential speech directions may be identified by monitoring metrics that distinguish speech from other sounds. Such metrics may, for example, include energy relative to background noise in the speech band and entropy in the speech band (which is a measure of spectral structure). Speech typically has a lower entropy than most common background noise.
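For illustration, the following Python sketch computes the two metrics mentioned above for a single audio frame: energy in a nominal speech band and the spectral entropy of that band. The band limits and frame handling are assumptions for this sketch rather than values taken from the disclosure.

```python
from typing import Tuple
import numpy as np

def speech_band_metrics(frame: np.ndarray, sample_rate: int,
                        band: Tuple[float, float] = (300.0, 3400.0)) -> Tuple[float, float]:
    """Return (band energy, spectral entropy) for one audio frame.
    Higher band energy relative to background noise and lower entropy
    both suggest the presence of speech."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    band_power = spectrum[in_band]
    energy = float(band_power.sum())
    p = band_power / (band_power.sum() + 1e-12)        # normalize to a distribution
    entropy = float(-(p * np.log2(p + 1e-12)).sum())   # lower for structured speech
    return energy, entropy

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr // 100) / sr                           # one 10 ms frame
    tone = np.sin(2 * np.pi * 440.0 * t)                    # speech-band tone: lower entropy
    noise = np.random.default_rng(0).normal(size=t.size)    # broadband noise: higher entropy
    print(speech_band_metrics(tone, sr))
    print(speech_band_metrics(noise, sr))
```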
The wake word detector component 554 is configured to monitor and analyze received audio to determine whether wake words are present in the audio. The wake word detector component 554 may analyze the received audio using a wake word detection algorithm. If the wake word detector component 554 detects a wake word, the network microphone device can process the voice input contained in the received audio. An example wake word detection algorithm accepts audio as input and provides an indication of whether a wake word is present in the audio. Many first and third party wake word detection algorithms are known and commercially available. For example, an operator of a voice service may make its algorithms available in a third party device. Alternatively, algorithms may be trained to detect certain wake words.
In some embodiments, the wake word detector component 554 runs multiple wake word detection algorithms on the received audio simultaneously (or substantially simultaneously). As described above, different voice services (e.g., ALEXA of AMAZON, SIRI of APPLE, CORTANA of MICROSOFT, the Assistant of GOOGLE, etc.) each use a different wake word to invoke their respective voice service. To support multiple services, the wake word detector component 554 can run the received audio through the wake word detection algorithm for each supported voice service in parallel. In such embodiments, the network microphone device 103 may include a VAS selector component 556 configured to communicate voice input to the appropriate voice assistant service. In other embodiments, the VAS selector component 556 may be omitted. In some embodiments, each NMD 103 of MPS 100 may be configured to run a different wake word detection algorithm associated with a particular VAS. For example, the NMDs of playback devices 102a and 102b in the living room may be associated with ALEXA of AMAZON and configured to run a corresponding wake word detection algorithm (e.g., configured to detect the wake word "Alexa" or other related wake word), while the NMD of the playback device 102f in the kitchen may be associated with the Assistant of GOOGLE and configured to run a corresponding wake word detection algorithm (e.g., configured to detect the wake word "OK, GOOGLE" or other related wake word).
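A minimal Python sketch of running per-VAS wake word detection in parallel and selecting the matching voice service is shown below. The detector callables are stand-ins for real wake word engines, and the VAS names and substring checks are illustrative assumptions only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional

WakeWordDetector = Callable[[bytes], bool]  # stand-in for a real wake word engine

def select_vas(audio: bytes, detectors: Dict[str, WakeWordDetector]) -> Optional[str]:
    """Run one wake word detection algorithm per supported VAS in parallel and
    return the name of the VAS whose wake word was detected, if any."""
    with ThreadPoolExecutor(max_workers=max(1, len(detectors))) as pool:
        futures = {vas: pool.submit(detect, audio) for vas, detect in detectors.items()}
    for vas, future in futures.items():
        if future.result():
            return vas
    return None

if __name__ == "__main__":
    detectors = {
        "VAS1": lambda audio: b"alexa" in audio.lower(),       # placeholder algorithm
        "VAS2": lambda audio: b"ok, google" in audio.lower(),  # placeholder algorithm
    }
    print(select_vas(b"Alexa, play some jazz in the Living Room", detectors))  # -> "VAS1"
```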
In some embodiments, the network microphone device may include a voice processing component 555 configured to further facilitate voice processing, such as by performing voice recognition trained to recognize a particular user or a particular set of users associated with the home environment. The speech recognition software may implement a speech processing algorithm tuned to a particular speech profile.
In some embodiments, one or more of the above-described components 551-556 can operate in conjunction with microphone 224 to detect and store a voice profile of a user, which can be associated with a user account of MPS 100. In some embodiments, the voice profile may be stored as and/or compared to variables stored in the command information set (or data table 590, as shown in FIG. 5A). The voice profile may include tonal or frequency aspects of the user's voice and/or other unique aspects of the user, such as those described in previously referenced U.S. patent application Ser. No.15/438,749.
In some embodiments, one or more of the above-described components 551-556 can operate in conjunction with the microphone array 524 to determine the location of a user in the home environment and/or relative to the locations of one or more NMDs 103. Techniques for determining a user's location or proximity may include one or more of the techniques disclosed in the following documents: previously referenced U.S. patent application Ser. No. 15/438,749; U.S. Patent No. 9,084,058, entitled "Sound Field Calibration Using Listener Localization," filed December 29, 2011; and U.S. Patent No. 8,965,033, entitled "Acoustic Optimization," filed August 31, 2012. Each of these applications is incorporated herein by reference in its entirety.
Fig. 5B is a diagram of an example voice input according to aspects of the present disclosure. The voice input may be captured by a network microphone device, such as one or more of the NMDs 103 shown in fig. 1A. The voice input may include a wake word portion 557a and a speech utterance portion 557b (collectively, "voice input 557"). In some embodiments, the wake word portion 557a may be a known wake word, e.g., "Alexa," which is associated with ALEXA of AMAZON. In other embodiments, the voice input 557 may not include a wake word.
In some embodiments, the network microphone device may output an audible and/or visual response upon detection of the wake word portion 557a. Additionally or alternatively, the network microphone device may output an audible and/or visual response after processing the voice input and/or a series of voice inputs (e.g., in the case of a multi-turn request).
The speech utterance portion 557b of the voice input 557 may, for example, include one or more spoken commands 558 (identified as first and second commands 558a and 558b, respectively) and one or more spoken keywords 559 (identified as first and second keywords 559a and 559b, respectively). A keyword may be, for example, a word in the voice input that identifies a particular device or group in MPS 100. As used herein, the term "keyword" may refer to a single word (e.g., "Bedroom") or a phrase (e.g., "the Living Room"). In one example, the first command 558a may be a command to play music, such as a particular song, album, playlist, or the like. In this example, the keywords 559 may be one or more words identifying one or more zones in which the music is to be played, for example, "the Living Room" and "the Study" (fig. 1A). In some examples, the speech utterance portion 557b may include other information, such as pauses (e.g., periods of no speech) detected between words spoken by the user, as shown in fig. 5B. A pause may delineate the locations of separate commands, keywords, or other information spoken by the user within the speech utterance portion 557b.
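The structure described above can be illustrated with a short Python sketch that models a transcribed voice input as a wake word portion plus commands and keywords, where keywords are taken to be known zone names found in the utterance. The parsing rule is a deliberately rough assumption, not the disclosed speech processing.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VoiceInput:
    """Mirrors the structure of fig. 5B: a wake word portion plus an utterance
    portion carrying one or more commands and keywords."""
    wake_word: str
    commands: List[str]
    keywords: List[str]

def parse_voice_input(text: str, known_zones: List[str]) -> VoiceInput:
    """Rough rule: the wake word precedes the first comma; known zone names
    found in the remainder are keywords; the remainder is kept as the command."""
    wake_word, _, utterance = text.partition(",")
    utterance = utterance.strip()
    keywords = [zone for zone in known_zones if zone.lower() in utterance.lower()]
    return VoiceInput(wake_word.strip(), [utterance], keywords)

if __name__ == "__main__":
    zones = ["the Living Room", "the Study", "Bedroom"]
    print(parse_voice_input("Alexa, play some jazz in the Living Room and the Study", zones))
```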
In some embodiments, MPS 100 is configured to temporarily reduce the volume of the audio content it is playing when the wake word portion 557a is detected. As shown in fig. 5B, MPS 100 may restore the volume after processing the voice input 557. Such a process may be referred to as ducking, examples of which are disclosed in previously referenced U.S. patent application Ser. No. 15/438,749.
l. Example Network and Remote Computing System
As described above, MPS 100 may be configured to communicate with one or more remote computing devices (e.g., cloud servers) associated with one or more VASs. Fig. 6 is a functional block diagram illustrating example remote computing devices associated with an example VAS configured to communicate with MPS 100. As shown in fig. 6, in various embodiments, one or more NMDs 103 may send voice input via WAN 107 to one or more remote computing devices associated with one or more VASs. For illustration purposes, the selected communication path of the voice input 557 is represented by arrows in fig. 6. In some embodiments, the one or more NMDs 103 send only the speech utterance portion 557b (fig. 5B) of the voice input 557 to the remote computing devices associated with the one or more VASs (without sending the wake word portion 557a). In some embodiments, the one or more NMDs 103 send both the speech utterance portion 557b and the wake word portion 557a (fig. 5B) to the remote computing devices associated with the one or more VASs.
As shown in fig. 6, a remote computing device associated with the VAS can include a memory 616, an intent engine 662, and a system controller 612 including one or more processors. In some embodiments, the intent engine 662 is a subcomponent of the system controller 612. The memory 616 may be a tangible computer-readable medium configured to store instructions executable by the system controller 612 and/or one or more of the playback devices, NMDs, and/or controller devices 102-104.
The intent engine 662 may receive speech input from MPS 100 after the speech input has been converted to text by a speech-to-text engine (not shown). In some embodiments, the speech-to-text engine is a component on a remote computing device associated with a particular VAS. Additionally or alternatively, the speech-to-text engine can be located at or distributed across one or more other computing devices, such as one or more of one or more remote computing devices 106d (fig. 1B) and/or local network devices of MPS 100 (e.g., one or more of playback devices, NMDs, and/or controller devices 102-104).
When a speech input 557 is received from MPS 100, intent engine 662 processes speech input 557 and determines the intent of speech input 557. In processing the voice input 557, the intent engine 662 may determine whether a particular command criterion is met for a particular command detected in the voice input 557. Command criteria for a given command in a voice input may be based, for example, on including certain keywords within the voice input. Additionally or alternatively, the command criteria for a given command may involve detecting one or more control state variables and/or zone state variables in conjunction with detecting the given command. The control state variables may include, for example, indicators identifying volume levels, queues associated with one or more devices, and playback status (e.g., whether a device is playing a queue, pausing, etc.). The zone state variables may, for example, include an indicator identifying which zone players are grouped (if any). The command information may be stored, for example, in the memory of the database 664 and/or in the memory 216 of one or more network microphone devices.
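As a rough Python sketch, the check below evaluates whether the criteria for a detected command are met, based on required keywords and on a control state variable such as playback status. The command names, state keys, and criteria are assumptions made for illustration.

```python
from typing import Dict, List

def command_criteria_met(command: str,
                         detected_keywords: List[str],
                         control_state: Dict[str, str],
                         required_keywords: Dict[str, List[str]]) -> bool:
    """Return True when a detected command's criteria are satisfied, based on
    required keywords and a playback-status control state variable."""
    needed = required_keywords.get(command, [])
    if not all(keyword in detected_keywords for keyword in needed):
        return False
    if command == "pause":
        # e.g., a pause command is only actionable while something is playing
        return control_state.get("playback_status") == "playing"
    return True

if __name__ == "__main__":
    state = {"playback_status": "playing"}
    criteria = {"pause": [], "move music": ["the Study"]}
    print(command_criteria_met("pause", [], state, criteria))                  # True
    print(command_criteria_met("move music", ["the Study"], state, criteria))  # True
    print(command_criteria_met("move music", [], state, criteria))             # False
```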
In some embodiments, the intent engine 662 communicates with one or more databases 664 associated with the selected VAS and/or one or more databases of MPS 100. The VAS database 664 and the database of MPS 100 can store various user data, analytics data, catalogs, and other information for NLU related and/or other processing. The VAS database 664 can reside in the memory 616 of a remote computing device associated with the VAS or elsewhere, such as in the memory of one or more of the remote computing device 106d and/or local network devices (e.g., playback devices, NMDs, and/or controller devices 102-104) of MPS 100 (fig. 1A). Likewise, the media playback system database can reside in memory of remote computing devices and/or local network devices of MPS 100 (e.g., playback devices, NMDs, and/or controller devices 102-104) (fig. 1A). In some embodiments, the VAS database 664 and/or a database associated with MPS 100 can be updated for adaptive learning and feedback based on speech input processing.
The various local network devices 102-105 (fig. 1A) and/or remote computing devices 106d of MPS 100 can exchange various feedback, information, instructions, and/or related data with the remote computing devices associated with the selected VAS. Such exchanges may be related to or independent of the transmitted message containing the voice input. In some embodiments, the remote computing device and the media playback system 100 may exchange data via a communication path as described herein and/or using a metadata exchange channel as described in previously referenced U.S. patent application No.15/438,749.
Fig. 7 depicts an example network system 700 in which a voice-assisted media content selection process is performed. Network system 700 includes MPS 100 coupled to: (i) a first VAS 160 and an associated remote computing device 106a; (ii) One or more second VASs 760, each hosted by one or more respective remote computing devices 706 a; and (iii) a plurality of MCSs 167, such as a first media content service 762 (or "MCS 762") hosted by one or more respective remote computing devices 106b and a second media content service 763 (or "MCS 763") hosted by one or more respective remote computing devices 106 c. In some embodiments, MPS 100 may be coupled to more or fewer VASs (e.g., one VAS, three VASs, four VASs, five VASs, six VASs, etc.) and/or more or fewer media content services (e.g., one MCS, three MCSs, four MCSs, five MCSs, six MCSs, etc.).
As previously described, in some embodiments, individual playback devices of MPS 100 may be coupled to or associated with the first VAS 160, while other playback devices may be coupled to or associated with the second VAS 760. For example, a first playback device of MPS 100 may be configured to detect a first wake word (e.g., "OK, GOOGLE" for the Assistant of GOOGLE) associated with the first VAS 160. After detecting the first wake word, the first playback device may send the speech utterance to the first VAS 160 for further processing. Meanwhile, a second playback device of MPS 100 may be configured to detect a second wake word (e.g., "Alexa" for ALEXA of AMAZON) associated with the second VAS 760. After detecting the second wake word, the second playback device may send the speech utterance to the second VAS 760 for processing. As a result, MPS 100 may enable a user to interact with a plurality of different VASs via voice control.
MPS 100 may be coupled to VAS 160, 760 and/or first MCS 762 and second MCS 763 (and/or their associated remote computing devices 106a, 706a, 106B, and 106 c) via a WAN and/or LAN 111 (fig. 1B) connected to WAN 107 and/or one or more routers 109. As such, various local network devices 102-105 of MPS 100 and/or one or more remote computing devices 106d of MPS 100 can communicate with remote computing devices of VAS 160, 760 and MCSes 762, 763.
In some embodiments, MPS 100 may be configured to communicate with multiple MCSes 167 and/or VASs 160, 760 simultaneously. For example, MPS 100 may send search requests for particular content to both the first MCS 762 and the second MCS 763 in parallel, and may send voice input data to one or more of the VASs 160, 760 in parallel.
Example systems and methods for associating playback devices with voice assistant services
Fig. 8, including figs. 8A-8H, illustrates an example process flow for associating a Voice Assistant Service (VAS) with one or more playback devices. As described above, example VASs include ALEXA of AMAZON, SIRI of APPLE, CORTANA of MICROSOFT, the Assistant of GOOGLE, and the like. A VAS may be a remote service implemented by cloud servers to process voice input and perform certain responsive actions. In some embodiments, the VAS can communicate with the playback device 102 via an integrated network microphone device 103. In other embodiments, the VAS may communicate with a separate network microphone device (e.g., HOME of GOOGLE, ECHO DOT of AMAZON, etc.), which in turn communicates with the playback device 102.
In the process flow shown in fig. 8, a user may associate a first VAS (referred to as "VAS1" in fig. 8) with one or more playback devices. This association may be established for playback devices that include an integrated network microphone device or playback devices that do not include an integrated network microphone device but are coupled to a separate network microphone device, e.g., the playback devices communicate with a separate network microphone device over LAN 111 (fig. 1B). Several of the steps shown in fig. 8 may be performed via control device 104 (fig. 4A) (e.g., graphical images may be displayed and user input received via user interface 440). As described below, certain steps may be performed by a software control application ("MPS app") running on a smart phone, tablet or computer associated with the media playback system, while other steps may be performed by another software control application ("VAS 1 app") running on a smart phone, tablet or computer associated with the first VAS.
Referring to fig. 8A, the process may begin at any one of three different stages. At interface 801, the user has the option to select "add voice control" via the MPS app. At stage 803, the user may initiate the process by adding a new playback device to MPS 100 (which may also be done via the MPS app). At interface 805, the user may select "add voice service" via the settings screen of the MPS app. Process 800 proceeds to decision block 807 from interface 801 or 805, or by selecting "add player" at stage 803.
At decision block 807, if voice control has been previously enabled on the media playback system, processing proceeds to FIG. 8B to select a voice service at interface 809. If voice control was not previously enabled on the media playback system at decision block 807 of FIG. 8A, processing continues to decision block 811 to determine whether a voice-enabled playback device is present in the home environment. If a playback device with voice functionality is present, processing proceeds to interface 813 where the user is prompted to select "add voice service".
If, in decision block 811, no voice-enabled playback device is present in the home environment, processing continues to decision block 815 to determine whether a separate network microphone device (e.g., a VAS1 home device) associated with the first VAS is present. If a VAS1 home device is present, the user is prompted via interface 817 to access the VAS1 app. If the user selects "access VAS1 app" at interface 817, processing continues to stage 819 in fig. 8B, where the user accesses the VAS1 app. Via the VAS1 app, the user can identify a previously configured network microphone device (e.g., a VAS1 home device) and associate that device with a selected playback device to enable voice control. For example, a non-voice-enabled playback device (e.g., a PLAY:5™) may be coupled to a GOOGLE HOME MINI network microphone device. The coupled devices may together provide voice-enabled control of audio playback.
As described above, the process can identify the presence of a VAS1 NMD (i.e., an NMD that has been previously associated with a VAS 1) located in an environment (e.g., the same home environment, the same local area network, etc.). For example, the Media Playback System (MPS) 100 (fig. 1B) can identify the VAS1 NMD as a speech-enabled device associated with the VAS1, and thus is a candidate device for pairing with a playback device.
The MPS can use a number of different techniques to identify a VAS1 NMD. For example, the VAS1 NMD may be configured to transmit beacon signals to nearby devices, e.g., using BLUETOOTH Low Energy beacon technology (e.g., iBeacon of APPLE, Eddystone of GOOGLE, etc.), Wi-Fi signals (e.g., Wi-Fi Aware), or other suitable identification signals. In some embodiments, the VAS1 NMD may be identified as being present on the same network as the MPS based on known attributes of the VAS1 NMD (e.g., default IP address, MAC address, etc.). For example, the MPS may query a database of default IP addresses and MAC addresses for different VAS1 NMDs. If the IP address or MAC address of any device identified on the local area network corresponds to an IP address or MAC address in the queried database, the MPS can identify those devices as corresponding VAS1 NMDs. In some embodiments, the MPS app and VAS1 app may be installed on the same smart phone or other device, and the MPS app may query or otherwise interact with the VAS1 app to determine whether any VAS1 NMDs are present on the same network.
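One of the techniques above, matching devices on the local network against a database of known address attributes, can be sketched in Python as follows. The MAC address prefixes and device records are hypothetical; a real implementation would query a maintained database of VAS1 NMD attributes.

```python
from typing import Dict, List

# Hypothetical MAC address prefixes for VAS1 home devices; real values would
# come from a maintained database of default VAS1 NMD attributes.
KNOWN_VAS1_MAC_PREFIXES = {"44:00:10", "f0:ef:86"}

def find_vas1_nmds(devices_on_lan: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Flag devices on the local network whose MAC prefix matches a known
    VAS1 NMD, making them candidates for pairing with a playback device."""
    return [device for device in devices_on_lan
            if device["mac"].lower()[:8] in KNOWN_VAS1_MAC_PREFIXES]

if __name__ == "__main__":
    lan = [
        {"ip": "192.168.1.12", "mac": "44:00:10:AA:BB:CC"},  # hypothetical VAS1 NMD
        {"ip": "192.168.1.30", "mac": "9C:8E:CD:11:22:33"},  # unrelated device
    ]
    print(find_vas1_nmds(lan))
```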
Returning to fig. 8A, if there is no VAS1 home device at decision block 815, processing returns to the main menu at stage 821 or proceeds to interface 823, where the MPS app provides the user with a "Learn more" option to receive help and hints for enabling a VAS. If this option is selected, processing proceeds to provide "help and hints" content at stage 825 (fig. 8B). At this point, neither a voice-enabled playback device nor a separate VAS1 home device has been detected, so the process terminates without associating a VAS with a playback device.
Referring now to fig. 8B, the user is prompted via a user interface 809 of the MPS app to select one of the multiple VASs (here, VAS1 and VAS 2) to add to a particular playback device. After selection, the MPS app presents the selected content at interface 826 (here, the user has selected VAS 1) and prompts the user to continue the setup process (i.e., by selecting the "Let's get started" button).
Fig. 8B shows three variations of interfaces that may follow the user's selection at interface 826. In the respective interfaces 827, 829, and 831, the user interface can present all rooms with voice-enabled devices, including playback devices with integrated network microphone devices and independent network microphone devices. For example, as shown, the kitchen and living room devices are not associated with any VAS, while the master bedroom device has been previously associated with another VAS (here VAS 2). The office equipment has previously been associated with VAS1 and is therefore not user selectable at this interface. Using the radio button, the user can select the device to which the VAS1 should be added. If VAS1 has been previously added to a device (e.g., associated with "Office"), then Office's option may be set to gray to indicate that VAS1 cannot be added to the room. The user interface may also indicate a room in which another VAS (e.g., VAS 2) has been previously enabled, such as "Master Bedroom" in which VAS2 has been enabled.
Once the user has selected a room to which VAS1 should be added via one of the interfaces 827, 829 or 831, the process continues to fig. 8C to decision block 833 to determine if other voice assistants are enabled in the selected room. If not, processing continues to interface 834 to prompt the user to go to VAS1 app to continue the setup process. Once the user selects "Go to VAS1 App" at interface 834, processing proceeds to decision block 835 to determine if the VAS1 App has been installed on the user device, and if not, the user is prompted to download the VAS1 App at stage 837.
Returning to decision block 833, if the process determines that another voice assistant is enabled in one or more of the selected rooms, the user is prompted via interface 839 to disable or unlink the previously enabled VAS (e.g., by displaying "disable VAS2" and providing a first button labeled "Add VAS1" and a second button labeled "No, continue using VAS2"). As used herein, "disabled" may indicate that a particular VAS will not be associated with a playback device and will not provide voice control functionality. However, in some embodiments, the media playback system or playback device may retain previously granted permissions, user credentials, and other information. Thus, if the user wishes to re-enable a previously disabled VAS, processing may be simplified and the VAS may be re-enabled relatively easily on a given playback device.
In the illustrated embodiment, a user may select only one VAS from among multiple VASs for a particular room or playback device. Thus, if VAS2 was previously enabled in the master bedroom, adding VAS1 to the master bedroom requires that VAS2 be unlinked or otherwise disabled for the master bedroom. If the user selects "No, continue using VAS2" at interface 839, then at decision block 841 processing returns to interface 827, 829, or 831 in fig. 8B for adding the selected VAS to a particular room. If the user selects "Add VAS1" at interface 839, then at decision block 841 processing continues to decision block 835 (in fig. 8C) to determine whether the VAS1 app is on the user device, as previously described. If the VAS1 app is not on the user device, the user is prompted to download the VAS1 app at stage 837. If the VAS1 app is on the user device, processing continues to interface 843 in fig. 8D.
Via interface 843 (which may be displayed via the VAS1 app), the user is prompted to log into a user account associated with the MPS. If the user selects login, processing continues at decision block 845 to interface 847, where the user provides login credentials. If, at decision block 845, the user has selected cancel, the process terminates. Returning to interface 847, once the user has provided credentials and selected "login," processing continues at decision block 849 to interface 851 (fig. 8E), where the user is prompted to grant the VAS1 app permission to perform selected functions, such as playing back audio and video content, displaying metadata, and viewing device groupings on a playback device. If the user selects "I forgot my password" at interface 847 (fig. 8D), then at decision block 849 the process proceeds to stage 853 to initiate a password recovery process.
Returning to FIG. 8E, if the user allows the requested access at interface 851 (e.g., by selecting the "allow" button), then at decision block 853 of FIG. 8E, processing continues to interface 855. If the user selects "cancel" at interface 855, then at decision block 853 the process terminates. Returning to interface 855, the VAS1 app searches for a voice-enabled device. If one or more such devices are found, at decision block 857, processing continues to interface 859, where the identified devices are displayed to the user to select or confirm a playback device to be associated with VAS1 (e.g., by selecting a "next" button). In some embodiments, the user may select multiple devices to associate with VAS1 at interface 859, while in other embodiments the user may be limited to associating a single identified device at a time. If no device is found, processing proceeds to display an error message at decision block 857. At interface 859, once the user confirms the identified device (e.g., by selecting the "next" button), processing proceeds to interface 861 where the user is prompted to provide the selected device with permission to use the user account associated with VAS 1. If the user grants rights via interface 861, processing continues to interface 863, which displays the message while the selected playback device is connected to VAS 1.
If the connection is successful, at decision block 865, processing continues to interface 867 where the user is prompted to select from a pre-filled list of available music service providers that can be used by VAS1 to provide playback via the device. The list of music service providers may include providers previously associated with the user's media playback system as well as music services that have not been associated with the media playback system.
For music service providers (or other media service providers, such as podcast services, audio book services, etc.) that have been previously associated with the user's media playback system, the media playback system may have stored user credentials and login information. In some embodiments, these credentials may be shared with VAS1 to facilitate interaction with and control of these services by VAS1. For example, if a user previously linked a SPOTIFY account to the user's media playback system, during this stage of setting up the voice assistant service the media playback system may send the login credentials of the user's SPOTIFY account to VAS1. As a result, VAS1 can interact directly with SPOTIFY without the user having to re-enter the login credentials.
Once the user has selected one or more music service providers via interface 867 and selected "next," processing proceeds to interface 868 (fig. 8G), where the user is prompted to provide a home address and to allow VAS1 to provide personalized results. In some embodiments, this information may be pre-populated based on previously obtained information. Once the user has provided the requested information or confirmed the pre-filled information and selected the "next" button, processing continues at decision block 869 to interface 871, which prompts the user to grant VAS1 permission to provide personalized results. If the user selects "not now" at interface 868 and does not provide a home address, then at decision block 869 the process terminates or may be redirected to another step in the process (e.g., interface 875).
At interface 871, the user can select "next" to provide rights to the personalized results or "skip" to deny rights. If the user selects "skip," then at decision block 871 of FIG. 8H, the process terminates or is redirected to another step in the process (e.g., interface 875). If the user selects "Next," at decision block 873, processing continues to interface 875, which provides a message informing the user that the playback device has been set with VAS 1. The illustrated interface 875 shows various possible keywords or phrases that a user may speak for engaging the VAS1 via the paired playback device. These keywords or phrases may be the same species as the particular command. Example congeners of the command "Play back music" include "Turn on the radio", "Play today's top hits", and "Play some upbeat pop". Example homonyms that move playback from one location to another include "Move the music to (room)", "Play this in (room)", to "and" Take this into the (room) ". An example cognate for receiving information from VAS1 includes "What's the name of this song? "," When did Single Ladies come out? "and" white is Drake's concert near me? ".
In some embodiments, a user may customize the cognates to be recognized by VAS1 for controlling one or more playback devices or other aspects of the user's environment (e.g., a smart appliance). Using the VAS1 app, the MPS app, or another suitable interface, a user may designate certain utterances (i.e., words or phrases) as cognates corresponding to a particular command. In some embodiments, the user may be presented with a list of possible commands (via the VAS1 app, the MPS app, or another interface). When a particular command is selected (e.g., "increase volume"), the user may be prompted to provide the desired cognate as voice input or text input. The user may provide the input "crank it", which may then be stored as a cognate for the command "increase volume". This input may be provided by speaking the phrase into the VAS1 NMD or by typing the phrase into the provided interface. After this customization, VAS1 may respond to a speech utterance including the phrase "crank it" by increasing the volume of one or more playback devices. Such a process may be repeated for any number of cognates and any number of related commands.
In some embodiments, the stored cognates may be updated or adjusted by VAS1 based on user feedback, training, or adaptive learning. For example, consider a configuration in which the phrase "cut it" has not been designated as a cognate of any particular command. Over time, the user may repeatedly follow the phrase "cut it" with "stop the music" or other similar language to stop playback. In response to this pattern, VAS1 may add "cut it" as a cognate of the stop-playback command, so that the user may use the phrase "cut it" to invoke this command in the future.
Such customization is applicable to a wide range of possible commands, including such classes of commands as playback initiation, playback control, zone targeting, and queries. Example playback commands may include play, play (content), play (content) for a mood, and the like. Example control commands may include pause, stop, next, previous, etc. Example zone targeting commands include grouping devices, ungrouping devices, calibrating devices, pairing/bonding devices, etc. Example query commands may include requests from users for information related to weather, stock reports, the currently playing track, and the like. In various embodiments, the user may customize any or all of the available commands in each of the command categories. In at least some embodiments, any given command can be associated with two or more different user-assigned cognates. A sketch of one way to store such mappings follows.
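A small Python sketch of a cognate table along these lines is shown below: it maps spoken phrases to canonical commands grouped by category and resolves an utterance to a command. The class name, categories, and example phrases are illustrative assumptions rather than the disclosed implementation.

```python
from collections import defaultdict
from typing import Dict, Optional, Set

class CognateTable:
    """Maps spoken phrases (cognates) to canonical commands, grouped by command
    category, mirroring the customization flow described above."""

    def __init__(self) -> None:
        self._phrase_to_command: Dict[str, str] = {}
        self._commands_by_category: Dict[str, Set[str]] = defaultdict(set)

    def add(self, phrase: str, command: str, category: str) -> None:
        self._phrase_to_command[phrase.lower()] = command
        self._commands_by_category[category].add(command)

    def resolve(self, phrase: str) -> Optional[str]:
        return self._phrase_to_command.get(phrase.lower())

if __name__ == "__main__":
    table = CognateTable()
    table.add("let's jam", "play", "playback initiation")         # preloaded cognate
    table.add("crank it", "increase volume", "playback control")  # user-defined cognate
    print(table.resolve("Crank it"))   # -> "increase volume"
```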
Figs. 12A to 12C show other examples of cognates. These cognates may be preloaded or may be user-defined. The commands on the left side of the tables in figs. 12A-12C may have certain cognates shown on the right side of the tables. For example, referring to fig. 12A, the "play" command in the left column has the same intent as the cognate phrases in the right column (including "break it down", "let's jam", and "burst it"). In various embodiments, commands and cognates may be added to, removed from, or edited in the tables. For example, commands and cognates may be added, removed, or edited in response to user customization and preferences, feedback, training, and adaptive learning, as described above. Figs. 12B and 12C illustrate example cognates relating to control and zone targeting, respectively.
With continued reference to fig. 8H, once the user selects "complete" via interface 875, the process returns the user to the MPS app (e.g., display interface 877) and the setup process is complete. The process flow described above with respect to fig. 8 is exemplary and various modifications may be made in different embodiments. For example, the order of the various steps may be altered, and certain steps may be omitted generally (e.g., one or more of the confirmation screen or permission request may be omitted). In addition, other steps may be incorporated into the process, for example, allowing the user to customize other aspects of the selected VAS1 to operate with the selected playback device.
Fig. 9A shows a bonded device pair in which each device has a different associated VAS. As shown, the Bed1 playback device 102f and the Bed2 playback device 102g have been bonded to form a stereo pair. As a bonded pair, these playback devices may be presented to the media playback system for control as a single User Interface (UI) entity. For example, the bonded pair may be presented to the user as a single device (e.g., via user interface 440 of fig. 4A). Although in some cases such a stereo pair may be limited to a single VAS, in the illustrated embodiment each playback device 102f and 102g is associated with a different VAS. Specifically, the Bed1 playback device 102f is associated with the first VAS 160 and the Bed2 playback device 102g is associated with the second VAS 760. For example, the playback devices 102f and 102g may each be associated with a respective VAS via the process shown in fig. 8. The playback devices may be bonded to form a stereo pair before or after the playback devices are enabled with their respective VASs. As a result, a single stereo pair can interact with two different VASs. This arrangement may be extended to three, four, or more VASs, such that the additional playback devices that make up a bonded zone are each enabled with a different VAS, as described below with respect to fig. 9B.
Fig. 9B illustrates a bonded zone of four playback devices, each associated with a different voice assistant service, in accordance with aspects of the present disclosure. Here, the bonded zone is a home theater that includes a playback device 102b named Front and a playback device 102k named SUB that are bonded to each other. The Front device 102b may render a mid-to-high frequency range, and the SUB device 102k may render low frequencies (e.g., as a subwoofer). The Front device 102b and SUB device 102k are further bonded to the right playback device 102a and the left playback device 102j, respectively. In some implementations, the right device 102a and the left device 102j may form surround or "satellite" channels of the home theater system. The bonded playback devices 102a, 102b, 102j, and 102k together form a single bonded zone. As shown in fig. 9B, these playback devices 102a, 102b, 102j, and 102k may each be associated with a different VAS (e.g., the first VAS 160, the second VAS 760, and a third VAS 760b). As a result, a single home theater zone can interact with four different VASs. In some embodiments, two or more separate playback devices of a bonded zone may be associated with the same VAS.
Figs. 9C-9F illustrate example user interfaces for managing the VASs associated with particular playback devices of a bonded zone, in accordance with aspects of the present disclosure. For example, these example interfaces may be accessed using a settings menu of the MPS app or another suitable software application. Specifically, the interfaces shown may be setup screens associated with a single zone, here a stereo (zone B) zone, that includes the bonded stereo pair of the Bed1 playback device 102f and the Bed2 playback device 102g shown in fig. 9A. Using this interface, a user can add or change the VAS associated with each device of the bonded zone even after the bonded zone has been formed. In some embodiments, adding or adjusting the VAS associated with any given device of the bonded zone does not require the bonded zone to be re-established (e.g., the devices do not need to be recalibrated, and the bonded zone remains intact).
In the example of fig. 9C, the Bed1 playback device has no VAS enabled, and an "Add Voice Service" button is presented to the user to add a new VAS. Meanwhile, the Bed2 playback device is indicated as having been associated with VAS1, and the user is presented with a "Change Voice Service" button if the user wishes to disable VAS1 on that device. Through this interface, the user can implement various configurations for associating VASs with playback devices, including both devices having the same associated VAS, each device having a different associated VAS, or only one of the devices having an associated VAS and the other device having none.
Fig. 9D shows an example in which the Bed1 playback device and the Bed2 playback device each have VAS1 enabled. The user can change these VAS associations individually if desired, including removing any VAS association from either device.
In fig. 9E, the Bed1 playback device is associated with VAS1, while the Bed2 playback device is associated with VAS2. Again, the user can modify these associations individually by selecting the "Change Voice Service" button as desired.
Fig. 9F shows an example in which neither the Bed1 playback device nor the Bed2 playback device is associated with a VAS. By selecting the "Add Voice Service" button for either device, the user can associate a desired VAS with one or both of these playback devices.
In each of the examples shown in figs. 9C-9F, a process flow similar to that described above with respect to fig. 8 may be initiated once the user selects "Add Voice Service" or "Change Voice Service". For example, the user may be directed to interface 809 of fig. 8B to select a particular VAS for association with the selected playback device. Fig. 10 illustrates an example method 1000 of utilizing the bonded device pair shown in fig. 9A, wherein each device has a different associated VAS. In block 1002, the method forms a bonded zone of a media playback system that includes a first playback device and a second playback device. For example, the first playback device and the second playback device may be the devices 102f and 102g shown in fig. 9A, which are bonded to form a stereo pair and which may be presented to the media playback system as a single User Interface (UI) entity. For example, when displayed to a user via the user interface 440 (fig. 4A) of the controller device 104, the bonded pair may be displayed as a single "device" for control. As previously described, various devices in the bonded zone may be assigned different playback responsibilities, such as responsibility for a particular audio channel. For example, the Bed1 playback device 102f may be configured to play back the left channel audio component, while the Bed2 playback device 102g may be configured to play back the right channel audio component. In various embodiments, additional playback devices may be bonded together to form a bonded zone, e.g., a bonded zone may be formed by three, four, five, or more playback devices.
The method 1000 continues at block 1004, where a first wake word is detected via a first network microphone device of the first playback device. For example, the Bed1 playback device 102f may include a networked microphone device 103 (fig. 2A) configured to receive audio input. The audio input may be a voice input 557 including a wake word portion 557a and a speech utterance portion 557b (fig. 5B).
In block 1006, the method sends the first speech utterance to the first speech assistant service in response to detecting the first wake word in block 1004. As previously described, the speech utterance may follow the first wake-up word detected by the networked microphone device in block 1004, and the first speech utterance may be captured via the same networked microphone device. The speech utterance may take a variety of forms, including, for example, a request to play back a first media content (e.g., a particular song, album, podcast, etc.). In other embodiments, the speech utterance may be a command to be executed locally by the playback device, such as grouping or binding the device with other playback devices, adjusting the playback volume of the device, disabling a microphone of the device, or other suitable command.
The method 1000 proceeds to block 1008 to play back the first media content synchronously via the first playback device and the second playback device. For example, the media playback system may receive the requested media content from the first voice assistant service. The requested media content is then played back via the binding region, which includes the first playback device and the second playback device playing back the audio content in synchronization with each other.
Returning to block 1002, the method 1000 also proceeds along a second flow to block 1010 to detect a second wake word via a second network microphone device of a second playback device. For example, the second playback device may be the Bed2 playback device 102g and may include a networked microphone device 103 (fig. 2A) configured to receive audio input. As previously described, the audio input may be a speech input 557 that includes a wake word portion 557a and a speech utterance portion 557B (fig. 5B). This second wake word detection may occur before or after the detection of the first wake word in block 1004.
In block 1012, the method 1000 sends a second speech utterance requesting playback of a second media content to a second VAS. As described above, the speech utterance may follow the second wake word detected by the networked microphone device in block 1010, and the second speech utterance may be captured via the same networked microphone device.
Method 1000 continues at block 1014 where the second media content is played back synchronously via the first playback device and the second playback device. For example, the media playback system may receive the requested media content from the second voice assistant service. The requested media content is then played back via the binding region, which includes the first playback device and the second playback device playing back the audio content in synchronization with each other.
In some embodiments, the second wake word may be different from the first wake word and may also be associated with a different VAS. For example, the first wake word detected in block 1004 may be "Alexa" and the first VAS may be Alexa of AMAZON, while the second wake word detected in block 1010 may be "OK, google" and the second VAS may be an Assistant of Google. Various other configurations are possible. For example, the first wake word or the second wake word may be associated with a local command, and the first VAS or the second VAS may be a local service rather than a service associated with one or more remote computing devices. For example, the second wake word may be "Hey Sonos" and the second voice assistant service may be a local VAS stored on one or more playback devices of the media playback system that is configured to respond to voice input and execute commands (e.g., adjust volume, group or bind playback devices, deactivate microphones, etc.).
By associating the individual devices of a binding region with different voice assistant services (which may have different corresponding wake words), the method allows a user to interact with a single UI entity (i.e., a bonded pair or zone, which is displayed as a single device via the media playback system) that can reach two different VASs. Thus, even if an individual playback device cannot be associated with multiple VASs, a user can access multiple VASs through a single UI entity via the binding region. This advantageously allows the user to realize the benefits of multiple VASs (where each VAS may be advantageous in different respects), rather than requiring the user to limit their interaction to a single VAS to the exclusion of any others.
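A minimal sketch of this presentation follows, assuming hypothetical member records: the controller lists the binding region once, as a single device, while the union of its members' voice assistant services remains reachable through that one entry.

```python
# Hypothetical sketch: a binding region shown to the controller as one UI entity
# while each member keeps its own VAS association.
from collections import namedtuple

Member = namedtuple("Member", ["name", "vas_name"])

def ui_entry(members):
    # The controller shows the bonded pair once, not each member separately.
    return {
        "display_name": " + ".join(m.name for m in members),
        "shown_as": "single device",
        "reachable_services": sorted({m.vas_name for m in members}),
    }

print(ui_entry([Member("Bed1", "VAS1"), Member("Bed2", "VAS2")]))
# -> {'display_name': 'Bed1 + Bed2', 'shown_as': 'single device',
#     'reachable_services': ['VAS1', 'VAS2']}
```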
FIG. 11 is a process flow for associating a stereo pair of playback devices with a single voice assistant service. In some embodiments, it may be desirable to prevent a user from associating the various devices of a binding region with different VASs. For example, for the two devices of a stereo pair or the multiple devices of a home theater setup, it may be desirable to limit the binding region to a single VAS. However, in some cases a device may have been associated with a different VAS before the binding region is formed. To accommodate the formation of binding regions without allowing multiple VASs, a process flow as outlined in FIG. 11 may be used. The process begins by bonding two playback devices to create a stereo pair. As previously described, this may be accomplished, for example, via the MPS app, user voice input, or other suitable input that results in a bonded stereo pair. Processing continues to decision block 1103 to determine whether either of the two playback devices has an associated VAS. If neither device has an associated VAS, processing continues to decision block 1105 to determine whether the bonded pair is capable of voice control (i.e., whether a playback device has an integrated NMD or an associated NMD through which voice input is received). If the bonded pair is capable of voice control at decision block 1105, processing continues to the voice service setup flow (interface 813 in FIG. 8B). Following this procedure, the bonded pair of devices may be associated together with a single VAS (e.g., VAS1 depicted in FIG. 8). If the bonded pair is not capable of voice control at decision block 1105, processing continues to Trueplay™ at stage 1107 to calibrate the devices for stereo playback in a particular room or space. In this case, the bonded pair is not associated with any VAS.
Returning to decision block 1103, if both devices have an associated VAS, processing continues to decision block 1111 to determine whether each device is associated with the same VAS. If so, processing continues to Trueplay™ at stage 1107. In this case, the bonded stereo pair is configured to be associated with the single VAS that was previously associated with both devices.
If, at decision block 1111, the two playback devices do not have the same associated VAS, processing continues to decision block 1113. If only one of the two playback devices has an associated VAS, the user is prompted to re-authorize that VAS on the newly added playback device. For example, if a first playback device of a stereo pair has previously been associated with VAS1 and a second playback device has not been associated with any VAS, then once the two devices are bonded to form a stereo pair, the user may be prompted to authorize the second playback device to be associated with VAS1. As a result, the bonded stereo pair may be configured to operate with VAS1.
At decision block 1113, if each playback device of the stereo pair has a different associated VAS, the user is prompted via interface 1115 to select one of the two different VASs. At decision block 1117, if the user selects a VAS, processing continues to the voice service options (interface 809 of FIG. 8). The process will disable or unlink one of the playback devices from the VAS previously associated with it and instead associate that playback device with the same VAS as the other device of the bonded pair. For example, if a first playback device of a bonded stereo pair is associated with VAS1 and a second playback device of the bonded stereo pair is associated with VAS2, then at interface 1115 the user is prompted to select either VAS1 or VAS2 for the bonded stereo pair. If the user selects VAS2, VAS1 will be disabled or unlinked from the first playback device, and the first playback device will instead be associated with VAS2. As a result, the bonded stereo pair is limited to being associated with a single VAS. If the user selects "do not use voice" at interface 1115, then at decision block 1117 the process terminates and the bonded stereo pair will not have an associated VAS.
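As an illustration only, the following Python sketch traces the decision blocks of FIG. 11. The function reconcile_vas and the prompt_user callable are hypothetical stand-ins for the controller interfaces (e.g., 813, 1115, 1117); the sketch is a sketch of the flow under those assumptions, not an actual implementation.

```python
# Hypothetical sketch of the FIG. 11 flow for keeping a bonded stereo pair on
# at most one VAS; prompt_user() stands in for the controller interfaces.

def reconcile_vas(first_vas, second_vas, voice_capable, prompt_user):
    """Return the single VAS the bonded pair should use, or None for no voice."""
    # Decision block 1103: does either device already have an associated VAS?
    if first_vas is None and second_vas is None:
        # Decision block 1105: run voice setup only if the pair can accept voice.
        if voice_capable:
            return prompt_user("Set up a voice service for the new pair")
        return None  # go straight to calibration with no VAS
    # Decision block 1111: both devices already share the same VAS.
    if first_vas == second_vas:
        return first_vas
    # Decision block 1113, case 1: only one device has a VAS; ask the user to
    # authorize that same VAS on the newly added device.
    if first_vas is None or second_vas is None:
        existing = first_vas or second_vas
        prompt_user(f"Authorize {existing} on the newly added device")
        return existing
    # Decision block 1113, case 2 (interfaces 1115/1117): each device has a
    # different VAS, so the user must pick one or choose not to use voice.
    choice = prompt_user(f"Choose {first_vas} or {second_vas} for the pair")
    return None if choice == "do not use voice" else choice


# Example: simulate a user who always picks the second device's VAS.
print(reconcile_vas("VAS1", "VAS2", True, lambda msg: "VAS2"))  # -> VAS2
```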
Conclusion
The above description discloses various example systems, methods, apparatus, articles of manufacture, etc., including firmware and/or software executed on hardware, as well as other components. It should be understood that such examples are illustrative only and should not be considered limiting. For example, it is contemplated that any or all of these firmware, hardware, and/or software aspects or components could be embodied exclusively in hardware, exclusively in software, exclusively in firmware or in any combination of hardware, software, and/or firmware. Thus, the examples provided are not the only way to implement such systems, methods, apparatus, and/or articles of manufacture.
In addition to the examples described herein with respect to grouping and binding playback devices, in some implementations multiple playback devices may be merged together. For example, a first playback device may be merged with a second playback device to form a single merged "device". The merged playback devices may not be explicitly assigned different playback responsibilities. That is, each merged playback device may play back the audio content as it would when not merged, in addition to playing back the audio content synchronously with the other. However, the merged device may be presented to the media playback system and/or to the user for control as a single user interface (UI) entity.
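A short sketch of this distinction follows; the responsibility labels ("left", "right", "full range") are illustrative assumptions about how bonded versus merged members might render the content.

```python
# Hypothetical contrast between bonded and merged groupings.

def playback_responsibilities(devices, mode):
    """Return what each member renders while the group plays in sync."""
    if mode == "bonded":
        # Bonded devices take on distinct responsibilities, e.g. stereo channels.
        channels = ["left", "right"]
        return {d: channels[i % len(channels)] for i, d in enumerate(devices)}
    if mode == "merged":
        # Merged devices keep playing the content as they would on their own.
        return {d: "full range" for d in devices}
    raise ValueError(f"unknown mode: {mode}")

print(playback_responsibilities(["Bed1", "Bed2"], "bonded"))
print(playback_responsibilities(["Bed1", "Bed2"], "merged"))
```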
The present specification is presented largely in terms of illustrative environments, systems, procedures, steps, logic blocks, processing, and other symbolic representations that directly or indirectly resemble the operations of data processing devices coupled to networks. These process descriptions and representations are typically used by those skilled in the art to most effectively convey the substance of their work to others skilled in the art. Numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. However, it will be understood by those skilled in the art that certain embodiments of the present disclosure may be practiced without certain specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments. The scope of the disclosure is, therefore, indicated by the appended claims rather than by the foregoing description of the embodiments.
Where any of the appended claims are read to cover a purely software and/or firmware implementation, at least one of the elements in at least one example is hereby expressly defined to include a tangible, non-transitory medium such as a memory, DVD, CD, Blu-ray disc, and so on, storing the software and/or firmware.
For example, the present technology is illustrated according to the various aspects described below. For convenience, various examples of aspects of the present technology are described as numbered examples (1, 2, 3, etc.). These are provided as examples and do not limit the present technology. It should be noted that any of the dependent examples may be combined in any combination and placed into a respective independent example. The other examples may be presented in a similar manner.
Example 1: a method, comprising: detecting a first wake word via a first network microphone device of a first playback device; detecting a second wake word via a second network microphone device of a second playback device; forming a bound zone of the media playback system, the bound zone comprising a first playback device and a second playback device; in response to detecting the first wake word via the first network microphone device: transmitting, to one or more remote computing devices associated with a first voice assistant service, a first voice utterance requesting playback of first media content; and playing back the first media content in synchronization with each other via the first playback device and the second playback device of the binding region; and in response to detecting the second wake word via the second network microphone device: transmitting a second speech utterance requesting playback of second media content to one or more remote computing devices associated with a second speech assistant service; and playing back the second media content in synchronization with each other via the first playback device and the second playback device of the binding region. Example 2: the method of example 1, wherein the first wake word is associated with a first voice assistant service, the second wake word is associated with a second voice assistant service, and wherein the first wake word is different from the second wake word. Example 3: the method of example 1 or 2, wherein at least a portion of the first speech utterance is additionally captured via the second network microphone device, and wherein the second network microphone device does not transmit the first speech utterance to one or more remote computing devices associated with the second speech assistant service. Example 4: the method of any of examples 1-3, further comprising presenting the binding locale as a single User Interface (UI) entity via a media playback system. Example 5: the method of example 4, wherein presenting the binding locale comprises displaying the binding locale as a single device via a controller device of the media playback system. Example 6: the method of any of examples 1-5, wherein the forming of the binding region is performed before the detecting of the first wake word and the detecting of the second wake word. Example 7: the method of any of examples 1-6, further comprising: the first network microphone device is associated with a first wake word engine before the first wake word is detected, and the second network microphone device is associated with a second wake word engine different from the first wake word engine before the second wake word is detected. Example 8: the method of any of examples 1-7, wherein the first playback device and the second playback device are assigned different playback responsibilities when playing back the first media content and the second media content in synchronization with each other. Example 9: the method of any of examples 1-8, further comprising calibrating the first playback device and the second playback device simultaneously after the formation of the bonded region. Example 10: the method of any of examples 1-9, further comprising grouping a third playback device with the binding region, and wherein playing back the first media content comprises playing back the first media content via the first playback device, the second playback device, and the third playback device in synchronization with each other. 
Example 11: a media playback system, comprising: one or more processors; a first network microphone device; a second network microphone device; and a tangible, non-transitory, computer-readable medium storing instructions executable by the one or more processors to cause the media playback system to perform operations comprising the method of any one of examples 1-10. Example 12: a tangible, non-transitory, computer-readable medium storing instructions executable by one or more processors to cause a media playback system to perform operations comprising the method of any one of examples 1-10.

Claims (15)

1. A method for synchronized playback, comprising:
detecting a first wake word via a first network microphone device of a first playback device, wherein the first playback device is coupled to a binding region of a media playback system that includes at least a second playback device, the first playback device and the second playback device having different associated voice assistant services, the first playback device being associated with a first voice assistant service, the second playback device being associated with a second voice assistant service, the first voice assistant service being different from the second voice assistant service, the first playback device and the second playback device being presented to the media playback system as a single user interface (UI) entity for control, the binding region being configured to synchronously play back media;
in response to detecting, via the first network microphone device, the first wake word associated with the first voice assistant service:
the first playback device sending a first speech utterance requesting playback of first media content to one or more remote computing devices associated with the first voice assistant service; and
playing back the first media content via the binding region such that the first playback device and the second playback device play back in synchronization with each other; and
detecting, via a second network microphone device of the second playback device, a second wake word associated with the second voice assistant service;
in response to detecting the second wake word via the second network microphone device:
the second playback device sending a second speech utterance requesting playback of second media content to one or more remote computing devices associated with the second voice assistant service; and
playing back the second media content via the binding region such that the first playback device and the second playback device play back in synchronization with each other.
2. The method of claim 1, wherein a user is restricted to associating no more than one voice assistant service with a particular playback device.
3. The method of claim 1 or 2, wherein if the user wishes to change a voice assistant service associated with a particular playback device, the user must choose to disable a first voice assistant service associated with the particular playback device and choose to enable a second voice assistant service associated with the particular playback device.
4. The method of claim 1 or 2, wherein the first playback device and the second playback device are assigned different playback responsibilities when playing back the first media content and the second media content in synchronization with each other.
5. The method of claim 1 or 2, further comprising calibrating the first playback device and the second playback device of the binding region simultaneously.
6. The method of claim 1 or 2, further comprising grouping a third playback device with the binding region, and wherein playing back the first media content or the second media content via the binding region comprises playing back the first media content or the second media content via the first playback device, the second playback device, and the third playback device in synchronization with one another.
7. The method of claim 1 or 2, further comprising presenting the binding region as a single User Interface (UI) entity via the media playback system.
8. The method of claim 7, wherein presenting the binding region comprises displaying the binding region as a single device via a controller device of the media playback system.
9. The method of claim 1 or 2, further comprising:
before associating a voice assistant service with at least one of the first playback device and the second playback device, selecting, by a user, to form the binding region including the first playback device and the second playback device; and
after forming the binding region, accessing a settings menu of the binding region that includes a selectable interface for each device of the binding region to enable the user to select from a plurality of voice assistant services to associate with each particular device of the binding region.
10. The method of claim 1 or 2, further comprising:
associating, by a user, the first playback device and the second playback device with the respective first voice assistant service and second voice assistant service via a user interface prior to forming the binding region; and
after forming the binding region including the first playback device and the second playback device, accessing a settings menu of the binding region, the settings menu including selectable options for changing the voice assistant service associated with each member playback device of the binding region.
11. The method of claim 10, further comprising selecting at least one of the selectable options to change the voice assistant service associated with at least one of the member devices of the binding region.
12. The method of claim 1 or 2, further comprising sending user credentials corresponding to a particular music service to one or more voice assistant services for retrieving the first media content or the second media content for playback.
13. The method of claim 1 or 2, wherein the first playback device and the second playback device are one of:
a playback device having an integrated network microphone device; and
a playback device without an integrated network microphone device that is coupled to a separate network microphone device.
14. The method of claim 1, further comprising:
associating the first network microphone device with a first wake word engine prior to detecting the first wake word; and
associating the second network microphone device with a second wake word engine, different from the first wake word engine, prior to detecting the second wake word.
15. A media playback system, comprising:
a first playback device;
a second playback device; and
a controller interface configured to control the media playback system;
wherein the media playback system is configured to perform the method of any one of claims 1-14.
CN201980056604.9A 2018-06-28 2019-06-28 System and method for associating playback devices with voice assistant services Active CN112640475B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311238129.1A CN117316150A (en) 2018-06-28 2019-06-28 System and method for associating playback devices with voice assistant services

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US201862691587P 2018-06-28 2018-06-28
US62/691,587 2018-06-28
US16/022,662 2018-06-28
US16/022,662 US10681460B2 (en) 2018-06-28 2018-06-28 Systems and methods for associating playback devices with voice assistant services
PCT/US2019/039828 WO2020006410A1 (en) 2018-06-28 2019-06-28 Systems and methods for associating playback devices with voice assistant services

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202311238129.1A Division CN117316150A (en) 2018-06-28 2019-06-28 System and method for associating playback devices with voice assistant services

Publications (2)

Publication Number Publication Date
CN112640475A CN112640475A (en) 2021-04-09
CN112640475B true CN112640475B (en) 2023-10-13

Family

ID=67297439

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201980056604.9A Active CN112640475B (en) 2018-06-28 2019-06-28 System and method for associating playback devices with voice assistant services
CN202311238129.1A Pending CN117316150A (en) 2018-06-28 2019-06-28 System and method for associating playback devices with voice assistant services

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202311238129.1A Pending CN117316150A (en) 2018-06-28 2019-06-28 System and method for associating playback devices with voice assistant services

Country Status (3)

Country Link
EP (1) EP3815384A1 (en)
CN (2) CN112640475B (en)
WO (1) WO2020006410A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023049866A2 (en) * 2021-09-24 2023-03-30 Sonos, Inc. Concurrency rules for network microphone devices having multiple voice assistant services

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104299631A (en) * 2013-07-17 2015-01-21 布克查克控股有限公司 Delivery of synchronised soundtrack for electronic media content
WO2015195216A1 (en) * 2014-06-19 2015-12-23 Apple Inc. User detection by a computing device
CN105284076A (en) * 2013-04-16 2016-01-27 搜诺思公司 Private queue for a media playback system
CN106688250A (en) * 2014-09-09 2017-05-17 搜诺思公司 Playback device calibration
CN106796784A (en) * 2014-08-19 2017-05-31 努恩斯通讯公司 For the system and method for speech verification
WO2018027142A1 (en) * 2016-08-05 2018-02-08 Sonos, Inc. Multiple voice services

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8234395B2 (en) 2003-07-28 2012-07-31 Sonos, Inc. System and method for synchronizing operations among a plurality of independently clocked digital data processing devices
US8483853B1 (en) 2006-09-12 2013-07-09 Sonos, Inc. Controlling and manipulating groupings in a multi-zone media system
US9084058B2 (en) 2011-12-29 2015-07-14 Sonos, Inc. Sound field calibration using listener localization
US8965033B2 (en) 2012-08-31 2015-02-24 Sonos, Inc. Acoustic optimization
US9537819B2 (en) * 2013-09-30 2017-01-03 Sonos, Inc. Facilitating the resolution of address conflicts in a networked media playback system
US9916839B1 (en) * 2014-03-27 2018-03-13 Amazon Technologies, Inc. Shared audio functionality based on device grouping
US9729599B2 (en) * 2014-06-04 2017-08-08 Sonos, Inc. Cloud queue access control
CA2962636A1 (en) * 2014-10-01 2016-04-07 XBrain, Inc. Voice and connection platform
US9875081B2 (en) * 2015-09-21 2018-01-23 Amazon Technologies, Inc. Device selection for providing a response
WO2017058654A1 (en) * 2015-09-28 2017-04-06 Google Inc. Time-synchronized, multizone media streaming
US9653075B1 (en) * 2015-11-06 2017-05-16 Google Inc. Voice commands across devices
US9947316B2 (en) * 2016-02-22 2018-04-17 Sonos, Inc. Voice control of a media playback system
US10095470B2 (en) * 2016-02-22 2018-10-09 Sonos, Inc. Audio response playback
US9847904B2 (en) * 2016-04-27 2017-12-19 Blackfire Research Corporation Semi-automated configuration of a low-latency multimedia playback system
US9942678B1 (en) * 2016-09-27 2018-04-10 Sonos, Inc. Audio playback settings for voice interaction


Also Published As

Publication number Publication date
WO2020006410A1 (en) 2020-01-02
CN112640475A (en) 2021-04-09
EP3815384A1 (en) 2021-05-05
CN117316150A (en) 2023-12-29

Similar Documents

Publication Publication Date Title
US11696074B2 (en) Systems and methods for associating playback devices with voice assistant services
US11797263B2 (en) Systems and methods for voice-assisted media content selection
US20240103804A1 (en) Systems and methods of receiving voice input
US11676590B2 (en) Home graph
CN111418216B (en) Media playback system with voice assistance
US20200110571A1 (en) Systems and methods for media content selection
KR102584751B1 (en) Voice control of a media playback system
US20240080621A1 (en) Device designation of playback and network microphone device arrangements
CN112640475B (en) System and method for associating playback devices with voice assistant services

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant