GB2574803A - Communication between audio devices


Info

Publication number
GB2574803A
Authority
GB (United Kingdom)
Prior art keywords
controller, loudspeaker, network, audio, content
Legal status
Granted
Application number
GB201809551A
Other versions
GB201809551D0 (en), GB2574803B (en)
Inventor
Edwards John
Current Assignee
Xmos Ltd
Original Assignee
Xmos Ltd
Application filed by Xmos Ltd
Priority to GB1809551.3A
Publication of GB201809551D0
Publication of GB2574803A
Application granted
Publication of GB2574803B
Current legal status
Active

Classifications

    • G06F3/16 Sound input; Sound output
    • G06F3/165 Management of the audio stream, e.g. setting of volume, audio stream path
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • H04B11/00 Transmission systems employing sonic, ultrasonic or infrasonic waves
    • H04R5/04 Stereophonic circuit arrangements, e.g. for selective connection of amplifier inputs/outputs to loudspeakers, for loudspeaker detection, or for adaptation of settings to personal preferences or hearing impairments
    • G08C23/02 Non-electrical signal transmission systems using infrasonic, sonic or ultrasonic waves
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L19/018 Audio watermarking, i.e. embedding inaudible data in the audio signal
    • H04R2499/11 Transducers incorporated or for use in hand-held devices, e.g. mobile phones, PDAs, cameras

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Selective Calling Equipment (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A smart-speaker unit in a set of devices comprises a loudspeaker 218, a microphone 202 for receiving voice input, and a controller 210 which submits the voice input to a speech recognition algorithm in order to recognise and execute voice commands. A network interface 212 gives the controller access to data networks and remote functionality. The controller is also configured to communicate with a controller on a first device by using the loudspeaker to emit an inaudible signal to be captured by the microphone of the other device. Also disclosed is a set of said devices wherein the controller on the first device is configured to stream a portion of content to the second device via the data network, incurring a network delay; each device emits a predetermined pattern with a predetermined timing relative to the audio it plays out, which can be used to calculate the network delay and to apply an inverse of that delay to the play-out of the content on the loudspeaker of either device. Also disclosed is a set of devices wherein the first device determines an acoustic filtering effect of the environment and causes an inverse of said effect to be applied to the audio playout.

Description

Communication Between Audio Devices
Field
The present disclosure relates to communication between audio devices such as smart-speaker units.
Background
Speech control is an increasingly popular way of controlling devices around the home, office or such like.
Microphones can be included in user devices such as smart TV sets, music players, or other smart appliances in order to allow voice control of the device. E.g. the user can speak a command to invoke a certain function of the device such as to turn the volume up or down, mute, download or stream content, or such like. The speech command is picked up by the microphone unit incorporated in (or perhaps connected to) the user device and recognized by a speech recognition algorithm, which may either be embedded in the user device itself, or hosted on a server in a network such as the internet to which the user device is connected. The user device is configured to then act on the command accordingly.
Another form of user device controlled using speech is known as a smart speaker. A smart speaker is a stand-alone unit comprising a loudspeaker, microphone and network interface (typically a wireless interface) all incorporated into the same housing of the unit. The user can speak a command which is received by the microphone on the smart-speaker unit. The received command is processed by a speech recognition algorithm in order to process and execute the command (i.e. put the command into effect). The processing required to recognize and execute the speech command may be implemented on the smart-speaker unit itself, or may be offloaded to a server, or a combination of these.
The effect of executing at least some such commands is to instigate an action via the network interface and a corresponding network (typically a wireless network such as a
WLAN), for example to download or stream audio content to the smart-speaker, or to invoke an online (Internet) service such as an online search query or to order goods or services online. Another example is to control an appliance or utility on the same local network such as lighting, heating, etc. The loudspeaker of the smart speaker unit may be used to play out a result of the command, such as to play out the downloaded or streamed audio content, or to audibly play out the information resulting from the search query, or to provide a confirmation that the command has been successfully instigated or enacted.
A smart speaker unit typically has no display screen, and no more than a small number of manual controls (e.g. buttons or switches), for instance an on/off switch and perhaps a pairing button. Thus the smart-speaker unit provides a pure voice interface by which a user can access networked functionality such as retrieving audio content, conducting online queries, or controlling appliances or systems around the home, office or such like.
Nowadays a given user may in fact have a plurality of smart-speaker units distributed throughout his or her home or office, or even within a given room. Along with any other networked smart appliances, the user can create a network of smart devices networked together via a local network, typically a wireless radio network such as a Wi-Fi or Bluetooth network, etc. For instance, audio content may be downloaded or streamed to a first one of the smart speakers from the Internet or a home entertainment system on the local network. While the first smart speaker plays out this content, it may also stream the same content live to a second smart speaker to be played out in parallel.
Summary
A number of advantageous applications are recognized herein whereby the functionality of a smart-speaker unit can be augmented by embedding an inaudible acoustic channel amongst acoustic output from the speaker of a smart-speaker unit.
According to a first aspect of the present disclosure, there is provided a device for use as a second one of a set of devices that also includes at least a first device, at least the second device taking the form of a smart-speaker unit comprising a respective:
loudspeaker;
microphone for receiving voice inputs from the user;
controller configured to submit the voice inputs received by the respective microphone to a speech recognition algorithm to recognize and execute speech commands from the voice inputs, the respective controller being further configured to control its respective loudspeaker to play out audible content to the user including at least some content determined in response to one or more of the executed speech commands; and one or more network interfaces arranged to provide the respective controller with access to one or more data networks employing one or more electronic and/or electromagnetic access technologies, the respective controller being configured to thereby access remote functionality via at least one of the one or more networks in response to one or more of the executed speech commands;
wherein the respective controller is further configured to communicate with a respective controller on the first device by controlling the respective loudspeaker of the smart-speaker unit to emit an inaudible signal to be captured by a respective microphone of the first device.
The one or more data networks preferably comprise one or more wireless networks employing one or more wireless access technologies (e.g. one or more radio networks employing one or more radio access technologies), in which case the one or more network interfaces comprise a respective wireless interface (e.g. radio interface). The respective controller may be configured to perform said transfer of data via the one of the one or more wireless networks (e.g. one of the one or more radio networks, such as a Wi-Fi or Bluetooth network). Alternatively the possibility of a wired network such as an Ethernet network is not excluded.
The remote functionality may comprise accessing a service on the Internet, via one of said data networks as an access network. For example the service could be to download or stream audio content to the respective smart-speaker or device, or to order a product online. Alternatively the remote functionality may comprise downloading or streaming audio content from another device on a local area network in the environment of the smart-speakers (e.g. another device in the home or office). As another example, the remote functionality could be to control another appliance or system, such as to control a home entertainment system; or to control lighting, heating or air conditioning in a home, office or other such environment of the user.
The playout from the respective loudspeaker in response to an executed voice command may be to play out the streamed or downloaded audio content, or to provide confirmation of the performance of the function or service.
In embodiments, the controller on the second device may be configured to control its respective loudspeaker to emit the inaudible signal with a predetermined timing relative to a portion of the audible content played out from the respective loudspeaker.
In embodiments the inaudible signal may comprise a predetermined pattern, the controller on the second device being configured to control its respective loudspeaker to emit said pattern with a predetermined timing relative to said portion of audible content.
In embodiments the controller on the second device may be configured to emit said pattern periodically.
In embodiments the controller on the second device may be configured to control its respective loudspeaker to emit the inaudible signal during said portion of audible content as played out from the respective loudspeaker.
In embodiments the inaudible signal may be an ultrasound signal.
Alternatively the inaudible signal may be an audible-frequency signal emitted at an inaudible power level relative to said portion of audio content played out by the loudspeaker of the smart speaker unit.
According to another aspect disclosed herein there is provided a system comprising the first device and the second device.
In embodiments the first device may comprise a respective loudspeaker.
In embodiments the first device may also take the form of a smart-speaker unit, wherein: the respective controller on the first device is configured to submit voice inputs received by the respective microphone to a speech recognition algorithm to recognize and execute speech commands from the voice inputs, the respective controller being further configured to control its respective loudspeaker to play out audible content to the user including at least some content determined in response to one or more of the executed speech commands; and the first device also comprises a respective one or more network interfaces arranged to provide the respective controller with access to one or more data networks employing one or more electronic and/or electromagnetic access technologies, the respective controller being configured to thereby access remote functionality via at least one of the one or more networks in response to one or more of the executed speech commands.
In embodiments:
the controller on the first device may be configured to determine said portion of content and stream it to the controller on the second device via one of the one or more data networks, wherein the streaming incurs a network delay;
the controllers on the first and second devices may be configured to control their respective loudspeakers to play out said portion of audible content in parallel with one another;
the controller on the first device may be configured to detect the predetermined pattern in an acoustic signal received through the microphone of the first device based on a comparison with a reference instance of the pattern, and thereby to determine an average value of said network delay; and the controller on the first device may be further configured to cause an inverse of the determined average delay to be applied to the play out of the audio content from the respective loudspeaker of one of the first and second devices.
In embodiments, the controller on the first device may be configured so as, if the absolute delay is positive, to send an instruction to the controller on the second device to apply the inverse of the determined delay to the play out of the audio content from the second device. The controller on the first device may be configured so as, if the absolute delay is negative, to apply the inverse of the determined delay to the play out of the audio content from the first device.
Preferably the controller of the first device is configured to perform said streaming via one of the one or more wireless networks (e.g. one of the one or more radio networks such as a Wi-Fi or Bluetooth network). Alternatively the possibility of a wired network such as an Ethernet network is not excluded. Note also, in general the network used for the streaming may or may not comprise the same network via which the respective controller transfers data in response to the at least one executed speech command.
Said portion of content may be determined by being downloaded or streamed to the first smart-speaker unit, based on one of the executed speech commands that accesses the remote functionality, e.g. the Internet or a local area network in the home or office.
In embodiments the respective controller on the first device may be configured to determine an acoustic filtering effect of an environment of the devices based on the inaudible signal received by the first device, and to cause the controller on the second device to apply an inverse of the determined filtering effect to subsequent playout of audible content from the loudspeaker of the second device.
In embodiments said supplying of the voice inputs to the speech recognition algorithm comprises supplying the voice inputs to a local instance of the speech recognition algorithm implemented in the respective controller, to perform the recognition locally by the respective controller.
Alternatively the speech recognition algorithm may be implemented on a server, and said supplying of the voice inputs to the speech recognition algorithm may comprise sending the voice inputs to the speech recognition algorithm on the server to perform the recognition at the server.
In embodiments:
the controllers on the first and second devices may each be operable to receive, through their respective microphones, a respective instance of a same one of said speech commands that accesses the remote functionality, and to measure a respective value of a property of the respective instance of the voice command indicative of a received quality thereof;
the controller on the second device may be configured to share its respective measured value with the controller of the first device via the inaudible signal; and the controller on the first device may be configured to compare the measured values to determine which of the first and second devices has received said one of the speech commands with the greater quality, and to cause the determined device to execute its respective instance of that speech command.
In embodiments said property may be an audio property.
In embodiments said audio property may comprise one of: noise floor, signal-to-noise ratio, or received audio signal level, or a metric based on one or more thereof.
Alternatively or additionally, said property could comprise a property of the speech recognition such as a recognition confidence value.
In embodiments the respective controller on the first device may be further configured to communicate with the controller on the second device by controlling the respective loudspeaker of the first device to emit a further inaudible signal. In embodiments the controller on the first device may be configured to include one or more control settings for the second device in the further inaudible signal, and by means of said further inaudible signal, to control the controller on the second device to apply the one or more control settings to the second device.
In embodiments the one or more control settings may comprise one or more network configuration settings for enabling the second device to join at least one of the one or more networks.
In embodiments the one or more network configuration settings may comprise a network ID and/or password for the second device to use to join the at least one network.
In embodiments the one or more control settings may comprise one or more audio settings for the second device to use in playing out its respective audio content.
In embodiments the one or more audio settings comprise a volume and/or equalization setting.
In embodiments the system may comprise a plurality of second devices each configured to operate in relation to the first device as recited in any of the above statements or elsewhere herein.
The first device may be a master and the second devices may be slaves of the master.
In embodiments the controller on the second device may be configured to include one or more control settings for the first device in said inaudible signal, and by means of said inaudible signal, to control the controller on the first device to apply the one or more control settings to the first device.
In embodiments the one or more control settings may comprise one or more network configuration settings for enabling the first device to join at least one of the one or more networks.
In embodiments the one or more network configuration settings may comprise a network ID and/or password for the first device to use to join the at least one network.
In embodiments the one or more control settings may comprise one or more audio settings for the first device to use in playing out respective audio content from the first device.
In embodiments the one or more audio settings may comprise a volume and/or equalization setting.
In embodiments the system may comprise a plurality of first devices each configured to be controlled by the second device as recited in any of the above statements or elsewhere herein.
The second device may be a master and the first devices may be slaves of the master.
In embodiments, the or each smart-speaker unit comprises no display screen.
In embodiments there may be provided a set of two or more smart-speaker units, each comprising a respective:
loudspeaker;
microphone for receiving voice inputs from the user;
controller configured to submit the voice inputs received by the respective microphone to a speech recognition algorithm to recognize and execute speech commands from the voice inputs, the respective controller being further configured to control its respective loudspeaker to play out audible content to the user including at least some content determined in response to one or more of the executed speech commands; and one or more network interfaces arranged to provide the respective controller with access to one or more data networks employing one or more electronic and/or electromagnetic access technologies, the respective controller being configured to thereby access remote functionality via at least one of the one or more networks in response to one or more of the executed speech commands;
wherein the respective controller on at least a second one of said smart-speaker units is further configured to communicate with the controller on a first one of said smart-speaker units by controlling the respective loudspeaker of the second smart-speaker unit to emit an inaudible signal.
In embodiments the first smart-speaker unit may be configured as recited in any of the embodiments of the first device mentioned above, and/or the second smart-speaker unit may be configured as recited in any of the embodiments of the second device mentioned above. In embodiments the first device (e.g. first smart-speaker unit) and/or second device (e.g. second smart-speaker unit) may be further configured in accordance with any of the embodiments disclosed elsewhere herein.
More generally, the various techniques disclosed herein can be applied to other types of devices, not just smart-speaker units.
According to a second aspect of the present disclosure, there is provided a set of devices comprising at least a first device and a second device, at least the first device comprising a microphone, and each of the first and second devices comprising a respective loudspeaker and controller; wherein:
the respective controller on the second device is configured to control its respective loudspeaker to play out audible content;
at least the first device comprises a network interface arranged to provide the respective controller with access to a data network employing an electronic and/or electromagnetic access technology;
the respective controller on the second device is further configured to communicate with the controller on the first device by controlling the respective loudspeaker of the second device to emit an inaudible signal comprising a predetermined pattern;
the controller on the first device is configured to stream a portion of content to the controller on the second device via the data network, wherein the streaming incurs a network delay;
the controllers on the first and second devices are configured to control their respective loudspeakers to play out said portion of audible content in parallel with one another;
the respective controller on the second device is configured to control its respective loudspeaker to emit said pattern with a predetermined timing relative to a portion of the audible content played out from the respective loudspeaker;
the controller on the first device is configured to detect the predetermined pattern in an acoustic signal received by the microphone based on a comparison with a reference instance of the pattern, and to thereby determine an average value of said network delay; and the controller on the first device is further configured to cause an inverse of the determined average delay to be applied to the play out of the audio content from the respective loudspeaker of one of the first and second devices.
According to a third aspect of the present disclosure, there is provided a set of devices comprising at least a first device and a second device, at least the first device comprising a microphone, and at least the second device comprising a loudspeaker, and each of the first and second devices comprising a respective controller; wherein:
the respective controller on the second device is configured to control its respective loudspeaker to play out audible content;
the respective controller on the second device is further configured to communicate with the controller on the first device by controlling the loudspeaker of the second device to emit an inaudible signal; and the respective controller on the first device is configured to determine an acoustic filtering effect of an environment of the devices based on the inaudible signal received by the first device, and to cause the controller on the second device to apply an inverse of the determined filtering effect to subsequent playout of audible content from the loudspeaker of the second device.
Any features or combination of features of the above-recited embodiments of the first aspect, or disclosed elsewhere herein, may equally apply to the devices of the second or third aspect. Also, in embodiments the features of the second and third aspects may be combined.
According to another aspect disclosed herein, there may be provided a method comprising the operations of the first device and/or second device. According to another aspect there may be provided a computer program product comprising code embodied on computer readable storage, and configured so as when run on one or more processors of the first device to perform the operations of the controller of the first device. According to another aspect there may be provided a computer program product comprising code embodied on computer readable storage, and configured so as when run on one or more processors of the second device to perform the operations of the controller of the second device.
Brief Description of the Drawings
To assist understanding of the present disclosure and to illustrate how embodiments of such may be put into effect, reference will be made, by way of example only, to the accompanying drawings in which:
Figure 1 is a schematic block diagram of a system of smart speakers including a master-slave architecture;
Figure 2 is a schematic block diagram of a smart speaker according to embodiments disclosed herein, including input and output processing;
Figure 3 is a schematic block diagram of a master smart-speaker unit according to embodiments disclosed herein; and
Figure 4 is a schematic block diagram of a slave smart-speaker unit according to embodiments disclosed herein.
Detailed Description of Embodiments
The present application provides a system and method which use inaudible acoustic signals for enhancing functionality such as the audio listening experience and/or voice processing capabilities of a smart speaker.
Smart speakers include both loudspeakers and microphones. According to the present disclosure, these can be used for the transmission and reception of additional information to enhance the user experience. This additional information is communicated in an acoustic channel separate from the audible playout. Such information can be transmitted using either out-of-band (ultrasonic) or very quiet (in-band) acoustic signals. Thus between them the smart speakers can be used to create an acoustic network of smart microphones and loudspeakers, which can work in combination to optimize the audio listening experience and voice processing capabilities of the smart loudspeaker. In embodiments the smart speakers may be arranged in a master-slave configuration (as shown in Figure 1, to be discussed in more detail shortly).
In embodiments, such an arrangement may be used for performing one or more of the following operations: (a) time synchronization of multiple speakers, for example when audio is streamed over a packet network such as Ethernet or Wi-Fi; (b) frequency equalization and room impulse response compensation; (c) local multiple keyword detection; (d) management and configuration (e.g. addition of new loudspeakers into the network).
Both the master device and the slave devices comprise at least one microphone and at least one loudspeaker. This allows for bi-directional communication between the master device and the slave devices, and in addition to their normal function (for example as a smart speaker) they superimpose additional inaudible information on the audible audio signal. The transmitting device combines the inaudible data with the audible signal to create a control or status data channel. The receiving device separates the audible and inaudible channels and decodes the latter for use as defined below. That is, the transmitting devices add the additional information signal to the audible audio stream, and the receiving devices listen to the acoustic signals received via the microphones and separate these into the separate audible audio and additional information paths (as shown in Figure 2, to be discussed in more detail shortly).
Figure 1 illustrates an example system in accordance with embodiments disclosed herein. The system comprises a set of smart-speaker units (smart speakers) 102. Each smart-speaker unit comprises at least one respective microphone 204 and a respective loudspeaker 218, each integrated into the same respective housing (casing) of the respective smart-speaker unit 102. In embodiments the system is arranged into a master-slave architecture, whereby one of the smart-speaker units is configured as the master 102m of the set, and the other smart-speaker unit(s) is/are configured as slaves 102s. In the illustrated example the set comprises a plurality of slaves 102s(#1)...102s(#N) under the same master 102m.
Each smart-speaker unit 102 comprises a respective one or more network interfaces (not shown) for accessing one or more networks. At least one of these networks may enable the smart-speaker units 102 to connect to one another, or at least for the slaves to connect to the master, e.g. to enable the master unit 102m to stream audio content to the slave unit(s) 102s and/or to control the slave(s) 102s. The network used for this may comprise a local area network (LAN). The network used for this may comprise a wireless network (e.g. WLAN) employing a wireless access technology, i.e. a wireless medium for the communications of the network. For instance the network in question may comprise a radio network such as a Bluetooth, Wi-Fi or Zigbee network, etc. Alternatively the possibility of a wired network, e.g. a wired LAN such as an Ethernet network, is not excluded.
Alternatively or additionally, at least one of the one or more networks may enable the smart-speaker unit 102 to access the Internet, and/or another source of data content and/or services (e.g. a home entertainment system). The network used for this may comprise a local area network. The network used for this may comprise a wireless network (e.g. WLAN) employing a wireless access technology. For instance the network in question may comprise a radio network such as a Bluetooth, Wi-Fi or Zigbee network, etc.
Alternatively the possibility of a wired network, e.g. a wired LAN such as an Ethernet network, is again not excluded.
The network used to access the Internet, or other source of data or services, may be the same or different than the network used to connect the smart-speakers 102 to one another. Embodiments below may be described as using the same network for these purposes, but it will be appreciated that any such embodiments could instead use different networks for the different purposes, or even for the same purpose at different times. Or equivalently, one could consider that the described data network comprises one or more constituent networks employing one or more access technologies (for instance one or more wireless access technologies, e.g. one or more radio access technologies such as Wi-Fi, Bluetooth or Zigbee).
The role of master may comprise the master unit 102m being configured to determine audio content to be played out by one of the slaves 102s. The role of master may comprise the master unit 102m being configured to control one of the slaves 102s to play out the audio content determined by the master 102m. The role of master may comprise the master unit 102m being configured to stream the determined content to the slaves 102s to be played out. The role of master may comprise the master unit 102m being configured to control the slave(s) 102s to play out the streamed content in parallel with the master unit 102m.
Alternatively or additionally, the role of master may comprise the master unit 102m being configured to control a state of the slave(s) 102s, e.g. to apply one or more control settings. This may for example comprise configuring one or more network configuration settings of the slave(s) 102s, such as to join the slave(s) 102s to the network or networks. As another example, the master may be responsible for controlling one or more audio settings of its slave(s) 102s, such as volume, audio equalization (e.g. bass, mid, treble), and/or mute.
Figure 2 illustrates an example architecture for implementing each of the smart-speaker units 102. The smart-speaker unit 102 comprises: a microphone apparatus 202, a microphone processing module 206, an incoming audio processing module 208, a controller 210, at least one network interface 212, an outgoing audio processing module 214, an adder 216, and at least one loudspeaker 218. The microphone processing module 206 is operatively coupled to the microphone apparatus 202. The incoming audio processing module 208 is operatively coupled to the microphone processing module 206 and the network interface 212. The controller 210 is operatively coupled to the microphone processing module 206, the incoming audio processing module 208, the outgoing audio processing module 214, the network interface 212 and the adder 216. The network interface 212 is operatively coupled to the incoming audio processing module 208, the controller 210 and the outgoing audio processing module 214. The outgoing audio processing module 214 is operatively coupled to the network interface 212, the controller 210 and the adder 216. The adder 216 is operatively coupled to the controller 210, the outgoing audio processing module 214 and the loudspeaker 218.
Each of the microphone processing module 206, incoming audio processing module 208, controller 210, outgoing audio processing module 214 and adder 216 may be implemented in the form of software code embodied in a memory on the smart-speaker unit 102 and arranged to run on a processing apparatus on the smart-speaker unit 102. The memory in which the code is stored may comprise one or more memory units employing one or more memory media, e.g. an electronic memory medium such as an EEPROM, and/or a magnetic memory medium such as a hard disk. The processing apparatus on which the code is arranged to run may comprise one or more processing units, e.g. a general purpose CPU or a dedicated signal processing unit or application specific processing unit. Alternatively, it is not excluded that one, some or all of these modules 206, 208, 210, 214, 216 could be implemented partially or wholly in the form of dedicated circuitry, or a PGA or FPGA, or any combination of any of the described approaches.
The microphone apparatus 202 comprises one or more individual microphones 204 arranged to capture audio signals from an environment of the smart-speaker unit 102 (e.g. from the room). In the case of multiple microphones 204, these may be arranged in the form of a microphone array for performing receive beamforming.
The microphone processing module 206 is arranged to receive the audio signals captured by the microphone(s) 204. It is configured to perform a frequency band separation whereby it separates out an audible frequency component and an ultrasound component from the signals received from the microphone(s) 204. The microphone processing module 206 is arranged to pass the audible frequency components to the incoming audio processing module 208 for further audio processing, and to pass the ultrasound components to the controller 210. For the present purposes, audible frequency range means in the human audible range, whereas ultrasound means above the human audible frequency range (at least 20kHz, and in embodiments the ultrasound component may be in a higher range, e.g. greater than 28kHz, greater than 50kHz or greater than 100kHz). The microphone processing module 206 may also be configured to perform other microphone-related processing operations on the received microphone signals, for example a beamforming operation, as will be familiar to a person skilled in the art. In the case of receive beamforming, the microphone processing module 206 is preferably arranged to separate out the ultrasound component prior to the receive beamforming, since this may otherwise hinder the reception of the desired ultrasound components.
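By way of illustration only, the band separation performed by the microphone processing module 206 might be sketched as follows in Python (using numpy/scipy; the 96 kHz sample rate, the 20 kHz crossover and all names are assumptions for illustration, not taken from this disclosure):

    import numpy as np
    from scipy.signal import butter, sosfilt

    FS = 96_000          # assumed sample rate (Hz); must exceed twice the ultrasound band
    CROSSOVER = 20_000   # nominal audible/ultrasound boundary (Hz)

    # Complementary 8th-order Butterworth filters around the crossover.
    _lp = butter(8, CROSSOVER, btype="lowpass", fs=FS, output="sos")
    _hp = butter(8, CROSSOVER, btype="highpass", fs=FS, output="sos")

    def separate_bands(mic_samples: np.ndarray):
        """Split a microphone capture into (audible, ultrasound) components."""
        audible = sosfilt(_lp, mic_samples)     # to incoming audio processing 208
        ultrasound = sosfilt(_hp, mic_samples)  # to the controller 210
        return audible, ultrasound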
The controller 210 is arranged to receive the separated-out ultrasound component from the microphone processing module 206, and to extract information therefrom. The controller 210 is configured to perform one or more control-related operations based on information extracted from the received ultrasound component, examples of which will be discussed in more detail shortly.
The incoming audio processing module 208 is configured to perform audio processing on the separated-out audio components, under the control of the controller 210. This may include a speech recognition algorithm to detect speech commands in the audio components. Speech commands comprise words spoken by a user in a natural language. The controller 210 is configured to execute the detected speech commands detected by the speech recognition algorithm, i.e. to trigger one or more functions specified by the speech command. This may comprise triggering one or more actions to be performed via the network interface 212 and the corresponding data network (e.g. WLAN, which may comprise a radio network such as a Wi-Fi, Bluetooth or Zigbee network).
The (at least one) network interface 212 is arranged to provide the controller 210 with access to the (at least one) data network using (at least one) access technology. The action triggered by the controller 210 in response to a speech command may comprise, for example (depending on the speech command): accessing a service via the Internet, such as to perform a search query or order a product online. In this case the WLAN is acting as an access network via which the controller 210 is configured to access another network (the Internet) for performing the service in question. As another example of executing a speech command, the controller 210 may (again depending on the speech command) control an appliance or system around the home or office, for example to turn on or off, mute or control the volume of a television or home entertainment system, or to control the television or entertainment system to play out certain media such as a certain TV channel, video or piece of music. Another example would be to control the lighting, heating or air conditioning in the home, office or other such environment in which the smart-speakers are deployed. In yet further examples, the execution of a speech command may comprise retrieving audible content via the data network (e.g. WLAN) to play out via the smart-speaker unit 102. This may comprise downloading or streaming content from another device on the same WLAN, e.g. a music player device, TV, home entertainment system, or storage device; or downloading or streaming content from the Internet to the smart-speaker unit using the WLAN as an access network.
The incoming audio processing by the audio processing module 208 may also include one or more other audio processing operations, such as to adapt a frequency profile of the received audio.
The controller 210 is further configured to control the outgoing audio processing module 214 to play out audible content to the user through the loudspeaker 218, including at least some content played out in response to one or more of the speech commands. This played-out content may comprise an audible acknowledgement that the speech command has been successfully recognized, instigated (execution begun) or enacted (execution complete).
As another example, the played-out content may comprise the audio content retrieved via the data network (e.g. WLAN) in response to the speech command, such as the audio content streamed or downloaded from another device on the WLAN or from the Internet.
The outgoing audio processing by the outgoing audio processing module 214 may also include one or more other audio processing operations, such as equalization (e.g. to control treble, mid or bass, or the frequency profile generally), or control of the volume of the played-out audio. Such operations may be controlled by the controller 210.
The outgoing audio processing module 214 is arranged to pass the outgoing audio content (the content to be played out) to the loudspeaker 218 via the adder 216. In the case where the smart-speaker unit 102 has no ultrasound signal to transmit, the adder adds nothing and the outgoing audio content is simply passed straight to the loudspeaker 218 for play out.
However, when the smart-speaker unit does have an ultrasound signal to transmit, the controller 210 generates the ultrasound signal and supplies it to the other input of the adder 216 to be added to the outgoing audio content from the audio processing module 214 before play-out. The adder 216 adds the audio content and ultrasound signal, such that the acoustic output from the loudspeaker comprises a superposition of the audible content and the inaudible ultrasound signal. Various example uses of this ultrasound signal will be described shortly. As will also be discussed shortly, in variants the ultrasound signal may be generalized to other forms of inaudible signal.
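A minimal sketch of this transmit side, assuming a simple on-off-keyed ultrasound carrier (the 24 kHz carrier, bit rate and function names are illustrative assumptions; this disclosure does not prescribe a particular modulation scheme):

    import numpy as np

    FS = 96_000
    CARRIER_HZ = 24_000   # assumed carrier above the human audible range
    BIT_SAMPLES = 960     # 10 ms per bit at 96 kHz (illustrative data rate)

    def ultrasound_ook(bits, amplitude=0.05):
        """Encode bits as on-off-keyed bursts of an ultrasound carrier."""
        t = np.arange(BIT_SAMPLES) / FS
        burst = amplitude * np.sin(2 * np.pi * CARRIER_HZ * t)
        return np.concatenate([burst if b else np.zeros(BIT_SAMPLES) for b in bits])

    def adder_216(audio_out, ultrasound=None):
        """Superimpose the inaudible signal on the outgoing audio (adder 216).
        With no ultrasound to transmit, the audio passes through unchanged."""
        if ultrasound is None:
            return audio_out
        mixed = audio_out.copy()
        n = min(len(mixed), len(ultrasound))
        mixed[:n] += ultrasound[:n]
        return np.clip(mixed, -1.0, 1.0)  # guard against clipping after the add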
The controller 210 on one smart-speaker unit 102 may be configured to control the audio content played out by one or more others of the smart-speaker units 102. For instance this may comprise the controller 210 on the master 102m controlling the controller 210 on one or more of the slaves 102s to control their audio play out. This may comprise specifying what content to play, and/or setting one or more audio settings such as volume, mute or equalization (e.g. bass, mid or treble, or the frequency profile generally). This control may be performed via the network interfaces 212 and data network (e.g. WLAN), or via the ultrasound channel, or a combination. For instance, in embodiments the controller 210 on the controlling (e.g. master) smart-speaker unit 102m may control its local audio processing module 214 to play out certain audio content, and in parallel may stream this content to the controller(s) 210 on one or more of the slave units 102s, along with an instruction to play out that same content in parallel with the master 102m. In response, the controller 210 on each of the one or more slaves 102s controls its respective audio processing module to play out the streamed content. As will be discussed in more detail shortly, the controller 210 on the master 102m may also perform a time synchronization between the audio played out from the master 102m and slave(s) 102s.
Note that the microphone processing module 206, the audio processing modules 208, 214, and the adder 216 are illustrated in Figure 2 as separate modules operating under control of the controller 210. However the figure is only schematic and any one, more or all of these modules may also equivalently be considered part of the controller 210.
Preferably each of the master 102m and slave units 102s is configured to be able to both transmit and receive ultrasound signals (or more generally inaudible signals), as illustrated in Figure 2 and as described in relation thereto, for instance to enable the master 102m to communicate with the slaves and to enable the slaves 102s to communicate with the master (and optionally even to allow the slaves 102s to communicate with one another). However, in alternative embodiments it is not excluded that only one or some of the smart-speaker units 102 has/have the ability to transmit ultrasound (or inaudible signals) and only one or some has/have the ability to receive the ultrasound (or more generally the inaudible signals). For instance the master 102m may only transmit the ultrasound signals for reception by the slaves, and the slaves 102s may only be able to receive the ultrasound signals. In such embodiments it will be understood that only the relevant components and functionality described in relation to Figure 2 need be implemented on the transmitting or receiving smart-speaker unit 102 respectively.
Figure 2 illustrates embodiments wherein the inaudible acoustic channel is implemented by means of ultrasound, i.e. in a different acoustic frequency band than the audible play-out. However, note that this is not the only possible implementation of the inaudible channel. In other implementations the inaudible channel may alternatively be implemented by embedding an audible range signal in the played-out audible content, wherein the embedded signal is below the lower decibel limit of human dynamic hearing relative to the currently played-out audio content (approximately -80dB relative to the audio content). Figure 2 and elsewhere herein may be exemplified in terms of the inaudible acoustic signal being implemented as an ultrasound signal, but this is not limiting, and any of the techniques disclosed herein can alternatively be implemented using a low-power audible range signal, or a combination of this and an ultrasound signal. Thus in Figure 2 for example, more generally the microphone processing module 206 may be configured to separate the audible and inaudible signals and pass the inaudible signals to the controller 210, and the controller 210 may be configured to extract the information from these inaudible signals in order to perform the one or more control-related operations. And/or, when the controller 210 has an inaudible signal to transmit, the controller 210 may generate this signal and the adder 216 may add this to the audio signal before play-out from the loudspeaker 218 (but when the controller 210 has no inaudible signal to transmit, the adder 216 adds nothing).
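For the in-band variant, a hedged sketch of embedding a pattern roughly 80 dB below the level of the currently played-out content follows (the block-wise RMS scaling is an illustrative choice, not something specified by this disclosure):

    import numpy as np

    def embed_inband(audio_out, pattern, rel_db=-80.0):
        """Add an audible-band pattern at ~80 dB below the content level."""
        n = min(len(audio_out), len(pattern))
        rms = np.sqrt(np.mean(audio_out[:n] ** 2)) + 1e-12   # current content level
        scale = rms * 10.0 ** (rel_db / 20.0)                # -80 dB relative to it
        out = audio_out.copy()
        out[:n] += scale * pattern[:n] / (np.max(np.abs(pattern[:n])) + 1e-12)
        return out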
Further, while embodiments herein are described in terms of a master-slave relationship, this is not essential. Other possible implementations may instead employ a negotiated protocol where all the smart-speaker units have equal status and negotiate with one another to determine the outcome in any given scenario (where the negotiation may be via the data network and network interfaces 212, or via the inaudible acoustic channel).
Note also that while the above, or other examples herein, may be described in terms of the speech recognition algorithm being part of the incoming audio processing module 208 implemented on-board the smart-speaker unit 102, in alternative implementations the speech recognition may instead be offloaded to a server (not shown). In this case the controller 210 on the respective smart-speaker unit executes the speech command by sending the incoming audio components to the server, via the network interface 212 and local data network (e.g. WLAN, such as a Wi-Fi network, etc.) and invoking the server to perform the speech recognition and execute the speech command accordingly. The server may for example be accessed on the Internet, via the local data network as an access network. Anywhere herein that describes the recognition and execution of speech commands being performed by the controller 210 on a smart-speaker unit 102, this may be implemented by performing the speech recognition and command execution locally or by having the controller 210 offload either or both of these to a server.
In yet further variants, the master unit 102m need not necessarily be a dedicated smart-speaker unit. Instead the master unit 102m may be another type of user device with a speaker, such as a music player, TV, set-top box or home entertainment system.
The following now describes a number of example techniques that may be implemented using the inaudible acoustic channel (e.g. the ultrasound channel as described in relation to Figure 2). Any of these techniques may be used either alone or in any combination with one another.
A first exemplary use is for time synchronization. This relates to the scenario where the controller 210 on a first smart-speaker unit 102 (e.g. the master 102m) both i) controls its own local outgoing audio processing module 214 to play out audio content via its respective local loudspeaker 218, and at the same time ii) live-streams the same content to a second, remote smart-speaker unit 102 (e.g. a slave 102s) and controls the controller 210 on that second smart-speaker unit 102s to play out that same content in parallel (via its respective outgoing audio processing module 214 and loudspeaker 218). The audio content may for example be streamed to the master 102m from another source such as the Internet or a local device on the same WLAN, and is then forwarded onward from the first smart-speaker unit 102m to the second 102s in parallel. Alternatively the audio content being played out could be pre-stored on the first smart-speaker unit. Either way, this parallel play-out may be triggered by a speech command captured by the first smart-speaker unit. In other embodiments, the master device 102m itself could be a source device other than a dedicated smart-speaker unit, streaming audio content directly to the slave smart-speaker unit(s) 102s. E.g. the master unit 102m could be a music player, TV or home entertainment system.
When audio streams are transmitted over digital networks (for example Ethernet, Wi-Fi, Bluetooth, etc.) then each digital path adds a different network delay to the audio stream. This delay may occur for example due to packet loss and the consequential need for retransmission of some packets (the delay due to the speed of light is considered negligible for the present purposes). This delay has both an absolute component (average time offset) and a relative delay component (the jitter about this average). Most networks operate on a best-effort method, i.e. there is no prioritization of data packets, so delays can also depend on what other data is being transmitted on the network. The other big unknown is the absolute delay over different wireless networks. Each wireless protocol uses a different frame length and bandwidth allocation method, so absolute delays will vary between the different wireless protocols.
The relative delay between packets received is usually compensated for, in the incoming audio processing 208 of the smart-speaker 102, by the use of a jitter buffer. The jitter buffer must be long enough that it can accommodate the maximum packet delay variation in the receiving stream, but it does not compensate for the average absolute delay through the network. As a consequence, when sending audio streams to multiple receivers the audio streams will all have different average absolute delays, which results in the audio streams playing out at different absolute times. While there are network-based solutions for compensating for these delays (e.g. NTP and IEEE 1588), they do not work over all digital networks, for example Bluetooth.
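The role of the jitter buffer can be illustrated with a minimal sketch (the depth and interface are assumptions for illustration): it deliberately adds a constant latency, which is exactly why the average absolute delay described above is left uncorrected.

    from collections import deque

    class JitterBuffer:
        """Hold packets until a fixed depth is reached, absorbing delay
        variation of up to (depth x packet duration) at the cost of a
        constant added latency; the average absolute network delay
        remains uncompensated."""

        def __init__(self, depth_packets=5):
            self.depth = depth_packets
            self.queue = deque()
            self.primed = False

        def push(self, packet):
            self.queue.append(packet)
            if len(self.queue) >= self.depth:
                self.primed = True   # enough margin built up to start playout

        def pop(self):
            if self.primed and self.queue:
                return self.queue.popleft()
            return None              # underrun or still priming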
Embodiments of the present disclosure support time synchronization using the following method. As shown in Figures 3 and 4, the incoming audio processing module 208 of the master smart-speaker unit 102m comprises a correlator 302. The outgoing audio processing module 214 on each of the master and slave units 102m, 102s comprises a respective delay compensator 304, e.g. this may be implemented in the form of a first-in, first-out (FIFO) buffer.
The controller 210 on each of the slave smart-speaker units 102s is configured to transmit a known pattern in the inaudible channel, e.g. by means of the ultrasound transmission described in relation to Figure 2. Preferably the pattern is transmitted periodically, i.e. at regular intervals in time. The controller 210 on the master unit 102m, which may or may not itself be a smart-speaker unit, listens to the inaudible audio stream and passes it to the correlator 302, which processes the received data in order to listen for the known pattern. In the case of multiple slave units 102s, the controller 210 on each of the slaves 102s emits its respective inaudible signal in a different time slot (time-division multiplexing) such that the master unit 102m can distinguish between the instances of the pattern received from the different slave units 102s. Alternatively any other form of multiplexing could be used, such as by having each slave 102s emit its signal with a different respective ultrasound frequency (frequency-division multiplexing), or a different respective pattern (code-division multiplexing).
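The time-division multiplexing of the slaves' patterns could, for illustration, be as simple as assigning each slave a fixed slot within the repeating emission period (a sketch under assumed names):

    def slot_start_sample(slave_index: int, period_samples: int, n_slaves: int) -> int:
        """Give each slave a distinct, non-overlapping emission slot within
        the repeating pattern period, so the master can attribute each
        received instance of the pattern to the correct slave."""
        slot_len = period_samples // n_slaves
        return slave_index * slot_len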
The controller 210 on the slave 102s is configured to emit the pattern in its inaudible signal with a predetermined timing relative to the audio content that it is playing out via its respective loudspeaker 218 (the content being streamed from the master 102m). The inaudible signal including this predetermined pattern is received by the correlator 302 of the master 102m via its respective microphone apparatus 202. The controller 210 on the master 102m performs a correlation by supplying a reference signal to the correlator 302 to be correlated with the incoming acoustic signal. The reference signal comprises a predetermined instance of the known pattern, held locally on the master unit (either stored in a non-volatile memory on the master unit 102m, or held temporarily in a volatile memory having been downloaded from another source such as a server on the Internet).
By correlating the incoming signal with the reference signal, the correlator 302 can detect the known pattern in the incoming signal. The correlation may for example comprise a time or frequency domain correlation. Generally the correlator 302 can be any means for comparing two signals and, by inference, determining their relative time or frequency differences; it may also be referred to as a comparator. Note also that this can be used for both time delay and frequency compensation (see later).
From detection of the known pattern, the correlator 302 on the master 102m calculates a time offset relative to the master's own audio output. From this it determines the absolute network delay, i.e. the average delay (e.g. the mean, median or modal delay). That is, the delay minus the effect of jitter (the absolute delay about which the jitter occurs). The absolute delay will be constant because any timing variation caused by the network will be absorbed in the jitter buffer used in the network interface.
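By way of illustration, a minimal sketch of this correlation step follows (the names and the use of a simple time-domain cross-correlation are assumptions - the correlator 302 could equally operate in the frequency domain):

import numpy as np

def estimate_offset_seconds(mic_capture, reference_pattern, fs):
    """Locate the known pattern in the microphone capture by
    cross-correlation and return its time offset in seconds
    (positive when the capture lags the reference; may be negative)."""
    corr = np.correlate(mic_capture, reference_pattern, mode="full")
    peak = int(np.argmax(np.abs(corr)))
    lag = peak - (len(reference_pattern) - 1)   # lag in samples
    return lag / fs

# Averaging the offset over several periodic emissions of the pattern
# yields the average (absolute) delay about which the jitter occurs, e.g.:
#   offsets = [estimate_offset_seconds(c, pattern, fs) for c in captures]
#   absolute_delay = float(np.median(offsets))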
If the delay is positive then the controller 210 on the master device 102m will send this delay offset to the controller(s) 210 on the slave device(s) 102s, which can then use this value to compensate for any network delays applied to the audible channel, using a FIFO buffer 304 (see again Figure 4). Where the delay is negative (i.e. the master device 102m is running ahead of the slaves 102s), the controller 210 on the master device 102m will apply the delay to its own audible stream, using its local FIFO buffer 304 (Figure 3). The control signals that are used to adjust the timing of the slave remote devices can be routed over the digital data network (via the network interfaces 212) or alternatively via the inaudible acoustic channel.
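The resulting arbitration logic can be sketched as follows (a hypothetical API: link.send and set_delay_seconds stand in for whatever transport and FIFO interface a particular implementation provides):

def reconcile_delay(offset_s, master_fifo, slave_links):
    """Apply the measured absolute delay offset, per the scheme above.

    A positive offset (slave lagging) is sent to the slaves so that their
    FIFO buffers 304 can compensate; a negative offset (master ahead) is
    absorbed locally by the master's own FIFO buffer 304."""
    if offset_s > 0:
        for link in slave_links:      # data network or inaudible channel
            link.send({"cmd": "set_delay", "seconds": offset_s})
    elif offset_s < 0:
        master_fifo.set_delay_seconds(-offset_s)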
A second example use of the inaudible acoustic channel is for acoustic compensation.
The master device 102m can also analyse the received inaudible signal for equalization, echo and reverberation effects; the results can then be sent to the slave speakers so that pre-compensation can be applied to the audible signal prior to playout from the speaker. In this case the controller 210 on the master 102m is configured to perform a frequency domain transform (e.g. Fourier transform) on the received pattern in the inaudible signal in order to determine its received frequency profile (spectrum), e.g. power spectral density. By comparing with a reference spectrum (i.e. the known transmitted spectrum of the predetermined pattern), the controller 210 on the master 102m can thus determine a filtering effect of the environment (e.g. room). In the architecture of Figure 3 this may be a function of the correlator 302 (on the master 102m). The controller 210 on the master 102m can then control the controllers 210 on the slaves 102s to apply an inverse of this filtering effect (wherein this control may be via the inaudible channel or may be via the data network).
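A minimal sketch of this spectral comparison follows, assuming the received and reference patterns are time-aligned and of equal length; the regularised inverse shown is one common choice, not a method prescribed by the disclosure:

import numpy as np

def room_response_and_inverse(received, reference, eps=1e-6):
    """Estimate the room's filtering effect from one emission of the known
    pattern, plus a regularised inverse usable as pre-compensation."""
    H_room = np.fft.rfft(received) / (np.fft.rfft(reference) + eps)
    # Regularised 1/H (Wiener-style) to avoid amplifying near-zero bins:
    H_inverse = np.conj(H_room) / (np.abs(H_room) ** 2 + eps)
    return H_room, H_inverse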
The experienced filtering effect of the room or environment is typically a function of the frequency of the acoustic signals. Therefore the above works best in embodiments where the inaudible signal embedded in the audio content is an audible-frequency-range signal (in-band signal). However, in alternative embodiments the inaudible signal may still be the ultrasound signal (out-of-band signal), as there will typically be some correlation between the filtering effect on the audio and ultrasound frequencies (but not as much as when using in-band frequencies).
The experienced filtering effect of the room or environment may also be a function of the position of the listener. However, again there may be some correlation between the filtering effect experienced at the position of the master device 102m and the filtering effect experienced at the position of the user.
Note: the hidden control signal allows better acoustic compensation than just listening to the raw audio because, for example, if the same audio stream is being played out of multiple speakers then there is no way to know which one needs the compensation added, since the master controller cannot differentiate the signals sent from each of the independent source speakers.
A third example use of the inaudible acoustic channel is for local multiple keyword detection and selection.
Smart speakers play out audio as well as receive voice commands from users. These voice commands are typically processed locally on-board the smart-speaker unit 102 that receives the command, i.e. via a local instance of the speech recognition algorithm implemented in its respective audio processing module 208. For example, one or more first commands may be processed locally to enhance the received voice command, and then the controller 210 waits for a second command in the form of a keyword (i.e. trigger word) that tells the smart speaker to further process the incoming voice signal. Typically this additional incoming voice signal will be a command that is sent to a cloud service to ask the cloud service to perform a task.
With the rapid uptake of smart speakers, many people now have multiple smart speakers in a single household (or other environment, e.g. office). If a user speaks a command when there is a network of multiple smart speakers present, then one or more of the devices will detect and respond to the request. Being able to detect multiple voice requests to these devices is important to ensure that only one device responds to the request. For example, if the user asks the smart speakers to purchase an item via the cloud service it is important that only one smart speaker responds to the request to ensure that only one item is purchased, not multiple items.
While there are network-based solutions to this problem, a local solution for detecting multiple keyword detection events would enhance the user experience by ensuring that only the request with the highest audio quality is sent to the cloud service for processing. Accordingly, embodiments of the present disclosure provide a system which detects multiple instances of a command such as a keyword locally, to ensure that only one command is forwarded to the cloud service.
When the controller 210 on a slave smart-speaker unit 102s receives and detects a certain command via its local speech recognition algorithm (part of its audio processing module 208), it sends a detection notification via the inaudible acoustic channel (e.g. ultrasound) to the controller 210 on the master unit 102m. For example the detected command may be a keyword whose execution triggers the carrying out of one or more preceding commands. Alternatively it is not excluded that the command in question could be a stand-alone command. Either way, the command is one that will invoke some remote function to be performed via the network interface 212 and local data network (and in embodiments also via another network such as the Internet). For example the command could be to order a product such as an item or service via the internet. It is also possible that two or more of the smart-speaker units 102 in the same environment (e.g. same room, home or office) all receive and detect the same command. If the command is being detected and executed locally, it is important that more than one of the smart-speaker units does not duplicate the execution of the command (e.g. make multiple instances of the same requested purchase over the Internet).
To address this scenario, according to embodiments of the present disclosure, the controller 210 on each slave smart speaker 102s will send its detection notification over the inaudible channel to the master device 102m. This notification may also include a measure of a property related to the received quality of the respective instance of the received command. This may comprise an audio property such as noise floor, signal-to-noise ratio, received audio signal level (e.g. received amplitude or power), etc. Alternatively or additionally, the measured property may comprise a measure of detection confidence by the respective instance of the speech recognition algorithm.
The controller 210 on the master device 102m receives all of the detection notifications (e.g. keyword detection notifications) and then, based on the respective reports of received quality, selects the smart-speaker with the highest quality. If this is the master, the master 102m executes the command itself. If the unit with the best received quality is one of the slaves 102s, the controller 210 on the master 102m sends an instruction over the inaudible channel to the controller 210 on the selected slave 102s, instructing it to execute its received instance of the command. For instance this may comprise the slave 102s sending the recognized voice request to the relevant cloud service.
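A sketch of the selection step is given below; the notification format shown is a hypothetical example, since the disclosure does not define a particular one:

def pick_responder(detections):
    """Choose the single device that should forward the voice request.

    detections is a list of dicts such as
        {"device": "slave-2", "snr_db": 18.5, "confidence": 0.93},
    one per smart speaker that heard the command. Higher recognition
    confidence wins, with SNR as the tie-breaker, so the cloud request
    is issued exactly once."""
    best = max(detections, key=lambda d: (d["confidence"], d["snr_db"]))
    return best["device"]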
A fourth possible use of the inaudible acoustic channel is for control and management.
In this case the controller 210 on the master unit 102m uses the inaudible acoustic channel to send one or more control settings to the controller 210 on one or more of the slaves 102s, and thereby to control them to apply the specified settings to the respective smart speakers. The specified control settings may comprise network configuration settings for enabling the slave 102s to join the local data network (e.g. WLAN) via its respective network interface 212, or to further configure the slave 102s within the network once joined. Alternatively the specified control settings may comprise audio settings to apply to the play out of audio content from the respective loudspeaker 218 of the slave 102s.
In a system containing multiple smart speakers there may be a requirement to aggregate these devices into a network so that they operate as a single combined entity. There are three main components to this: I) creating the network, II) adding new devices to the active network, III) controlling and managing the active network. Creating the network and adding devices to the network can be a complex task for the user. Embodiments of the present disclosure implement a local data path that allows for easy setup and configuration using the inaudible acoustic channel. During setup and configuration, the master controller sends network configuration information to the slave devices and receives status updates from the slaves. Configuration information such as Ethernet or Wi-Fi SSID and password details will allow new slave loudspeakers 102s to be added to the digital network with very little interaction from the user. An example of how this could be used is that a registration button on each device 102 could be used so that the master 102m and slave devices 102s are set into pairing mode, and this would allow the new slave device 102s to be added to the network.
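For illustration only, a configuration message of this kind might be framed as follows before being modulated onto the inaudible channel (the JSON framing and field names are assumptions; the disclosure does not specify a wire format):

import json

def network_config_payload(ssid, password):
    """Build a pairing-mode configuration message for a new slave.

    The returned bytes would then be modulated onto the inaudible
    acoustic channel by the master during setup."""
    return json.dumps({
        "type": "net_config",
        "ssid": ssid,
        "password": password,
    }).encode("utf-8")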
Once the network is active, the same inaudible data channel can be used by the master to control and manage the entire network. For example, volume control and equalization (e.g. bass, middle and treble controls) can be applied across the network. At the same time, slave devices 102s can report back status information to the master device (e.g. microphone mute activated).
In any of the above uses of the inaudible acoustic channel, or in other use cases, in embodiments it is also possible to create a mesh network whereby remote speakers can monitor other slave speakers' audio, for example in other rooms, and communicate the received delays back to the central master, using a network or the hidden audio channel.
It will be appreciated that the above embodiments have been described by way of example only. Other variants or applications may become apparent to a person skilled in the art once given the disclosure herein. The scope of the present disclosure is not limited by the above-described embodiments but only by the accompanying claims.

Claims (33)

Claims
1. A device for use as a second one of a set of devices that also includes at least a first device, at least the second device taking the form of a smart-speaker unit comprising a respective:
loudspeaker;
microphone for receiving voice inputs from the user;
controller configured to submit the voice inputs received by the respective microphone to a speech recognition algorithm to recognize and execute speech commands from the voice inputs, the respective controller being further configured to control its respective loudspeaker to play out audible content to the user including at least some content determined in response to one or more of the executed speech commands; and one or more network interfaces arranged to provide the respective controller with access to one or more data networks employing one or more electronic and/or electromagnetic access technologies, the respective controller being configured to thereby access remote functionality via at least one of the one or more networks in response to one or more of the executed speech commands;
wherein the respective controller is further configured to communicate with a respective controller on the first device by controlling the respective loudspeaker of the smart-speaker unit to emit an inaudible signal to be captured by a respective microphone of the first device.
2. The second device of claim 1, wherein the controller on the second device is configured to control its respective loudspeaker to emit the inaudible signal with a predetermined timing relative to a portion of the audible content played out from the respective loudspeaker.
3. The second device of claim 2, wherein the inaudible signal comprises a predetermined pattern, the controller on the second device being configured to control its respective loudspeaker to emit said pattern with a predetermined timing relative to said portion of audible content.
4. The second device of claim 3, wherein the controller on the second device is configured to emit said pattern periodically.
5. The second device of claim 2, 3 or 4, wherein the controller on the second device is configured to control its respective loudspeaker to emit the inaudible signal during said portion of audible content as played out from the respective loudspeaker.
6. The second device of any preceding claim, wherein the inaudible signal is an ultrasound signal.
7. The second device of any of claims 2 to 5, wherein the inaudible signal is an audible-frequency signal emitted at an inaudible power level relative to said portion of audio content played out by the loudspeaker of the smart-speaker unit.
8. A system comprising the first device and the second device of any preceding claim.
9. The system of claim 8, wherein the first device comprises a respective loudspeaker.
10. The system of claim 9, wherein the first device also takes the form of a smart-speaker unit, wherein:
the respective controller on the first device is configured to submit voice inputs received by the respective microphone to a speech recognition algorithm to recognize and execute speech commands from the voice inputs, the respective controller being further configured to control its respective loudspeaker to play out audible content to the user including at least some content determined in response to one or more of the executed speech commands; and the first device also comprises a respective one or more network interfaces arranged to provide the respective controller with access to one or more data networks employing one or more electronic and/or electromagnetic access technologies, the respective controller being configured to thereby access remote functionality via at least one of the one or more networks in response to one or more of the executed speech commands.
11. The system of claim 8, 9 or 10, as dependent on at least claim 3, wherein: the controller on the first device is configured to determine said portion of content and stream it to the controller on the second device via one of the one or more data networks, wherein the streaming incurs a network delay;
the controllers on the first and second devices are configured to control their respective loudspeakers to play out said portion of audible content in parallel with one another;
the controller on the first device is configured to detect the predetermined pattern in an acoustic signal received through the microphone of the first device based on a comparison with a reference instance of the pattern, and thereby to determine an average value of said network delay; and the controller on the first device is further configured to cause an inverse of the determined average delay to be applied to the play out of the audio content from the respective loudspeaker of one of the first and second devices.
12. The system of claim 11, wherein:
the controller on the first device is configured so as, if the absolute delay is positive, to send an instruction to the controller on the second device to apply the inverse of the determined delay to the play out of the audio content from the second device.
13. The system of claim 11 or 12, wherein:
the controller on the first device is configured so as, if the absolute delay is negative, to apply the inverse of the determined delay to the play out of the audio content from the first device.
14. The system of any of claims 8 to 13, wherein:
the respective controller on the first device is configured to determine an acoustic filtering effect of an environment of the devices based on the inaudible signal received by the first device, and to cause the controller on the second device to apply an inverse of the determined filtering effect to subsequent playout of audible content from the loudspeaker of the second device.
15. The second device or system of any preceding claim, wherein said supplying of the voice inputs to the speech recognition algorithm comprises supplying the voice inputs to a local instance of the speech recognition algorithm implemented in the respective controller, to perform the recognition locally by the respective controller.
16. The second device or system of any of claims 1 to 14, wherein the speech recognition algorithm is implemented on a server, and said supplying of the voice inputs to the speech recognition algorithm comprises sending the voice inputs to the speech recognition algorithm on the server to perform the recognition at the server.
17. The system of claim 16 as dependent on at least claim 10, wherein:
the controllers on the first and second devices are each operable to receive, through their respective microphones, a respective instance of a same one of said speech commands that accesses the remote functionality, and to measure a respective value of a property of the respective instance of the voice command indicative of a received quality thereof;
the controller on the second device is configured to share its respective measured value with the controller of the first smart-speaker unit via the inaudible signal; and the controller on the first device is configured to compare the measured values to determine which of the first and second devices has received said one of the speech commands with the greater quality, and to cause the determined device to execute its respective instance of that speech command.
18. The system of claim 17, wherein said property is an audio property.
19. The system of claim 18, wherein said audio property comprises one of: noise floor, signal to noise ratio, or received audio signal level, or a metric based on one or more thereof.
20. The system of claim 9 or any claim as dependent thereon, wherein the respective controller on the first device is further configured to communicate with the controller on the second device by controlling the respective loudspeaker of the first device to emit a further inaudible signal.
21. The system of claim 20, wherein the controller on the first device is configured to include one or more control settings for the second device in the further inaudible signal, and by means of said further inaudible signal, to control the controller on the second device to apply the one or more control settings to the second device.
22. The system of claim 21, wherein the one or more control settings comprise one or more network configuration settings for enabling the second device to join at least one of the one or more networks.
23. The system of claim 22, wherein the one or more network configuration settings comprise a network ID and/or password for the second device to use to join the at least one network.
24. The system of any of claims 21 to 23, wherein the one or more control settings comprise one or more audio settings for the second device to use in playing out its respective audio content.
25. The system of claim 24, wherein the one or more audio settings comprise a volume and/or equalization setting.
26. The second device or system of any of claims 1 to 19, wherein the controller on the second device is configured to include one or more control settings for the first device in said inaudible signal, and by means of said inaudible signal, to control the controller on the first device to apply the one or more control settings to the first device.
27. The second device or system of claim 26, wherein the one or more control settings comprise one or more network configuration settings for enabling the first device to join at least one of the one or more networks.
28. The second device or system of claim 27, wherein the one or more network configuration settings comprise a network ID and/or password for the first device to use to join the at least one network.
29. The second device or system of any of claims 26 to 28, wherein the one or more control settings comprise one or more audio settings for the first device to use in playing out respective audio content from the first device.
30. The second device or system of claim 29, wherein the one or more audio settings comprise a volume and/or equalization setting.
31. The second device or system of any preceding claim, wherein the or each smart-speaker unit comprises no display screen.
32. A set of devices comprising at least a first device and a second device, at least the first device comprising a microphone, and each of the first and second devices comprising a respective loudspeaker and controller; wherein:
the respective controller on the second device is configured to control its respective loudspeaker to play out audible content;
at least the first device comprises a network interface arranged to provide the respective controller with access to a data network employing an electronic and/or electromagnetic access technology;
the respective controller on the second device is further configured to communicate with the controller on the first device by controlling the respective loudspeaker of the second device to emit an inaudible signal comprising a predetermined pattern;
the controller on the first device is configured to stream a portion of content to the controller on the second device via the data network, wherein the streaming incurs a network delay;
the controllers on the first and second devices are configured to control their respective loudspeakers to play out said portion of audible content in parallel with one another;
the respective controller on the second device is configured to control its respective loudspeaker to emit said pattern with a predetermined timing relative to a portion of the audible content played out from the respective loudspeaker;
the controller on the first device is configured to detect the predetermined pattern in an acoustic signal received by the microphone based on a comparison with a reference instance of the pattern, and to thereby determine an average value of said network delay; and the controller on the first device is further configured to cause an inverse of the determined average delay to be applied to the play out of the audio content from the respective loudspeaker of one of the first and second devices.
33. A set of devices comprising at least a first device and a second device, at least the first device comprising a microphone, and at least the second device comprising a loudspeaker, and each of the first and second devices comprising a respective controller; wherein:
the respective controller on the second device is configured to control its respective loudspeaker to play out audible content;
the respective controller on the second device is further configured to communicate with the controller on the first device by controlling the loudspeaker of the second device to emit an inaudible signal; and the respective controller on the first device is configured to determine an acoustic filtering effect of an environment of the devices based on the inaudible signal received by the first device, and to cause the controller on the second device to apply an inverse of the determined filtering effect to subsequent playout of audible content from the loudspeaker of the second device.
GB1809551.3A 2018-06-11 2018-06-11 Communication between audio devices Active GB2574803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1809551.3A GB2574803B (en) 2018-06-11 2018-06-11 Communication between audio devices

Publications (3)

Publication Number Publication Date
GB201809551D0 GB201809551D0 (en) 2018-07-25
GB2574803A true GB2574803A (en) 2019-12-25
GB2574803B GB2574803B (en) 2022-12-07

Family

ID=62975745

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1809551.3A Active GB2574803B (en) 2018-06-11 2018-06-11 Communication between audio devices

Country Status (1)

Country Link
GB (1) GB2574803B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150365750A1 (en) * 2014-06-16 2015-12-17 Mediatek Inc. Activating Method and Electronic Device Using the Same
US20160035337A1 (en) * 2013-08-01 2016-02-04 Snap Networks Pvt Ltd Enhancing audio using a mobile device
WO2016057268A1 (en) * 2014-10-09 2016-04-14 Google Inc. Hotword detection on multiple devices
US20160372113A1 (en) * 2012-02-08 2016-12-22 Amazon Technologies, Inc. Configuration of Voice Controlled Assistant
US20170034263A1 (en) * 2015-07-30 2017-02-02 Amp Me Inc. Synchronized Playback of Streamed Audio Content by Multiple Internet-Capable Portable Devices
CN108172227A (en) * 2018-02-09 2018-06-15 深圳市沃特沃德股份有限公司 Voice remote control method and device
US10109182B1 (en) * 2016-07-20 2018-10-23 Dsp Group Ltd. Voice command conversion

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9424841B2 (en) * 2014-10-09 2016-08-23 Google Inc. Hotword detection on multiple devices
WO2019035700A1 (en) * 2017-08-18 2019-02-21 Samsung Electronics Co., Ltd. Method and an apparatus for onboarding in an iot network

Also Published As

Publication number Publication date
GB201809551D0 (en) 2018-07-25
GB2574803B (en) 2022-12-07

Similar Documents

Publication Publication Date Title
US11812253B2 (en) Wireless multi-channel headphone systems and methods
AU2022246446B2 (en) Systems and methods of operating media playback systems having multiple voice assistant services
JP6577082B2 (en) Satellite volume control
CN109076285B (en) Audio response playback
US11914921B2 (en) Synchronous sounds for audio assistant on devices
JP2024020356A (en) Playback transition between audio devices
CN113168850B (en) Distributed synchronous playback apparatus and method therefor
US11140485B2 (en) Wireless transmission to satellites for multichannel audio system
US20220086758A1 (en) Power Management Techniques for Waking-Up Processors in Media Playback Systems
CA3195287A1 (en) Playback transitions
CA3155380A1 (en) Synchronizing playback of audio information received from other networks
GB2574803A (en) Communication between audio devices
JP2024520930A (en) Audio Encryption in Media Playback Systems
US20240236604A9 (en) Wireless Multi-Channel Headphone Systems and Methods
WO2024073078A2 (en) Playback system architectures and area zone configurations
CN118339855A (en) Techniques for rebinding playback devices