GB2613898A - Noise cancellation - Google Patents

Noise cancellation

Info

Publication number
GB2613898A
Authority
GB
United Kingdom
Prior art keywords
audio
digital stream
acoustic
stream
audio digital
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2118540.0A
Inventor
Williams Douglas
Bicknell John
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
British Telecommunications PLC
Original Assignee
British Telecommunications PLC
Application filed by British Telecommunications PLC
Priority to GB2118540.0A
Priority to PCT/EP2022/082932 (published as WO2023117272A1)
Publication of GB2613898A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10K: SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K 11/00: Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K 11/16: Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K 11/175: Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Abstract

Crosstalk interference in digital audio streams is reduced by predicting the acoustic crosstalk that would arise between two digital audio streams 140, 150 encoding acoustic signals if those signals were output simultaneously, using a matrix of transfer/impulse response functions based on properties of the devices 136, 138, and then modifying the second digital audio stream by subtracting a cancellation signal. The actual acoustic signals therefore need not be detected.

Description

NOISE CANCELLATION
Technical Field
Embodiments of the present invention described herein relate to methods and systems for noise cancellation, in particular, methods and systems for reducing the effect of acoustic crosstalk interference using knowledge of digital audio streams in a network.
Background to the Invention and Prior Art
Active noise control, also known as noise cancellation, is a known method in the art for reducing unwanted sound by adding a second sound which is chosen to cancel the unwanted sound. Noise cancellation has many applications and can be used both in upstream audio streams and downstream audio streams.
An example of a downstream application is a user playing streamed music (downstream): the stream may be noise cancelled by adding a second sound chosen to cancel out any unwanted sound. This second sound is calculated by detecting the acoustic signals in the user's environment which form the unwanted sound.
An example of an upstream application is the use of voice commands with virtual assistants on smart speakers and smartphones. Virtual assistants are often activated by a wake word or phrase. For the virtual assistant to work, the user's voice commands need to be detectable even in environments with significant background noise present. Smart speakers currently have arrays of directional microphones and depend on being able to identify the location from where a voice command is being issued. The location of a voice command is identified through a pattern matching technique where the sound identified is matched to patterns that are known to be examples of the wake word (or other specific commands known to the service). When the device has identified a voice, it can then improve its focus on the voice by changing the relative levels of gain in the different microphones in the device.
For example, there may be seven microphones in a smart speaker (microphones 1 to 7) and microphone 1 may be the one identified as being used to record the voice command. Signals from one or more of microphones 2 to 7 may be used to characterise the unwanted background noise, which may be subtracted from the signal recorded by microphone 1 to improve the signal-to-noise ratio and thus make the voice command more intelligible.
Smartphones utilise multiple microphones. The signal from the microphone furthest from the user's mouth, which contains mostly background noise, is subtracted from the signal recorded by the microphone placed near the user's mouth, which contains both the voice and the background noise. The subtraction yields a signal which should be largely clear of background noise. This type of solution works well when the form factor of the device and the likely position of the source of the voice can be anticipated with some accuracy. The technique works less well for microphones situated a long way from the person speaking.
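The two-microphone subtraction described above can be sketched as follows. This is a simplified, hypothetical illustration: a real device estimates the noise gain (and the relative delay between microphones) adaptively rather than assuming a fixed constant, and the function name here is invented for the example.

```python
import numpy as np

def two_mic_noise_subtraction(primary, reference, noise_gain=1.0):
    """Subtract a scaled copy of the reference (noise-only) microphone
    signal from the primary (voice + noise) microphone signal."""
    primary = np.asarray(primary, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return primary - noise_gain * reference

# Toy signals: the mic near the mouth hears voice + noise,
# the far mic (ideally) hears the noise alone.
voice = np.array([0.5, -0.2, 0.8, 0.1])
noise = np.array([0.1, 0.1, -0.1, 0.2])
cleaned = two_mic_noise_subtraction(voice + noise, noise)
# cleaned is now (approximately) the voice signal alone
```

In practice the reference microphone also picks up some of the voice and the noise arrives at the two microphones with different gains and delays, which is why adaptive estimation is used instead of a fixed `noise_gain`.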
A paper written by Yong Xu et al. in 2015 titled "A regression approach to speech enhancement based on deep neural networks" proposes a regression method which learns to produce a ratio mask for every audio frequency. The produced ratio mask aims to leave human voice intact and deletes extraneous noise.
US 9,330,652 B2 describes a noise cancellation technique based on detecting noise using multiple microphones.
A paper submitted as part of the European DIRHA project (Distant-speech Interaction for Robust Home Applications) titled "A multi-channel corpus for distant-speech interaction in presence of known interferences" presents a new corpus of multi-channel audio data for studying distant-speech recognition systems. As part of experiments to validate the corpus, the authors ran experiments to baseline microphone beam-forming and Acoustic Echo Cancellation (AEC) techniques. This processing is not the focus of the paper, but it refers to the removal of interference to improve speech recognition when other known audio sources are active; such sources may be acquired at source and used as a reference to provide suppression.
A particular challenge with these approaches is the speed with which the machine can make decisions. Humans like voice-based interaction systems to have delays of less than 200 ms; otherwise the interaction feels unnatural and people are likely to think they have not been heard. It is an aim of the present invention to overcome the problems of the prior art.
Summary of the Disclosure
Embodiments of the invention provide a noise cancellation technique which does not require detecting acoustic signals in the vicinity of a device; rather, the invention uses knowledge of what other audio streams are being transported in a network to predict the background noise (referred to as acoustic crosstalk interference) that is present, or will be present, at an end-node recording or reproduction device. The prediction can then be used to mitigate the background noise.
One application of the invention relates to increasing the intelligibility of voice commands received by voice assistants, for example, with devices such as smart phones and smart speakers like Amazon® Echo®, Google® Nest®, Sonos® speakers and Apple® HomePod® etc. These devices need to hear voice commands clearly, even when a TV may be on or music may be playing. Such prior art devices do not have prior knowledge of other sounds playing in the vicinity and so instead listen for wake words and commands above or within the melee of other sounds, for example by focusing the microphone array on where a voice is detected. The present invention improves the intelligibility of the voice commands by providing knowledge of other sounds that may be playing in the vicinity of the smart device, allowing these sounds to be subtracted (or minimised) from the signal the microphones detect.
Another application of the invention relates to pre-emptively cancelling (or minimising/mitigating) noise from neighbouring devices in downstream streams. For example, two neighbouring devices (first and second devices) both receiving downstream streams (first and second streams) from the network router may both be playing audio out loud. This may result in the audio of the first device causing unwanted background noise (i.e., acoustic crosstalk interference) for the audio of the second device. The present invention provides a way of pre-emptively cancelling (or alternatively minimising or merely reducing) this unwanted background noise using knowledge of the first stream which will cause the unwanted background noise. The knowledge of the first stream at the network router is used to make a prediction of the unwanted background noise which will be caused in the vicinity of the second device by the first stream playing through the first device. The prediction of the unwanted background noise can then be used to minimise the impact, in the vicinity of the second device, of the unwanted noise caused by the first device.
The present disclosure addresses the above problem, by providing methods and systems of using knowledge of digital audio streams to reduce the effect of acoustic crosstalk interference reported by listening devices such as microphones or heard by people.
Embodiments of the present invention provide a noise cancellation technique which does not require detecting acoustic signals in the vicinity of a device; rather, the invention uses knowledge of what other audio streams are present and being transported in a network to predict the background noise present at a sound reproduction or recording device.
In view of the above, from a first aspect, the present disclosure relates to a method for reducing the effect of acoustic crosstalk interference using knowledge of a digital audio stream. The method comprises: (i) detecting at a network router, at least a first audio digital stream and a second audio digital stream, the first audio digital stream encoding a first audio acoustic signal and the second audio digital stream encoding at least a second audio acoustic signal; (ii) predicting acoustic crosstalk interference of the first audio acoustic signal with the second audio acoustic signal that would occur upon simultaneous production of the first and second acoustic signals, the prediction using the first audio digital stream; and (iii) reducing the effect of the predicted acoustic crosstalk interference by modifying the second audio digital stream with a cancellation signal generated based on the prediction of the acoustic crosstalk interference.
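The three steps of the first aspect might be sketched as follows, under the simplifying assumptions that the acoustic path from the first device to the second is a known FIR impulse response and that the cancellation signal is a plain subtraction of the prediction. All names and values are illustrative, not taken from the patent.

```python
import numpy as np

def predict_crosstalk(first_stream, path_impulse_response):
    """Step (ii): predict the crosstalk the first stream would cause at
    the second device by convolving it with the acoustic path's impulse
    response (assumed known here)."""
    return np.convolve(first_stream, path_impulse_response)[:len(first_stream)]

def apply_cancellation(second_stream, predicted_crosstalk):
    """Step (iii): modify the second stream with a cancellation signal,
    here a simple subtraction of the prediction."""
    return second_stream - predicted_crosstalk

# Step (i): two streams detected at the router (toy data).
first_stream = np.array([1.0, 0.0, -1.0, 0.5])
path = np.array([0.5, 0.25])                 # assumed acoustic path
wanted = np.array([0.2, 0.3, -0.2, 0.1])     # second acoustic signal

crosstalk = predict_crosstalk(first_stream, path)
recorded = wanted + crosstalk                # what the second mic would hear
cleaned = apply_cancellation(recorded, crosstalk)
```

With a perfect prediction, `cleaned` recovers the wanted second acoustic signal exactly; in practice the prediction is only approximate, so the crosstalk is reduced rather than eliminated.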
Several advantages are obtained from embodiments according to the above described aspect. For example, the invention uses knowledge of digital audio streams present in a network to predict background noise (acoustic crosstalk interference) present in the vicinity of a device. The prediction can then be used to minimise the background noise audible to a user of the device by mitigating the background noise based on the prediction.
Or, the prediction can be used to mitigate background noise of a recorded stream which contains the background noise (acoustic crosstalk interference) resulting from other audio streams in the vicinity. "Acoustic crosstalk interference" is used herein to refer to any undesired effect (e.g., background noise) present in a signal or stream which is the result of another signal or stream.
In some embodiments, the first audio digital stream and the second audio digital stream are sent between the network router and the same device. In other embodiments, the first audio digital stream is sent between the network router and a first device and the second audio digital stream is sent between the network router and a second device. When the first and second audio digital streams are to/from the same device (e.g. a smart speaker simultaneously playing music and listening for voice commands), the invention provides a way of cancelling out the music so that voice commands can be distinguished. When the first and second audio digital streams are to/from different devices (e.g. one stream is being sent to/from one device, e.g. a TV, and the other stream is being sent to/from another device, e.g. a smart speaker), the invention provides a way of cancelling out the TV from the background of voice commands picked up by the speaker, to better distinguish the commands.
In some embodiments, modifying the second audio digital stream with the cancellation signal generated based on the prediction of the acoustic crosstalk interference comprises subtracting the predicted acoustic crosstalk interference from the second audio digital stream. Subtracting the predicted background noise (acoustic crosstalk interference) from the second audio digital stream cancels out (or at least minimises) the background noise in the vicinity of the device at which the second audio digital stream is sent to or from.
In some embodiments, the first audio digital stream and the second audio digital stream are travelling in the same direction. In other embodiments, the first audio digital stream and the second audio digital stream are travelling in opposite directions. As described herein, the invention may be applied to at least two digital streams. Possible scenarios include: both streams being upstream (i.e., sent from a device to the network router), both streams being downstream (i.e., originating from the cloud or any digital storage facility, which may include local storage, and being sent from the network router to a device), or one stream being upstream and the other being downstream (opposite directions).
In some embodiments, the first audio digital stream is sent between the network router and a first device and predicting the acoustic crosstalk interference comprises using properties of the first device. In a scenario where the first audio digital stream is sent to or from (i.e., the stream may be upstream or downstream) a first device, the prediction of the acoustic crosstalk interference (resulting from the background noise) may use the properties of that first device (e.g., the type of device, the volume of the device, the microphone on the device, location of the device) to better predict the acoustic crosstalk interference.
In some embodiments, the second audio digital stream is sent between the network router and a second device and predicting the acoustic crosstalk interference comprises using properties of the second device. In a scenario where the second audio digital stream is sent to or from (i.e., the stream may be upstream or downstream) a second device, the prediction of the acoustic crosstalk interference (resulting from the background noise) may use the properties of that second device (e.g., the type of device, the volume of the device, the microphone on the device, location of the device) to better predict the acoustic crosstalk interference.
In some embodiments, predicting the acoustic crosstalk interference comprises using a matrix of transfer functions comprising the properties of the first device and/or the properties of the second device. Storing properties of the devices in the network in a matrix allows the system to quickly look up transfer functions (i.e., how the set up of the devices affects how the first acoustic signal appears in the second audio digital stream) for a given set up. A transfer function theoretically models a system's output for each possible input.
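One way such a matrix might be realised is a look-up table keyed by (listening device, loudspeaker) pairings, in the spirit of the ledgers of Figures 15 and 16, with each entry holding short FIR filter coefficients approximating the transfer function for that pairing. The keys and coefficients below are invented for illustration.

```python
import numpy as np

# Hypothetical matrix of transfer functions, keyed by
# (listening_device_j, loudspeaker_k) pairings.
# Each entry is a short FIR approximation of the acoustic path.
TRANSFER_FUNCTIONS = {
    ("smart_speaker_mic", "tv_speaker"): np.array([0.30, 0.15, 0.05]),
    ("smart_speaker_mic", "own_speaker"): np.array([0.80, 0.10]),
}

def predict_interference(stream, listener, loudspeaker):
    """Look up the transfer function for the device pairing and apply it
    to the first stream to predict the crosstalk at the listener."""
    h = TRANSFER_FUNCTIONS[(listener, loudspeaker)]
    return np.convolve(stream, h)[:len(stream)]

music = np.array([1.0, -1.0, 0.5, 0.0])
predicted = predict_interference(music, "smart_speaker_mic", "tv_speaker")
```

The look-up keeps the per-stream cost low: the router only convolves the first stream with a pre-stored filter rather than re-estimating the acoustic path for every prediction.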
In some embodiments, predicting the acoustic crosstalk interference comprises using a matrix of impulse response functions comprising impulse response functions of the environment where the first audio acoustic signal is generated. An impulse response is a system's output when presented with a brief input signal, called an impulse. The impulse response measures how sound travels in an environment between where it is generated and where it is measured.
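The definition above can be checked numerically: feeding a unit impulse through a linear system returns the system's impulse response itself. A minimal sketch, with an FIR filter standing in for the room (coefficients invented):

```python
import numpy as np

# An FIR filter standing in for the acoustic path between where the
# sound is generated and where it is measured.
room = np.array([1.0, 0.6, 0.3, 0.1])

def apply_room(signal):
    """Convolve a signal with the room's impulse response."""
    return np.convolve(signal, room)

impulse = np.array([1.0])        # a brief input signal: the unit impulse
response = apply_room(impulse)   # the output equals the impulse response
```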
In some embodiments, the impulse response functions are learned via a machine learning process. This allows the system to learn, through a training phase, the way that the specific location and set up of the device(s) transform the first digital stream and the resultant first acoustic signal. The training phase thus defines a measured impulse response that can be subtracted from the second audio digital stream to remove the known noise from the second audio digital stream.
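The training phase could, for example, be approximated with a standard least-mean-squares (LMS) adaptive filter: play a known first stream, record what the second stream picks up, and iteratively adjust FIR coefficients until the prediction error is small. This is a sketch under those assumptions; the "true" impulse response, the step size and the function name are all invented for the example.

```python
import numpy as np

def lms_learn_impulse_response(played, recorded, taps=3, mu=0.05, epochs=50):
    """Estimate the impulse response mapping the played (first) stream to
    the recorded (second) stream using the LMS update rule."""
    h = np.zeros(taps)
    for _ in range(epochs):
        for n in range(taps - 1, len(played)):
            x = played[n - taps + 1:n + 1][::-1]   # most recent sample first
            err = recorded[n] - np.dot(h, x)       # prediction error
            h += mu * err * x                      # LMS coefficient update
    return h

rng = np.random.default_rng(0)
true_h = np.array([0.5, 0.25, 0.1])               # unknown acoustic path
played = rng.standard_normal(2000)                # known training signal
recorded = np.convolve(played, true_h)[:len(played)]
learned = lms_learn_impulse_response(played, recorded)
# learned should be close to true_h
```

A white-noise training signal is used here because it excites all frequencies; in a deployed system the training signal could simply be the audio already being streamed to the device.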
In some embodiments, the first audio digital stream is upstream and the first acoustic signal is recorded and encoded as the first audio digital stream by a first device. In this scenario, the first acoustic signal recorded by the first device is encoded to form the first audio digital stream and sent to the network router.
In some embodiments, the first device comprises a microphone. The microphone records the first acoustic signal.
In some embodiments, the first audio digital stream is downstream and the first audio digital stream is decoded and reproduced as the first acoustic signal by a first device. In this scenario, the first audio digital stream is sent from the network router to the first device, the first audio digital stream is then decoded and reproduced as the first acoustic signal at the first device.
In some embodiments, the first device comprises a speaker. The speaker reproduces the first audio digital stream as the first acoustic signal.
In some embodiments, the second audio digital stream further encodes the acoustic crosstalk interference of the first audio acoustic signal; and the effect of the acoustic crosstalk interference is reduced by minimising the acoustic crosstalk interference component in the second audio digital stream by applying the cancellation signal to the second audio digital stream. The second audio digital stream therefore comprises two components, the second audio acoustic signal (i.e., the wanted sound) and the acoustic crosstalk interference from the first audio acoustic signal (i.e., the unwanted sound or background noise). The unwanted sound is removed from the second audio digital stream using the cancellation signal based on the prediction using the first digital audio stream.
In some embodiments, the second audio digital stream is upstream and the second audio acoustic signal is recorded and encoded as the second audio digital stream by the first device or a second device, preferably the first or second device comprises a microphone. In this scenario, the second digital stream encodes a second acoustic signal comprising acoustic crosstalk interference (background noise) resulting from the first acoustic signal; the second digital stream is then sent from the device to the network router.
In some embodiments, the effect of the predicted acoustic crosstalk interference is reduced by applying the cancellation signal to the second audio digital stream such that the second audio digital stream encodes the second audio acoustic signal and a cancellation acoustic signal. The second audio digital stream therefore comprises two components, the second audio acoustic signal (i.e., the wanted sound) and the cancellation acoustic signal to cancel out the acoustic crosstalk interference from the first audio acoustic signal. The unwanted background noise is pre-emptively cancelled using the cancellation signal in the second audio digital stream.
In some embodiments, the second audio digital stream is downstream and the second audio digital stream is decoded and reproduced as the second acoustic signal by the first device or a second device, preferably the first or second device comprises a speaker. In this scenario, the second audio digital stream is received at the network router from the cloud (or any digital storage facility, which may include local storage), and then sent to the device where it is reproduced as a second acoustic signal. The first acoustic signal which is present in the vicinity of the device results in background noise (acoustic crosstalk interference) in the background of the second acoustic signal. By subtracting (or otherwise removing) the predicted acoustic crosstalk interference from the second audio digital stream before it is reproduced at the device, the background noise of the first acoustic signal is mitigated (pre-emptively cancelled).
From a second aspect, the present disclosure relates to a system for reducing the effect of acoustic crosstalk interference using knowledge of a digital audio stream, the system comprising: a processor; and a memory including computer program code. The memory and the computer code configured to, with the processor, cause the system to perform the method of any of the preceding embodiments.
From a third aspect, the present disclosure relates to a system for reducing the effect of acoustic crosstalk interference using knowledge of a digital audio stream. The system comprises: (i) an audio digital stream detector arranged to detect at least a first audio digital stream and a second audio digital stream at a network router, the first audio digital stream encoding a first audio acoustic signal and the second audio digital stream encoding at least a second audio acoustic signal; and (ii) a processor. The processor is arranged to: (a) predict acoustic crosstalk interference of the first audio acoustic signal with the second audio acoustic signal that would occur upon simultaneous production of the first and second acoustic signals, the prediction using the first audio digital stream; and (b) reduce the effect of the predicted acoustic crosstalk interference by modifying the second audio digital stream with a cancellation signal generated based on the prediction of the acoustic crosstalk interference.
In some embodiments, such a system may be located in the network router itself.
From a fourth aspect, the present disclosure relates to a method for mitigating background noise (acoustic crosstalk interference) present in an audio digital stream comprising voice commands. The method comprises: (i) detecting at a network router, at least a first audio digital stream and a second audio digital stream, wherein the second audio digital stream comprises at least one voice command and background noise, the background noise resulting from a first audio acoustic signal encoded by the first audio digital stream; (ii) predicting the background noise present in the second audio digital stream, the prediction using the first audio digital stream; and (iii) removing the predicted background noise from the second audio digital stream to generate a background noise reduced second audio digital stream having less background noise from the first audio acoustic signal. This results in a cleaner stream of the voice command, thereby increasing the intelligibility of the voice command, enabling easier processing.
In this scenario, the first audio digital stream is downstream. The first audio digital stream is received at the network router from the cloud (or any digital storage facility, which may include local storage) and then sent to a first device. The first audio digital stream is decoded and reproduced as the first acoustic signal by the first device. The second audio digital stream is upstream. A second acoustic signal is recorded at a second device (which may be the same as the first device), the second acoustic signal is then encoded as the second audio digital stream and sent to the network router. The second acoustic signal comprises the voice command and the background noise (resulting from the first acoustic signal which originates nearby).
In some embodiments, the first audio digital stream and the second audio digital stream are sent between the network router and the same device or two different devices.
In some embodiments, removing the predicted background noise from the second audio digital stream comprises subtracting the predicted background noise from the second audio digital stream.
In some embodiments, the first audio digital stream is sent between the network router and a first device and predicting the background noise comprises using properties of the first device.
In some embodiments, the second audio digital stream is sent between the network router and a second device and predicting the background noise comprises using properties of the second device.
In some embodiments, predicting the background noise comprises using a matrix of transfer functions comprising the properties of the first device and/or the properties of the second device.
In some embodiments, predicting the background noise comprises using a matrix of impulse response functions comprising the properties of the first device and/or the properties of the second device. An impulse response is a system's output when presented with a brief input signal, called an impulse.
In some embodiments, the impulse response functions are learned via a machine learning process.
In some embodiments, the first device comprises a speaker. The speaker reproduces the first audio digital stream as the first acoustic signal.
Brief Description of the Drawings
Embodiments of the invention will now be further described by way of example only and with reference to the accompanying drawings, wherein:
Figure 1 is a system diagram illustrating embodiments of the present invention;
Figure 2 is a system diagram illustrating embodiments of the present invention;
Figure 3 is a system diagram illustrating embodiments of the present invention;
Figure 4 is a system diagram illustrating embodiments of the present invention;
Figure 5 is a flow chart illustrating an application of embodiments of the present invention;
Figure 6 is a system diagram illustrating embodiments of the present invention;
Figure 7 is a system diagram illustrating embodiments of the present invention;
Figure 8 illustrates a simplified example of how embodiments of the present invention work;
Figure 9 is a flow chart illustrating embodiments of the present invention;
Figure 10 is a flow chart illustrating embodiments of the present invention;
Figure 11 is a flow chart illustrating embodiments of the present invention;
Figure 12 is an example of a look up table that shows generic transfer functions;
Figure 13 is an example of a look up table that shows modelled transfer functions;
Figure 14 is an example of a look up table that shows measured transfer functions;
Figure 15 is an example ledger which records the preferred transfer function to use for a given (j,k) pairing; and
Figure 16 is an example ledger which shows the preferred transfer function to use for each possible listening device (1-j) and loudspeaker (1-k) pairing.
Description of the Embodiments
Embodiments of the present invention relate to methods and systems for cancelling noise in a digital audio stream by predicting what unwanted sound will arise from other audio digital streams present in the network.
The present invention provides a way to use the knowledge of digital streams in the network router to predict what background noise (referred to as acoustic crosstalk interference) will be present at devices in the network. The predicted background noise can then be pre-emptively cancelled (or minimised) by subtracting (or any other suitable method of noise removal) the prediction from a stream going to or from the device where the background noise is present.
In more detail, the current state of the art of noise cancelling relies on detecting acoustic signals in the vicinity of a device/user and then subtracting those detected acoustic signals from the incoming or outgoing sound to produce noise cancelled sound. For example, if a user is listening to music with headphones, current state of the art detects what acoustic signals are in the vicinity of the headphones, characterises these acoustic signals as background noise and then subtracts the background noise from the music being reproduced at the headphones. The user then only hears the music. Another example is if a user is speaking on the phone. Current state of the art has two microphones, one to detect the user's speech plus any background noise, and one to detect the background noise. Again, these microphones are detecting acoustic signals in the vicinity of the device. The detected background noise is subtracted from the signal containing background noise and the user's speech, thereby only leaving the user's speech.
The present invention provides another way of reducing background noise without having to detect acoustic signals in the vicinity of a device/user, although it is envisaged that the present invention could be used in combination with existing prior art noise cancelling methods to further enhance noise cancellation effects. The present invention uses knowledge of digital streams within a network to predict what background noise is likely to be present. The digital streams can then be corrected for this predicted background noise.
Various aspects and details of these principal components will be described further below with reference to Figures 1 to 11.
Figure 1 illustrates an example system 100 of the present invention comprising a cloud 110 in communication with a network router 120. The network router 120 is able to send and receive data to and from the cloud 110. Instead of the cloud 110, the network router 120 may send and receive data from any digital storage facility, which may include local storage. The network router 120 comprises an audio stream detector 122 able to detect the presence of digital audio streams in the network router, a processor 124 able to perform noise prediction based on detected streams and to cancel or minimise noise from streams based on that prediction, a receiver 126 able to receive digital streams from one or more devices in the network and a transmitter 128 able to send digital streams to one or more devices in the network.
In example system 100, there is a device 130 connected to the network router 120. The device 130 comprises both a transmitter 132 and a receiver 134. The device 130 further comprises a microphone 136 and a speaker 138.
The speaker 138 does not have to be contained within the device 130; the speaker may be external to the device. Similarly, the microphone 136 does not have to be contained within the device 130; the microphone may be external to the device. This applies to all embodiments described herein (Figures 1-4).
The device 130 is arranged such that an acoustic audio signal can be detected at the microphone 136, converted at the device 130 to an upstream digital audio stream 140 (converter not shown in Figure), and transmitted from the device 130 to the network router 120 by the transmitter 132. The device 130 is further arranged such that a downstream digital audio stream 150 can be received at the receiver 134 from the network router 120, converted to an acoustic audio signal (converter not shown in Figure) and reproduced by speaker 138.
One example of a system like example system 100 would be a smart speaker which is simultaneously (1) playing streamed music received via downstream stream 150 and played by speaker 138; and (2) listening for voice commands being detected at microphone 136 and sending any received voice commands to the network router 120 in the form of upstream stream 140.
The downstream stream 150 for the music being played on the smart speaker device 130 will be received at the network router 120 from the cloud 110 and then sent to the device 130 which converts the digital audio stream 150 to an acoustic audio signal to be reproduced by the speaker 138. If a person wishes to speak to their smart speaker, they may use a pre-determined wake phrase to activate the smart speaker. The acoustic audio signal recorded by the smart speaker's microphone 136 is transformed into a digital audio stream 140 and sent to the network router 120 to be sent onto the cloud 110 where the audio will be processed to understand what the voice command is. The stream 140 will comprise the person's voice speaking the wake phrase and the audio from the music being played. The router 120 therefore contains two audio digital streams 140, 150. The first digital audio stream 150 comprises the outgoing music audio (from the router 120 to the device 130) and the second digital audio stream 140 comprises the incoming audio from the smart speaker microphone 136 (from the device 130 to the router 120) which contains both the person's voice and audio from the music. For good interpretation of the person's voice, it is advantageous to remove the audio from the music (unwanted sound or noise).
Knowledge of the first digital audio stream 150 can be used to remove the noise. The first digital audio stream 150 (the outgoing music) can be used to predict what noise will be picked up by the microphone 136 as a result of the music played by the speaker 138. The first digital audio stream 150 is used to generate a signal pattern which would be expected to be picked up at the microphone 136, based on the first digital audio stream 150. This signal pattern corresponds to the noise. Further details of how the background noise (acoustic crosstalk interference) is predicted are below. The signal pattern of the background noise can then be subtracted from the second digital audio stream 140 to remove the noise caused by the music. This noise removal can be completed before the second audio digital stream 140 is sent to the cloud 110 to be processed, allowing a noise-cancelled second audio digital stream to be sent to the cloud 110.
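The prediction-and-subtraction step described above can be sketched in a few lines. This is a minimal illustration only, assuming the acoustic path from the speaker 138 to the microphone 136 can be represented by a known impulse response; the stream contents, impulse response values and function names are hypothetical, not taken from the specification.

```python
import numpy as np

def cancel_known_stream(mic_stream, known_stream, impulse_response):
    """Subtract the predicted pick-up of a known outgoing stream from a
    microphone stream (illustrative sketch of the Figure 1 scheme)."""
    # Predict what the microphone hears: the known stream shaped by the
    # acoustic path (loudspeaker + room + microphone).
    predicted = np.convolve(known_stream, impulse_response)[:len(mic_stream)]
    # Remove the predicted signal pattern, leaving the wanted audio.
    return mic_stream - predicted

# Toy example: a voice plus music coloured by an assumed two-tap echo path.
rng = np.random.default_rng(0)
music = rng.standard_normal(1000)    # first audio digital stream (150)
voice = rng.standard_normal(1000)    # wanted signal at the microphone
path = np.array([0.6, 0.0, 0.3])     # hypothetical impulse response
mic = voice + np.convolve(music, path)[:1000]   # second stream (140)
cleaned = cancel_known_stream(mic, music, path)
```

In this idealised toy the path is known exactly, so the cleaned stream equals the wanted voice signal; in practice the prediction would only approximate the true acoustic path.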
Figure 2 illustrates an example system 200 of the present invention comprising a cloud (or any digital storage facility, which may include local storage) in communication with a network router 120 which functions as described above in relation to Figure 1.
In example system 200, there is a first device 230 and a second device 260 connected to the network router 120. The first device 230 comprises a transmitter 232. The first device 230 further comprises a microphone 236. The first device 230 is arranged such that an acoustic audio signal can be detected at the microphone 236, converted at the device 230 to an upstream digital audio stream 240 (converter not shown in Figure), and transmitted from the first device 230 to the network router 120 by the transmitter 232. The second device 260 comprises a receiver 234. The second device further comprises a speaker 238.
The second device 260 is arranged such that a downstream digital audio stream 250 can be received at the receiver 234 from the network router 120, converted to an acoustic audio signal (converter not shown in Figure) and reproduced by speaker 238.
One example of a system like example system 200 would be the first device 230 being a smart speaker which is listening for voice commands being detected at microphone 236 and sending any received voice commands to the network router 120 in the form of upstream stream 240. The second device 260 could be a music speaker or a TV speaker (or any other device which reproduces streamed audio) playing streamed music/TV received via downstream stream 250 and played by speaker 238. The first and second devices 230, 260 are neighbouring devices such that the acoustic audio signal generated at speaker 238 causes background noise detected at microphone 236. The system then acts in a similar way to what has been described above in relation to Figure 1. More detailed examples of upstream-downstream systems are described below.
A first example upstream-downstream system is a situation where there is a smart speaker 230 and a TV 260 connected to the same network streaming a TV program. The audio digital stream (a "first audio digital stream") 250 for the sound relating to the TV program being displayed on the TV 260 will be received at the network router 120 and then sent to the TV speakers 260 which convert the digital audio stream 250 to an acoustic audio signal so that a person watching the TV can hear the audio relating to the TV program. If a person wishes to speak to their smart speaker 230, they may use a pre-determined wake phrase to activate the smart speaker 230. The acoustic audio signal recorded by the smart speaker's microphone 236 is transformed into a digital audio stream 240 (a "second audio digital stream") and sent to the network router 120 to be sent onto the cloud 110 where the audio will be processed. The second audio digital stream 240 will comprise the person's voice speaking the wake phrase and the audio from the TV program. For example, the TV 260 may be in the same room as the smart speaker 230 (or nearby in a different room) and therefore audio from the TV will be picked up by the smart speaker's microphone 236. The router therefore contains two audio digital streams 240, 250. The first digital audio stream 250 comprises the outgoing TV audio (from the router to the TV) and the second digital audio stream 240 comprises the incoming audio from the smart speaker microphone 236 (from the smart speaker to the router) which contains both the person's voice and audio from the TV program. For good interpretation of the person's voice, it is advantageous to remove the audio from the TV program (unwanted sound or noise). Knowledge of the first digital audio stream 250 can be used to remove the noise. 
The first digital audio stream (the outgoing TV program sound) 250 can be used to predict what noise will be picked up by the smart speaker 230 as a result of the TV program sound. The first digital audio stream is used to generate a signal pattern of the background noise which would be expected to be picked up at the smart speaker 230, based on the first digital audio stream 250. Further details of how the background noise (acoustic crosstalk interference) is predicted are below. This signal pattern corresponds to the noise. The signal pattern can then be subtracted from the second digital audio stream 240 to remove the noise caused by the TV program. This noise removal can be completed before the second audio digital stream 240 is sent to the cloud to be processed.
A second example upstream-downstream system is a situation where there is a person talking to their smart speaker 230 causing unwanted noise in the vicinity of a person watching streamed TV, the smart speaker 230 and the TV 260 being connected to the same network. An acoustic signal of the person's speech as they talk to their smart speaker 230 is received at a microphone 236 in the smart speaker 230 and converted to an audio digital stream (a "first audio digital stream") 240. The first audio digital stream 240 is sent from the smart speaker 230 to the network router 120 (the network router will then send the first audio digital stream 240 onto the cloud 110 to process the voice command). A second audio digital stream 250 is also present in the network router 120: sound relating to the TV program being displayed on the TV 260 will be received at the network router 120 from the internet and then sent to the TV speakers 238, which convert the digital audio stream 250 to an acoustic audio signal so that a person watching the TV can hear the audio relating to the TV program. As an example, the TV 260 may be in the same room as the smart speaker 230 (or nearby in a different room) and therefore the sound of the person talking to their smart speaker 230 will disrupt the person watching the TV. The router therefore contains two audio digital streams 240, 250. The first digital audio stream 240 comprises the incoming audio from the smart speaker microphone 236 (from the smart speaker to the router) which contains the person's voice. The second digital audio stream 250 comprises the outgoing TV audio (from the router to the TV). Advantageously, the TV sound signal could be processed to pre-emptively cancel out the noise of the person speaking to their smart speaker (unwanted sound or noise). Knowledge of the first digital audio stream 240 can be used to remove the noise.
The first digital audio stream 240 (the incoming voice) can be used to predict what noise will be in the vicinity of the TV 260. The first digital audio stream 240 is used to generate a signal pattern which would be expected to be in the vicinity of the TV, based on the first digital audio stream 240. This signal pattern corresponds to the noise. The signal pattern can then be subtracted from the second digital audio stream 250 to remove the noise caused by the person talking to their smart speaker.
Figure 3 illustrates an example system 300 of the present invention comprising a cloud 110 in communication with a network router 120 which functions as described above in relation to Figure 1.
In example system 300, there is a first device 330 and a second device 360 connected to the network router 120. Both of the first and second devices 330, 360 comprise a transmitter 332, 362 and a microphone 336, 366. Both devices 330, 360 are arranged such that an acoustic audio signal can be detected at microphones 336, 366, converted at the devices 330, 360 to upstream digital audio streams 340, 350 (converter not shown in Figure), and transmitted from the devices 330, 360 to the network router 120 by transmitters 332, 362.
One example of a system like example system 300 would be two neighbouring smart speakers. Both devices 330, 360 could be smart speakers which listen for voice commands detected at microphones 336, 366 and send any received voice commands to the network router 120 in the form of upstream streams 340, 350. As the first and second devices 330, 360 are neighbouring devices, acoustic signals in the vicinity of one device (i.e., a person giving a voice command to one speaker) will be background noise for the other device. The system then acts in a similar way to that described above in relation to Figure 1 to remove the background noise. A more detailed example of an upstream-upstream system is described below.
One example situation is where there are two people talking to separate smart speakers 330, 360 which are both connected to the same network router 120 (or smartphones, or a combination of, or any device which sends detected audio signals upstream to the network router). If the smart speakers 330, 360 and/or people are close enough to one another, there will be overlap of the two voices in the acoustic signals detected by one or both of the speakers 330, 360 (depending on speaker sensitivity and voice volume). An acoustic signal of a first person's speech as they talk to a first smart speaker 330 is received at a microphone 336 in the first smart speaker 330 and converted to an audio digital stream (a "first audio digital stream") 340. The first audio digital stream 340 is sent from the first smart speaker 330 to the network router 120 (the network router 120 will then send the first audio digital stream 340 onto the cloud 110 to process the voice command). Similarly, an acoustic signal of a second person's speech as they talk to a second smart speaker 360 is received at a microphone 366 in the second smart speaker 360 and converted to an audio digital stream (a "second audio digital stream") 350. The second audio digital stream 350 is sent from the second smart speaker 360 to the network router 120 (the network router 120 will then send the second audio digital stream 350 onto the cloud 110 to process the voice command). 
If, for example, the first person's voice had been detected by the second smart speaker 360, and therefore the second audio digital stream 350 not only comprised the second person's voice command but also background noise of the first person speaking, it would be advantageous to remove the background noise of the first person speaking from the second audio digital stream 350 to increase intelligibility of the second person's voice command before the second audio digital stream 350 is sent from the network router 120 to the cloud 110 to be processed.
The router 120 therefore contains two audio digital streams 340, 350. The first digital audio stream 340 comprises the incoming audio from the first smart speaker microphone 336 (from the first smart speaker to the router) which contains the first person's voice. The second digital audio stream 350 comprises the incoming audio from the second smart speaker microphone 366 (from the second smart speaker to the router) which contains the second person's voice. Advantageously, the second audio digital stream 350 could be processed to remove or minimise the noise of the first person speaking to the first smart speaker in the background (unwanted sound or noise). Knowledge of the first digital audio stream 340 can be used to remove or minimise the noise. The first digital audio stream 340 (the incoming first voice) can be used to predict what noise will be in the background of the incoming second voice. The first digital audio stream 340 is used to generate a signal pattern which would be expected to be in the second digital audio stream 350, based on the first digital audio stream 340. This signal pattern corresponds to the noise. The signal pattern can then be subtracted from the second digital audio stream 350 to remove the noise caused by the first person talking to the first smart speaker 330, leaving only (or at least increasing the intelligibility of) the second person's voice command.
Figure 4 illustrates an example system 400 of the present invention comprising a cloud 110 in communication with a network router 120 which functions as described above in relation to Figure 1.
In example system 400, there is a first device 430 and a second device 460 connected to the network router 120. Both of the first and second devices 430, 460 comprise a receiver 434, 464 and a speaker 438, 468. Both devices 430, 460 are arranged such that downstream digital audio streams 440, 450 can be received at the receivers 434, 464 from the network router 120, converted to an acoustic audio signal (converter not shown in Figure) and reproduced by speakers 438, 468.
One example of a system like example system 400 would be two devices both emitting streamed sound, e.g. a music speaker and a TV speaker. As the first and second devices 430, 460 are neighbouring devices, acoustic signals produced by the first device 430 will also be heard in the vicinity of the second device 460 (i.e., the music from one speaker will be background noise for the other device). The system then acts in a similar way to that described above in relation to Figure 1 to remove the background noise. A more detailed example of a downstream-downstream system is described below.
Another example is a situation where there is a speaker 460 playing streamed music and a TV 430 connected to the same network router 120 streaming a TV program. The audio digital stream (a "first audio digital stream") 440 for the sound relating to the TV program being displayed on the TV will be received at the network router 120 and then sent to the TV speakers 438 which convert the digital audio stream 440 to an acoustic audio signal so that a person watching the TV can hear the audio relating to the TV program. The audio digital stream (a "second audio digital stream") 450 for the sound relating to the streamed music being played on the speaker 468 will also be received at the network router 120 and then sent to the music speaker 468 which converts the digital audio stream 450 to an acoustic audio signal so that a person listening to the music can hear the music audio. If, for example, the TV 430 is in the same room as the music speaker 460 (or nearby in a different room) audio from the TV will "spill over" into the vicinity of the music speaker. Without the present invention, a person wanting to only listen to the music would have unwanted background noise from the TV playing nearby. The router 120 contains two audio digital streams 440, 450. The first digital audio stream 440 comprises the outgoing TV audio (from the router 120 to the TV 430) and the second digital audio stream 450 comprises the outgoing music audio (from the router 120 to the music speaker 460). To reduce the unwanted background noise from the TV playing nearby, it is advantageous to cancel (or minimise) the audio from the TV program (unwanted sound or noise) using knowledge of the first digital audio stream 440. The first digital audio stream 440 (the outgoing TV program sound) can be used to predict what noise will be in the vicinity of the music speaker 460 as a result of the TV program sound. 
The first digital audio stream 440 is used to generate a signal pattern which would be expected to be produced in the vicinity of the speaker 460, based on the first digital audio stream 440. This signal pattern corresponds to the noise. The signal pattern can then be subtracted from the second digital audio stream 450 to pre-emptively remove the noise caused by the TV program before the stream 450 is sent to the device 460.
Note this example could equally be reversed such that noise caused by the music stream in the vicinity of the TV is cancelled.
As described above, the invention can be applied to a variety of situations involving more than one audio digital stream transmitting through a network gateway, e.g., a router.
Applications

No. | First Audio Stream | Second Audio Stream | Example
1   | Upstream           | Upstream            | Two people giving voice commands to their smart speakers at the same time.
2   | Upstream           | Downstream          | One person talking to their smart speaker (causing unwanted noise) while someone else tries to watch streamed TV.
3   | Downstream         | Upstream            | Streamed TV causing unwanted noise in the background of someone talking to their smart speaker.
4   | Downstream         | Downstream          | Streaming music and TV to two different speakers in the same vicinity. Streamed TV causes unwanted background noise for someone trying to listen to the streamed music. Similarly, streamed music causes unwanted background noise for someone trying to watch the streamed TV.
As described above, some applications of the invention relate to increasing the intelligibility of voice commands received by voice assistants. This application will now be described in more detail. Cancelling (or minimising) background noise produced by neighbouring devices could have three levels. These are explained in the context of the application of increasing the intelligibility of voice commands but are equally applicable for any of the scenarios described above.
A first level, referred to as "generic impulse response", involves generating, based on the first audio digital stream that is known to generate noise (because it is known to be an audio format, mp3, etc.), a predicted signal pattern that we might expect to receive at the microphone. This will be an idealised frequency pattern, based on a generic impulse response, that includes the effect of an idealised echo and frequency response. This generated signal is subtracted (or otherwise removed) from the recorded signal once it has been encoded as a second audio digital stream. A common time reference between the signal recorded on the device and the incoming digital stream that will create the noise (music/film soundtrack etc.) should be established.
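The generic-impulse-response level can be sketched as an idealised impulse response (a direct path plus one idealised echo) together with a cross-correlation search for the common time offset between the recorded signal and the incoming digital stream. All values and names below are illustrative assumptions, not taken from the specification.

```python
import numpy as np

def generic_impulse_response(delay_samples=40, direct_gain=0.7, echo_gain=0.25):
    """An idealised generic impulse response: a direct path plus a single
    idealised echo (all values are illustrative assumptions)."""
    ir = np.zeros(delay_samples + 1)
    ir[0] = direct_gain       # direct sound
    ir[-1] = echo_gain        # one idealised echo after `delay_samples`
    return ir

def align_offset(recorded, stream):
    """Estimate the common time offset between the recorded signal and the
    incoming digital stream via cross-correlation."""
    corr = np.correlate(recorded, stream, mode="full")
    return int(np.argmax(corr)) - (len(stream) - 1)

# Toy check: a stream appearing 25 samples late in the recording is found.
rng = np.random.default_rng(1)
stream = rng.standard_normal(500)
recorded = np.concatenate([np.zeros(25), stream])
offset = align_offset(recorded, stream)
```

Once the offset is known, the generic impulse response can be applied to the time-aligned stream before subtraction.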
A second level, referred to as "modelled impulse response", is similar to the first level, but in this case the system will seek to detect the kind of sound source used to generate the audio and to understand the microphone system being used to listen out for voice commands. Knowledge of the loudspeaker, the volume setting and the microphone would enable the system to choose from a range of impulse responses that characterise the response of the loudspeaker and the frequency response of the microphone used. Such knowledge may be stored in a matrix such that the system can look up what the impulse response is for a given speaker, volume and/or microphone.
A third level, referred to as "measured impulse response", is similar to the second level, but in this case the system has learned, through a training phase (described below), the way that the specific location and set up transform the incoming digital stream. The training phase thus defines a measured impulse response that can be subtracted from the recorded signal in order to remove the known noise from the recorded signal. Again, such knowledge may be stored in a matrix such that the system can look up what the impulse response is for a given location, set up, speaker, volume and/or microphone.
Measured impulse response is the most preferred option, followed by modelled impulse response and then generic impulse response. Measured impulse response will give the most accurate noise cancellation capabilities.
For example, in a situation in which there are (k) loudspeakers and (j) listening devices (listening devices may constitute one or more microphones) there will exist transfer functions that should be applied at the listening device (j) for each of the loudspeaker devices (1-k). There are three possible types of transfer function for each of the potential pairings of loudspeakers (1-k) and listening devices (1-j). Ideally there will be a measured impulse response for a given listening device and loudspeaker pairing and this we denote Me(j,k): the measured impulse response of loudspeaker k as detected by microphone j. In the absence of a measured impulse response, the transfer function may be modelled.
In this case the modelled transfer function is denoted Mo(j,k): the modelled impulse response of loudspeaker k as detected by the listening device j. If there is no modelled or measured impulse response available then a generic impulse response could be used to generate the transfer function and this we denote Ge(j,k): the generic impulse response of loudspeaker k as detected by listening device j.
The system will need to determine which transfer function is available to use. Look-up tables or matrices could be used to do this. Figure 12 depicts a look-up table that shows all the generic transfer functions Ge(j,k). All possible (j,k) pairings have a generic transfer function. Figure 13 depicts a look-up table that shows the modelled transfer function Mo(j,k), when available. Not all (j,k) pairings have a modelled transfer function available.
Figure 14 depicts a look-up table that shows the measured transfer function Me(j,k). Not all (j,k) pairings have a measured transfer function available. In this example it is assumed that it is always possible to generate a generic transfer function and that it is not possible to have a measured transfer function without also having a modelled transfer function. In determining which transfer function to apply, the system may refer to a ledger, depicted in Figure 15, that records the preferred transfer function to use for a given (j,k) pairing. In Figure 15, for example, the preferred transfer function for listening device (1) and loudspeaker (2) is the modelled transfer function, specifically Mo(1,2). Figure 16 shows how the ledger depicted in Figure 15 could be used to generate the preferred transfer function to use for each possible listening device (1-j) and loudspeaker (1-k) pairing.
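The selection logic behind such a ledger can be sketched, under the stated assumptions (a generic transfer function is always available, and a measured one never exists without a modelled one), roughly as follows; the dictionaries and labels are hypothetical.

```python
# Preference order: measured (Me) > modelled (Mo) > generic (Ge).
# It is assumed a generic transfer function is always available and that a
# measured one never exists without a modelled one (as in Figures 12-16).

def preferred_transfer_function(j, k, measured, modelled):
    """Return the label and value of the best available transfer function
    for listening device j and loudspeaker k (illustrative sketch)."""
    if (j, k) in measured:
        return ("Me", measured[(j, k)])
    if (j, k) in modelled:
        return ("Mo", modelled[(j, k)])
    return ("Ge", "generic")           # always available as a fallback

# Hypothetical ledger for two listening devices and two loudspeakers.
measured = {(1, 1): "me_11"}
modelled = {(1, 1): "mo_11", (1, 2): "mo_12"}
ledger = {(j, k): preferred_transfer_function(j, k, measured, modelled)[0]
          for j in (1, 2) for k in (1, 2)}
```

The `ledger` dictionary here plays the role of the Figure 15 ledger, recording one preferred label per (j,k) pairing.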
For the training phase, the impulse response of a system used to play out a sound can be measured. Traditionally, devices may record, on a microphone, the sounds heard following an electrical signal being fed through a loudspeaker. The focus is usually the performance of the loudspeaker, trying to identify how to manipulate the way the sound system operates to make the sound heard in the environment a better representation of the intent.
To do so, the system may be measuring the "time smearing" of an impulse noise due to phase inaccuracy caused by passive crossovers, resonance, energy storage in the loudspeaker cone or the internal volume of the loudspeaker, as well as the acoustic characteristics of the location in which the loudspeaker is sited and its intrinsic frequency response. This approach is used in many audio applications including multi-speaker surround-sound systems coupled to a TV as well as with some loudspeakers, such as those made by the company Sonos, for example. In this case we have a requirement to record the impulse response of sound made through a different audio system. For example, the voice activated response device may be a smart speaker in one room, e.g., the kitchen, and the audio source (e.g. TV) from which we want to obtain the impulse response may be in another room, e.g., the living room. The impulse response could be measured by following a setup process that sends a test signal, or range of test signals, to the loudspeaker system associated with the TV in the living room, but recording the impulse response using the microphones in the smart speaker device (or devices) in the kitchen.
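One way the training phase could estimate such a cross-room impulse response is frequency-domain deconvolution of the recording by the known test signal. The sketch below is one possible approach, not the method claimed; the regularisation constant, signal lengths and path values are illustrative.

```python
import numpy as np

def measure_impulse_response(test_signal, recording, ir_len, eps=1e-8):
    """Estimate the impulse response of the path from one device's
    loudspeaker to another device's microphone by deconvolving the
    recording with the known test signal (frequency-domain sketch)."""
    n = len(recording)
    X = np.fft.rfft(test_signal, n)
    Y = np.fft.rfft(recording, n)
    # Regularised division avoids blow-up where the test signal has
    # little energy at a given frequency.
    H = Y * np.conj(X) / (np.abs(X) ** 2 + eps)
    return np.fft.irfft(H, n)[:ir_len]

# Toy check: recover a known three-tap path from a noise-like test signal.
rng = np.random.default_rng(2)
test = rng.standard_normal(4096)          # hypothetical test signal
true_ir = np.array([0.5, 0.2, 0.1])       # hypothetical acoustic path
rec = np.convolve(test, true_ir)          # what the other microphone records
est = measure_impulse_response(test, rec, ir_len=3)
```

A broadband test signal (noise or a swept sine) excites all frequencies, which is what makes the division well conditioned.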
Figure 5 is a flowchart of an exemplary method 500 for removing interfering/additional audio streams (a first digital audio stream) from voice commands (a second digital audio stream) passed to a voice command interpreter.
The exemplary method 500 starts with checking 502 whether a new (i.e. additional) audio stream has been detected (new in the sense that the system has not yet accounted for the presence of the stream).
If an additional audio stream is not detected, no action is taken and the system continues to monitor for additional streams.
If an additional audio stream is detected, the additional audio stream may cause background noise in the background of a user's voice commands, and the method 500 proceeds to the next step.
When an interfering audio stream is detected at step 502 a prediction is made of what the interfering audio stream sounds like when picked up by the microphone that also picks up the voice command.
In the ideal case (the third level described above), this prediction is based on the measured impulse response comprising the device playing back the interrupting sound (background noise) (the device comprising the loudspeaker and associated electronics), the acoustic path between the device playing the interrupting sound and the device picking up the voice command, and the device picking up the voice command (comprising the microphone and associated electronics). This measured response is used to control a filter such that the unwanted signal is subtracted from the voice command signal thus creating the 'clean' voice command to be passed to the voice command interpreter. In practice, the measured response may provide the initial state of an adaptive filter whereby due to the time-varying nature of acoustic paths it may be necessary to update the filter as properties of the originally measured environment change (for example, due to movement of people and objects within the acoustic environment).
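The adaptive filter described above could, for example, be a normalised LMS (NLMS) filter initialised from the measured impulse response and updated sample by sample as the acoustic path changes. The following is an illustrative sketch only; the step size, filter length and signals are arbitrary choices, not taken from the specification.

```python
import numpy as np

def nlms_cancel(reference, mic, w_init, mu=0.5, eps=1e-6):
    """Adaptive interference cancellation (NLMS sketch). `reference` is the
    known interfering stream, `mic` the microphone signal, and `w_init` a
    filter initialised from the measured impulse response."""
    w = np.array(w_init, dtype=float)
    L = len(w)
    cleaned = np.zeros(len(mic))
    for n in range(len(mic)):
        # Most recent L reference samples, newest first (zero-padded early on).
        x = reference[max(0, n - L + 1):n + 1][::-1]
        x = np.pad(x, (0, L - len(x)))
        e = mic[n] - np.dot(w, x)                 # error = cleaned sample
        w += mu * e * x / (np.dot(x, x) + eps)    # track path changes
        cleaned[n] = e
    return cleaned, w

# Toy check: if the filter starts at the true path and the path does not
# move, the interference is cancelled exactly and the filter stays put.
rng = np.random.default_rng(3)
ref = rng.standard_normal(2000)
path = np.array([0.5, 0.25])
mic = np.convolve(ref, path)[:2000]       # microphone hears only the noise
cleaned, w = nlms_cancel(ref, mic, w_init=path)
```

In practice the path drifts (people and objects move), and the update term is what lets the filter follow those changes from its measured starting point.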
Therefore, at step 504 the method 500 checks whether the measured impulse response is available. If the measured impulse response is available, then the measured impulse response is used 506. If the measured impulse response is not available, a specific modelled response is preferred (the second level described above), using knowledge of the related components and configuration, including the specific loudspeaker reproducing the interfering audio, the volume setting of that loudspeaker, the microphone picking up the voice command and a prediction of the room-related acoustic response.
Therefore, at step 508 the method 500 checks whether the specific modelled impulse response is available. If the specific modelled impulse response is available, then the specific modelled impulse response is used 510. If the specific modelled response is not available, a generic model is used 512 (the first level described above).
No matter which impulse response is used (measured, modelled or generic), the impulse response is applied to the additional audio stream signal and then that modified stream is subtracted from the voice command stream 514 to provide a cleaned voice command stream. The cleaned voice command stream is then sent to a voice command interpreter (usually located in the cloud) 516.
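The decision chain of method 500 (steps 504 to 514) can be summarised in a short sketch: pick the best available impulse response, apply it to the interfering stream and subtract the prediction from the voice command stream. All names and values are illustrative assumptions.

```python
import numpy as np

def method_500_clean(voice_stream, interfering_stream,
                     measured=None, modelled=None, generic=(1.0,)):
    """Sketch of method 500: pick the best available impulse response
    (measured > modelled > generic), apply it to the interfering stream
    and subtract the prediction from the voice command stream."""
    if measured is not None:
        ir = np.asarray(measured)     # step 506: measured impulse response
    elif modelled is not None:
        ir = np.asarray(modelled)     # step 510: modelled impulse response
    else:
        ir = np.asarray(generic)      # step 512: generic impulse response
    predicted = np.convolve(interfering_stream, ir)[:len(voice_stream)]
    return voice_stream - predicted   # step 514: cleaned voice command stream

# Toy check with a hypothetical measured impulse response.
rng = np.random.default_rng(4)
tv = rng.standard_normal(500)         # interfering (first) stream
voice = rng.standard_normal(500)      # wanted voice command
ir = np.array([0.4, 0.1])
mic = voice + np.convolve(tv, ir)[:500]
clean = method_500_clean(mic, tv, measured=ir)
```

The cleaned stream would then be sent to the voice command interpreter (step 516).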
Figures 6 and 7 are example system diagrams showing the main system components for implementing the removal of interfering audio streams from voice commands passed to a voice command interpreter. Although these figures relate to the application of the present invention to voice commands, the components and processes described here equally apply to the other applications described.
With respect to Figure 6, the training phase for the third level is illustrated (inactive components are greyed out). Example system 600 comprises an Internet router 120, a voice activated voice response device (e.g., a smart speaker) 630 and a media player/renderer 660 which is in communication with a loudspeaker 640.
The router 120 comprises an audio stream detector 122, an impulse response data store 621 and an impulse response data collector 623.
The voice activated voice response device 630 comprises an impulse response calculator 631, an adaptive interference canceller 632, a command sender 633, a command listener 634, a wake word detector 635, an audio transform calibrator 637 and a microphone 636. The audio transform calibrator 637 comprises a sound generator 638 and a sound recorder 639.
The audio transform calibrator 637 is responsible for orchestrating a calibration procedure. During the procedure the following is repeated for every media player/renderer 660 (i.e., device) that has the potential to reproduce interfering audio streams: a known test signal is generated at the sound generator 638 and sent to the media player/renderer 660 such that it is reproduced as an acoustic signal by its loudspeaker 640. This acoustic signal is recorded via the microphone 636 of the voice activated response device 630. The response of the signal chain and room is calculated and stored in the impulse response data collector 623 and the impulse response data store 621 of the network/Internet router 120.
With respect to Figure 7, normal operation of the system is illustrated (inactive components are greyed out). Example system 700 is similar to example system 600, and like components are labelled accordingly.
The router 120 comprises an audio stream detector 122, an impulse response data store 621 and an impulse response data collector 623.
The audio stream detector 122 in the Internet router 120 is responsible for detecting streams that contain interfering audio. For each interfering stream: the audio stream detector 122 informs the impulse response data store 621 of the destination device and sends the audio stream to the adaptive interference canceller 632. The impulse response data store 621 sends the measured electroacoustic response (for the third level), the modelled impulse response (for the second level) or the generic impulse response (for the first level) to the adaptive interference canceller 632. The adaptive interference canceller 632 applies a filter, based on this response data, to the detected interfering audio stream, subtracts the filtered signal from the signal received from the microphone 636 of the voice activated voice response device 630, and sends the resulting signal to the wake word detector 635 or the command listener 634, as appropriate to the mode of operation. The wake word detector 635 and/or the command listener 634 may be located in the cloud 110 rather than in the device 630.
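For the third level, the filter-and-subtract performed by the adaptive interference canceller 632 amounts to convolving the detected interfering stream with the stored response and subtracting the result from the microphone signal. A minimal sketch, assuming a time-aligned reference stream and a previously measured response (the function name is illustrative, not from the patent):

```python
import numpy as np

def cancel_interference(mic_signal, interfering_stream, impulse_response):
    """Subtract a prediction of the interfering audio, as heard at the
    microphone, from the microphone signal."""
    # Predict how the interfering stream sounds after the loudspeaker and room
    predicted = np.convolve(interfering_stream, impulse_response)[:len(mic_signal)]
    return mic_signal - predicted
```

A real canceller would also have to compensate for the playout delay between the stream observed at the router and the corresponding sound arriving at the microphone.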
Figure 8 illustrates a simplified example of how the different streams, the predicted signal pattern, and the noise cancelled stream may look. For example, the first audio digital stream, comprising an outgoing TV sound signal, may look like Audio stream 1. The second audio digital stream, comprising an incoming sound signal from a smart speaker (the stream comprising a user's voice commands and background noise from the TV playing in the vicinity of the smart speaker), may look like Audio stream 2. The generated prediction of the background noise signal pattern is exemplified as Generated signal pattern. This signal pattern may be predicted using any of the three levels described above, i.e., the measured impulse response, the modelled impulse response or the generic impulse response. The generated signal pattern is then subtracted (or otherwise removed) from Audio stream 2 to produce a noise cancelled (or noise reduced) stream in which the intelligibility of the voice command is increased.
Figure 9 is a flow chart of an exemplary method 900. Method 900 can be applied to the following scenarios: two downstream streams, two upstream streams, and one upstream stream and one downstream stream. The first step 910 is detecting, at a network router, at least a first audio digital stream and a second audio digital stream. The second step 920 is predicting acoustic crosstalk interference present in the second audio digital stream, the acoustic crosstalk interference resulting from a first audio acoustic signal encoded by the first audio digital stream, the prediction using the first audio digital stream. The final step is removing the predicted acoustic crosstalk interference from the second audio digital stream to generate a crosstalk reduced second audio digital stream having less acoustic crosstalk interference from the first audio acoustic signal.
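When the electroacoustic response drifts (people move around the room, the volume changes), the removal step can be implemented with an adaptive filter that tracks the response online rather than relying on a fixed measurement. The following normalised-LMS (NLMS) canceller is our illustrative sketch of such an adaptive scheme, not an algorithm stated in the patent:

```python
import numpy as np

def nlms_cancel(mic_signal, reference, num_taps=32, mu=0.5, eps=1e-6):
    """Normalised LMS adaptive canceller: learn the response from the
    reference stream to the microphone on the fly, and output the residual,
    i.e. the crosstalk-reduced signal."""
    w = np.zeros(num_taps)            # adaptive filter taps
    x_buf = np.zeros(num_taps)        # most recent reference samples
    out = np.zeros(len(mic_signal))   # crosstalk-reduced output
    for n in range(len(mic_signal)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = reference[n]
        e = mic_signal[n] - w @ x_buf                 # residual after cancellation
        out[n] = e
        w += mu * e * x_buf / (x_buf @ x_buf + eps)   # NLMS tap update
    return out
```

The residual both serves as the cancelled output and drives the tap update, so the filter converges towards the current room response without a separate calibration pass.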
Figure 10 is a flow chart of an exemplary method 1000. Method 1000 is an example of some embodiments of the present invention applied to a scenario where the first audio digital stream is upstream and the second audio digital stream is downstream. Such an example is described in detail above in the second example relating to Figure 2. The first step 1010 is detecting a first audio acoustic signal at a device. At step 1020 the first audio acoustic signal is encoded to produce a first audio digital stream at the device. At step 1030 the first audio digital stream is transmitted from the device to a network router. At step 1040 the first audio digital stream is received at the network router. At step 1050 a second audio digital stream is received at the network router from the cloud. At step 1060 the first audio digital stream and the second audio digital stream are detected at the network router. At step 1070 acoustic crosstalk interference present in the second audio digital stream is predicted, the acoustic crosstalk interference resulting from the first audio acoustic signal encoded by the first audio digital stream, the prediction using the first audio digital stream. The final step 1080 is removing the predicted acoustic crosstalk interference from the second audio digital stream to generate a crosstalk reduced second audio digital stream having less acoustic crosstalk interference from the first audio acoustic signal.
Figure 11 is a flow chart of an exemplary method 1100. Method 1100 is an example of some embodiments of the present invention applied to a scenario where the first audio digital stream is downstream and the second audio digital stream is upstream. Such an example is described in detail above in the first example relating to Figure 2. The first step 1110 is receiving a first audio digital stream at a network router. At step 1120 the first audio digital stream is transmitted from the network router to a device. At step 1130 the first audio digital stream is decoded to produce a first audio acoustic signal at the device. At step 1140 a second audio acoustic signal is detected at a device (the same device or a different device), the second audio acoustic signal comprising acoustic crosstalk interference resulting from the first audio acoustic signal. At step 1150 the second audio acoustic signal is encoded to produce a second audio digital stream at the device. At step 1160 the second audio digital stream is transmitted from the device to a network router. At step 1170 the second audio digital stream is received at the network router. At step 1180 the first audio digital stream and the second audio digital stream are detected at the network router. At step 1190 acoustic crosstalk interference present in the second audio digital stream is predicted, the acoustic crosstalk interference resulting from the first audio acoustic signal encoded by the first audio digital stream, the prediction using the first audio digital stream. The final step 1195 is removing the predicted acoustic crosstalk interference from the second audio digital stream to generate a crosstalk reduced second audio digital stream having less acoustic crosstalk interference from the first audio acoustic signal.
Various modifications, whether by way of addition, deletion, or substitution of features, may be made to the above described embodiments to provide further embodiments, any and all of which are intended to be encompassed by the appended claims.

Claims (15)

  1. A method for reducing the effect of acoustic crosstalk interference using knowledge of a digital audio stream, the method comprising: detecting at a network router, at least a first audio digital stream and a second audio digital stream, the first audio digital stream encoding a first audio acoustic signal and the second audio digital stream encoding at least a second audio acoustic signal; predicting acoustic crosstalk interference of the first audio acoustic signal with the second audio acoustic signal that would occur upon simultaneous production of the first and second acoustic signals, the prediction using the first audio digital stream; and reducing the effect of the predicted acoustic crosstalk interference by modifying the second audio digital stream with a cancellation signal generated based on the prediction of the acoustic crosstalk interference.
  2. The method of claim 1, wherein modifying the second audio digital stream with the cancellation signal generated based on the prediction of the acoustic crosstalk interference comprises subtracting the predicted acoustic crosstalk interference from the second audio digital stream.
  3. The method of any of the preceding claims, wherein the first audio digital stream is sent between the network router and a first device and predicting the acoustic crosstalk interference comprises using properties of the first device.
  4. The method of claim 3, wherein the second audio digital stream is sent between the network router and a second device and predicting the acoustic crosstalk interference comprises using properties of the second device.
  5. The method of claim 3 or 4, wherein predicting the acoustic crosstalk interference comprises using a matrix of transfer functions comprising the properties of the first device and/or the properties of the second device.
  6. The method of claim 3, 4 or 5, wherein predicting the acoustic crosstalk interference comprises using a matrix of impulse response functions comprising impulse response functions of the environment where the first audio acoustic signal is generated.
  7. The method of claim 6, wherein the impulse response functions are learned via a machine learning process.
  8. The method of any of the preceding claims, wherein the first audio digital stream is upstream and the first audio acoustic signal is recorded and encoded as the first audio digital stream by a first device, preferably the first device comprises a microphone.
  9. The method of any of claims 1-7, wherein the first audio digital stream is downstream and the first audio digital stream is decoded and reproduced as the first acoustic signal by a first device, preferably the first device comprises a speaker.
  10. The method of any of the preceding claims, wherein the second audio digital stream further encodes the acoustic crosstalk interference of the first audio acoustic signal; and the effect of the acoustic crosstalk interference is reduced by minimising the acoustic crosstalk interference component in the second audio digital stream by applying the cancellation signal to the second audio digital stream.
  11. The method of any of the preceding claims, wherein the second audio digital stream is upstream and the second audio acoustic signal is recorded and encoded as the second audio digital stream by the first device or a second device, preferably the first or second device comprises a microphone.
  12. The method of any of claims 1-9, wherein the effect of the predicted acoustic crosstalk interference is reduced by applying the cancellation signal to the second audio digital stream such that the second audio digital stream encodes the second audio acoustic signal and a cancellation acoustic signal.
  13. The method of any of claims 1-9 or 12, wherein the second audio digital stream is downstream and the second audio digital stream is decoded and reproduced as the second acoustic signal by the first device or a second device, preferably the first or second device comprises a speaker.
  14. A system for reducing the effect of acoustic crosstalk interference using knowledge of a digital audio stream, the system comprising: a processor; and a memory including computer program code; the memory and the computer code configured to, with the processor, cause the system to perform the method of any of the preceding claims.
  15. A system for reducing the effect of acoustic crosstalk interference using knowledge of a digital audio stream, the system comprising: an audio digital stream detector arranged to detect at least a first audio digital stream and a second audio digital stream at a network router, the first audio digital stream encoding a first audio acoustic signal and the second audio digital stream encoding at least a second audio acoustic signal; and a processor arranged to: predict acoustic crosstalk interference of the first audio acoustic signal with the second audio acoustic signal that would occur upon simultaneous production of the first and second acoustic signals, the prediction using the first audio digital stream; and reduce the effect of the predicted acoustic crosstalk interference by modifying the second audio digital stream with a cancellation signal generated based on the prediction of the acoustic crosstalk interference.
GB2118540.0A 2021-12-20 2021-12-20 Noise cancellation Pending GB2613898A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB2118540.0A GB2613898A (en) 2021-12-20 2021-12-20 Noise cancellation
PCT/EP2022/082932 WO2023117272A1 (en) 2021-12-20 2022-11-23 Noise cancellation


Publications (1)

Publication Number Publication Date
GB2613898A true GB2613898A (en) 2023-06-21



Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017074321A1 (en) * 2015-10-27 2017-05-04 Ambidio, Inc. Apparatus and method for sound stage enhancement
WO2020011085A1 (en) * 2018-07-12 2020-01-16 阿里巴巴集团控股有限公司 Crosstalk data detection method and electronic device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8355512B2 (en) * 2008-10-20 2013-01-15 Bose Corporation Active noise reduction adaptive filter leakage adjusting
US9142205B2 (en) * 2012-04-26 2015-09-22 Cirrus Logic, Inc. Leakage-modeling adaptive noise canceling for earspeakers
US9330652B2 (en) 2012-09-24 2016-05-03 Apple Inc. Active noise cancellation using multiple reference microphone signals
US10957342B2 (en) * 2019-01-16 2021-03-23 Cirrus Logic, Inc. Noise cancellation


