GB2575873A - Processing audio signals - Google Patents

Processing audio signals


Publication number
GB2575873A
GB2575873A
Authority
GB
United Kingdom
Prior art keywords
audio signal
audio
matched
signals
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1812289.5A
Other versions
GB201812289D0 (en)
Inventor
Hendrik Lambertus Muller
Andrew Graham Stanford-Jason
Chris Clarke
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xmos Ltd
Original Assignee
Xmos Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xmos Ltd filed Critical Xmos Ltd
Priority to GB1812289.5A
Publication of GB201812289D0
Publication of GB2575873A
Legal status: Withdrawn

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 — Noise filtering
    • G10L 21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L 21/0224 — Processing in the time domain
    • G10L 21/04 — Time compression or expansion
    • G10L 21/055 — Time compression or expansion for synchronising with other signals, e.g. video signals
    • G10L 25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/06 — Speech or voice analysis techniques where the extracted parameters are correlation coefficients
    • G10L 25/48 — Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/54 — Speech or voice analysis techniques specially adapted for comparison or discrimination, for retrieval

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An audio signal detected by a user device microphone (202) is processed by identifying signals from the user or the device 102 and then matching other, unknown signals 106 with predetermined signals for subsequent cancellation or removal 104. Unknown portions of audio are periodically transmitted for matching via spectral feature vectors to candidate prerecorded audio such as music or video (210) according to a difference threshold (fig. 4), and the modified audio may be used for speech recognition of commands 108 or keywords (208).

Description

PROCESSING AUDIO SIGNALS
Technical Field
The present disclosure relates to the processing of audio signals detected by a microphone of a user device in order to remove unwanted signals such as background music.
Background
Voice or speech recognition systems are becoming more prevalent in everyday life. For example, speech recognition enables the recognition and translation of spoken language into text and actions by computers. Speech recognition can also enable recognition of speech commands for the purpose of speech control. Such techniques may also be known as automatic speech recognition (ASR), computer speech recognition or speech-to-text (STT). User devices such as smartphones and smart devices often have a voice-user interface that makes human interaction with computers possible through a voice (speech) platform in order to initiate an automated service or process. Speech recognition applications include: smart device control (e.g. voice dialling), domestic appliance control, searching (e.g. for a music track or television programme), data entry (e.g. entering a credit card number), speech-to-text processing (e.g. word processors or emails), etc. Some user devices have an “intelligent” virtual assistant that can perform tasks or services for a user of the user device. Virtual assistants use natural language processing (NLP) to match voice input to executable commands.
A voice system, which may include a speech recognition system, typically has at least four components: an echo cancellation component, a noise cancellation component, a keyword spotting component, and a cloud service.
Keyword spotting is used to detect when a known phrase is received by the voice system to activate the user device, and send a command away to a cloud service (e.g. “Alexa, what is the time”, or “Hi Siri, do I need an umbrella”), or to perform a local operation (e.g. “Volume seven”). A cloud service may comprise one or more elements that may perform automatic speech recognition (ASR) and, for example, stream data (e.g. music, voice responses, commands, etc.) back to the device.
The purpose of the echo and noise cancellation components is to improve voice quality, e.g. to improve speech recognition or improve the perceptive quality of a phone call. Echo cancellation attempts to improve voice quality by removing echoes of the signal being played by the device. Echo cancellation involves first recognizing the originally transmitted audio signal that re-appears, with some delay, in the transmitted or received signal. Once the echo is recognized, it can be removed by subtracting it from the transmitted or received signal. That is, echo cancellation attempts to remove a known sound source that is being played through speakers controlled by the voice system itself, by modelling the echoes that a known sound has produced in an environment.
Echo cancellation can perform relatively well. In some scenarios it can remove around 30 dB of the signal. An echo canceller works by building one or more finite impulse response filters (FIRs), with transfer functions H_{i,j} that describe the acoustic path from the speaker(s) to the microphone(s) such that:
e_j = y_j − Σ_{i=0}^{S−1} H_{i,j}(x_i)
Where:
y_j is the signal picked up by microphone j;
x_i is the signal played out on speaker i;
H_{i,j}(·) is the FIR transfer function describing the acoustic path from speaker i to microphone j; and
e_j is the near-end audio signal detected on microphone j, which is the audio signal picked up (y_j) minus the detected signal that was played back (Σ_{i=0}^{S−1} H_{i,j}(x_i)).
H is a room-dependent matrix that varies over time (e.g. as doors and windows may be opened and closed), and it is continually learned over time. The learning algorithm works on the assumption that often there is no other noise and just the source audio signal is playing and it then solves the equation:
0 = y_j − Σ_{i=0}^{S−1} H_{i,j}(x_i)

using an algorithm such as Least Mean Squares (LMS).
There are two requirements for echo cancellation. First, echo cancellation needs the signal x that is played out over the speakers, otherwise it cannot remove this signal. Second, in order to learn the contents of H the echo canceller again needs the signal x.
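The LMS learning step described above can be sketched as follows. This is an illustrative single-speaker, single-microphone example, not the implementation of this disclosure; the parameter names (`taps`, `mu`) and the specific update rule are assumptions.

```python
import numpy as np

def lms_echo_cancel(x, y, taps=64, mu=0.01):
    """Remove an estimate of the echo of reference signal x from microphone
    signal y. Returns the near-end (error) signal e, where
    e[n] = y[n] - H(x)[n] and the FIR filter H is learned per sample via LMS."""
    h = np.zeros(taps)                  # FIR estimate of the acoustic path
    e = np.zeros(len(y))
    for n in range(len(y)):
        # Most recent 'taps' reference samples, newest first, zero-padded
        x_win = x[max(0, n - taps + 1):n + 1][::-1]
        x_win = np.pad(x_win, (0, taps - len(x_win)))
        y_hat = h @ x_win               # predicted echo
        e[n] = y[n] - y_hat             # near-end estimate
        h += mu * e[n] * x_win          # LMS weight update
    return e
```

Note how both requirements appear directly in the code: the reference x is needed both to form the predicted echo and to drive the weight update.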
Summary
Noise cancellation attempts to remove any sounds present in the room that are not the desired voice or speech (noise as referred to herein means not just random noise, but also unwanted background sounds). A signal may be improved by suppressing a variety of noise sources not limited to, for example, a radio playing music in the background or a kitchen appliance. However, noise cancellation is much more difficult to perform than echo cancellation because there is no reference signal that can be used to model any echoes.
Keyword spotting is an example of a process that is negatively affected by noise. Keyword spotting takes place in the latter stages of an audio signal processing pipeline, after the signal has been cleaned to some extent. Keyword spotting is an imprecise science, and the keyword spotter is traditionally trained to still be able to spot a keyword despite a reasonable amount of background noise. However, keyword spotting will not work with too much background noise, and there is a trade-off between the resources (e.g. memory and processors) required by the keyword spotter and both the number of times that the keyword spotter creates a false positive response (i.e. it thought there was a keyword, but in fact there wasn't), and the chances of a false negative (i.e. it missed a keyword). In general, being able to remove more echoes and more “noise” improves the keyword spotting and/or reduces the resources required for the keyword spotter.
There may also be a variety of other scenarios where it would be desirable to remove background sounds such as background music or the like, e.g. to remove background sounds from a phone call, or prior to speech-to-text conversion.
According to a first aspect disclosed herein, there is provided a method of processing a source audio signal detected by one or more microphones of a user device, wherein the source audio signal comprises a user voice signal from a user of the user device and one or more unknown audio signals, the unknown audio signals being audio signals that are not produced by the user nor the user device, wherein the method comprises: accessing a matching service for matching received audio to predetermined audio signals; wherein said accessing of the matching service comprises obtaining, from the matching service, a matched audio signal matched with at least one of the unknown signals from amongst said predetermined audio signals; and generating a modified audio signal, wherein generating the modified audio signal comprises removing the matched audio signal from the source audio signal.
The invention improves the process of removing unknown audio signals from a source audio signal by establishing what sound the unknown audio signal is. Samples of the unknown audio signal (e.g. background noise) are sent to a matching service (e.g. a music database service) that matches those samples to a database of known sounds (e.g. based on pattern matching). When the matching service recognises the unknown audio signal, it can stream the known (matched) audio signal to the cancelling service, which can then remove the known audio signal from the source audio signal. For example, a traditional echo cancelling algorithm, as described above, may be used to remove the known audio signal from the voice system. This therefore improves the pick-up of any user-voice signals present in the audio signal, making the perception of the user voice (e.g. speech) easier for a recipient user or a voice recognition system, for example.
In embodiments, the method may comprise receiving the source signal from the user device; wherein said accessing comprises transmitting at least a portion of the source audio signal to the matching service and receiving back the matched audio signal in response.
In embodiments, said accessing may also comprise receiving the source audio signal from the matching service, and wherein said removing comprises removing the matched audio signal from the received source audio signal.
In embodiments, said transmitting may comprise: periodically transmitting at least a portion of the source audio signal to the matching service; or continuously transmitting at least a portion of the source audio signal to the matching service.
In embodiments, the method may comprise determining that at least the portion of the source audio signal comprises one or more unknown audio signals, wherein said transmitting comprises transmitting at least said portion of the source audio signal to the matching service in response to determining that said portion of the source audio signal comprises one or more unknown audio signals.
In embodiments, the method may comprise outputting the modified audio signal to be played out through one or more speakers.
In embodiments, the method may comprise outputting the modified audio signal to be played out through one or more speakers, wherein said one or more speakers are part of said user device.
In embodiments, the method may comprise outputting, in response to a request from the user, the modified audio signal to be played out through the one or more speakers in synchronisation with the source of the unknown audio signals.
In embodiments, the method may comprise outputting the modified audio signal to a speech recognition service to cause one or more speech commands to be recognised based on speech of the user in the user voice signal.
In embodiments, the source audio signal may be part of an audio or video call, and wherein the method may comprise outputting the modified audio signal as part of said audio or video call to a user device of a recipient user.
In embodiments, the unknown audio signals may comprise one or more of: (i) audio signals from a music track, (ii) audio signals from a video, (iii) audio signals from a podcast, (iv) audio signals from a television programme, and/or (v) audio signals from a radio channel.
In embodiments, the at least one unknown audio signal may be a pre-recorded audio signal.
In embodiments, the at least one unknown audio signal may be a live stream.
In embodiments, the matching service may be for matching received feature vectors to predetermined feature vectors generated based on predetermined audio signals; and wherein said obtaining the matched audio signal may comprise obtaining, from the matching service, the matched audio signal having a feature vector matched with a feature vector generated based on said at least one unknown audio signal, from amongst said predetermined feature vectors.
In embodiments, the method may comprise generating the feature vector based on at least the portion of the source audio signal; and wherein said transmitting at least the portion of the source audio signal to the matching service may comprise transmitting the generated feature vector to the matching service.
In embodiments, the generated feature vector may comprise a spectral fingerprint comprising a plurality of spectral elements from a spectrogram of at least the portion of the source audio signal.
In embodiments, the spectral fingerprint may be a condensed form of the spectrogram, wherein the plurality of spectral elements is less than a total number of spectral elements in the spectrogram.
In embodiments, the method may comprise generating the spectrogram of the at least the portion of the source audio signal; and generating the spectral fingerprint based on the generated spectrogram.
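As a minimal sketch of one way such a condensed spectral fingerprint could be computed (a peak-picking approach; the function and parameter names are illustrative assumptions, not taken from this disclosure):

```python
import numpy as np

def spectral_fingerprint(signal, frame=512, hop=256, peaks_per_frame=4):
    """Condense a signal into a small set of (frame, bin) spectral peaks.
    For each windowed frame, only the strongest frequency bins are kept,
    so the fingerprint has far fewer elements than the full spectrogram."""
    window = np.hanning(frame)
    fp = []
    for i, start in enumerate(range(0, len(signal) - frame + 1, hop)):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame] * window))
        top = np.argsort(spectrum)[-peaks_per_frame:]   # strongest bins
        fp.extend((i, int(b)) for b in sorted(top))
    return fp
```

A fingerprint of this kind is compact enough to transmit periodically to a matching service in place of raw audio.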
In embodiments, said generating of the modified audio signal may comprise determining a time offset between the matched audio signal and the portion of the source audio signal; and based on the determined time offset, synchronising the matched audio signal with the source audio signal, wherein said removing of the matched audio signal from the source audio signal may comprise removing the synchronised matched audio signal from the source audio signal.
In embodiments, the method may comprise determining a playback rate offset between the matched audio signal and the portion of the source audio signal; correlating the matched audio signal with the source audio signal by adjusting the playback rate of the matched audio signal based on the determined playback rate offset; wherein said removing of the matched audio signal from the source audio signal may comprise removing the correlated matched audio signal from the source audio signal.
In embodiments, the method may comprise determining an error value representing the difference between the matched audio signal and the source audio signal; and in response to determining that the error value is greater than a threshold value, obtaining, from the matching service, a new matched audio signal matched with a different one of the unknown signals from amongst said predetermined audio signals; and generating a new modified audio signal, wherein generating the new modified audio signal may comprise removing the new matched audio signal from the source audio signal.
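The threshold-triggered re-match described above might look like the following sketch. The `request_new_match` callable stands in for the round-trip to the matching service and is purely illustrative; mean-squared error is one possible (assumed) choice of error value:

```python
import numpy as np

def maybe_rematch(source, matched, threshold, request_new_match):
    """Keep the current matched signal if the residual error is acceptable;
    otherwise obtain a new matched audio signal from the matching service."""
    n = min(len(source), len(matched))
    error = float(np.mean((source[:n] - matched[:n]) ** 2))
    if error > threshold:
        return request_new_match()   # fetch a new candidate match
    return matched
```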
In embodiments, the matching service may be configured to access a database of predetermined audio signals to identify the matched audio signal.
In embodiments, said accessing of the matching service may comprise obtaining, from the matching service, a plurality of candidate audio signals, wherein the candidate audio signals have been identified as candidates from amongst the database of predetermined audio signals for matching with the at least one unknown audio signal; and wherein generating the modified audio signal may comprise: identifying one of the candidate audio signals as the matched audio signal; and removing the matched audio signal from the source audio signal.
In embodiments, the method may comprise transmitting a playlist of audio signals to the matching service, wherein the candidate audio signals are identified from the playlist of audio signals, and wherein said identifying of one of the candidate audio signals as the matched audio signal may be based on the candidate audio signals identified from the playlist of audio signals.
In embodiments, the method may comprise determining a respective error value for each of the candidate audio signals, wherein the respective error value represents a difference between a respective candidate audio signal and the source audio signal; wherein said identifying of one of the candidate audio signals as the matched audio signal may comprise identifying the candidate audio signal having the smallest respective error value.
In embodiments, the method may comprise transmitting a message to the matching service, wherein the message indicates, for one or more of the candidate audio signals, whether that candidate audio signal was identified as the matched audio signal.
In embodiments, the method may comprise the matching service learning, based on the candidate audio signals obtained by the cancelling service, a sequence list of audio signals.
In embodiments, said accessing of the matching service for matching received audio to predetermined audio signals may require at least one payment criterion to have been met.
According to a second aspect disclosed herein, there is provided a controller for processing a source audio signal detected by one or more microphones of a user device, wherein the source audio signal comprises a user voice signal from a user of the user device and one or more unknown audio signals, the unknown audio signals being audio signals that are not produced by the user nor the user device, wherein the controller is configured to: access a matching service for matching received audio to predetermined audio signals to obtain, from the matching service, a matched audio signal matched with at least one of the unknown signals from amongst said predetermined audio signals; and generate a modified audio signal, wherein generating the modified audio signal comprises removing the matched audio signal from the source audio signal.
According to a third aspect disclosed herein, there is provided a system for processing a source audio signal detected by one or more microphones of a user device, wherein the source audio signal comprises a user voice signal from a user of the user device, and one or more unknown audio signals, the unknown audio signals being audio signals that are not produced by the user nor the user device, wherein the system comprises: a matching service configured to match received audio signals to predetermined audio signals; and a cancelling service configured to: access the matching service for matching received audio to predetermined audio signals to obtain, from the matching service, a matched audio signal matched with at least one of the unknown signals from amongst said predetermined audio signals; and generate a modified audio signal, wherein generating the modified audio signal comprises removing the matched audio signal from the source audio signal.
According to a fourth aspect disclosed herein, there is provided a computer program for processing a source audio signal detected by one or more microphones of a user device, wherein the source audio signal comprises a user voice signal from a user of the user device and one or more unknown audio signals, the unknown audio signals being audio signals that are not produced by the user nor the user device, wherein the computer program comprises instructions embodied on computer-readable storage and configured so as, when the program is executed by a computer of a cancelling service, to cause the computer to perform operations of: accessing a matching service for matching received audio to predetermined audio signals; wherein said accessing of the matching service comprises obtaining, from the matching service, a matched audio signal matched with at least one of the unknown signals from amongst said predetermined audio signals; and generating a modified audio signal, wherein generating the modified audio signal comprises removing the matched audio signal from the source audio signal.
Brief Description of the Drawings
To assist understanding of the present disclosure and to show how embodiments may be put into effect, reference is made by way of example to the accompanying drawings in which:
Figure 1 shows schematically an example flow of audio signals to and from a cancelling service for removing matched audio signals from a source audio signal,
Figure 2 shows schematically an example system for removing matched audio signals from a source audio signal,
Figure 3 shows schematically an example of a cancelling service for removing matched audio signals from a source audio signal,
Figure 4 shows schematically an example state diagram for learning a finite impulse response filter, and
Figure 5 shows schematically an example of a power spectrum of an audio signal.
Detailed Description
Figure 1 illustrates an example system 100 for processing a source audio signal. The system 100 comprises a user device 102, a cancelling service 104, a matching service 106 and, optionally, a speech recognition service 108. The user device 102, cancelling service 104, matching service 106 and speech recognition service 108 may each be connected via a network (not shown), e.g. the internet. Or as a variant, in some cases one or both of the cancelling service 104 and speech recognition service 108 may be implemented on the user device 102.
The user device 102 may in general be any device with one or more microphones arranged to detect audio signals. For example, the user device 102 may be a media device such as a television set, a personal computing device such as a laptop or desktop or tablet computer, a video game console, a mobile device including mobile or cellular phone (including a so-called “smart phone”), a dedicated media player (e.g. an MP3 or similar player), a wearable communication device (including so-called “smart watches”), etc. The user device 102 may also be a “voice-controllable” device, including so-called “smart devices” with intelligent personal assistants. Voice-controllable devices are capable of one or two-way voice interaction with a user of the user device 102, and may even interact with other connected devices or services. To interact with a voice-controllable device, a user typically says a keyword, often called a “wake-up word” or command, such as “Hey Siri”, “Hello Alexa”, etc. The user can then issue a voice command for the user device 102 to perform actions such as, for example, provide weather and sports reports, play music, manage alarms, timers and to-do lists, etc. The user device 102 may be controllable by voice commands.
Examples of a voice-controllable device 102 include Amazon Echo™, Amazon Dot™, Google Home™, etc.
The user device 102 may optionally have a user interface such as, for example, a display in the form of a screen for receiving inputs from the user. For example, the user interface may comprise a touch screen, or a point-and-click user interface comprising a mouse, track pad, or tracker ball or the like. The user interface is configured to receive inputs from a user of the user device. The user device 102 may also have one or more speakers arranged to output audio signals.
The cancelling service 104 may be performed by a controller of the user device 102. The controller of the user device 102 may be implemented across one or more processors of the user device 102 and be operatively coupled to the microphone(s) and speaker(s) of the user device 102.
Alternatively, the cancelling service 104 may be separate from the user device. For example, the cancelling service 104 may be performed by computing equipment comprising some or all of the resources of a server connected to a network, the server comprising one or more server units at one or more geographic sites.
In embodiments the functionality of the cancelling service 104 (e.g. the server(s)) is implemented in the form of software stored in memory and arranged for execution on a processor (the memory on which the software is stored comprising one or more memory units employing one or more storage media, e.g. EEPROM or a magnetic drive, and the processor on which the software is run comprising one or more processing units). Alternatively it is not excluded that some or all of the functionality of the cancelling service 104 could be implemented in dedicated hardware circuitry, or configurable or reconfigurable hardware circuitry such as an ASIC or a PGA or FPGA. The cancelling service 104 may be connected to the network via a local area network such as a WLAN or a wide area network, such as the internet. Alternatively, the cancelling service 104 may comprise a wired connection to the network, e.g. an Ethernet or DMX connection.
The cancelling service 104 is operatively coupled to the user device 102. For example, the user device 102 may be coupled to the cancelling service 104 via a wired or wireless connection. In this example, the controller of the user device 102 may be operatively coupled to a wireless transceiver for communicating via any suitable wireless medium, e.g. a radio transceiver for communicating via a radio channel (though other forms are not excluded, e.g. an infrared transceiver). The wireless transceiver may comprise a Wi-Fi, Bluetooth, etc. interface for enabling the user device 102 to communicate wirelessly with the cancelling service 104. Additionally or alternatively, the wireless transceiver may communicate with the cancelling service 104 via a wireless router or a server (not shown), for example, over a local area network such as a WLAN or a wide area network such as the internet. The user device 102 may also communicate with the cancelling service 104 via a wired connection.
The matching service 106 may be performed by computing equipment comprising some or all of the resources of a server connected to a network, the server comprising one or more server units at one or more geographic sites. Note that the matching service 106 may be separate from the cancelling service 104 (i.e. different entities). That is, the computing equipment for performing the matching service 106 may be different to the computing equipment for performing the cancelling service 104. It is also not excluded that the servers may be virtual servers (servers implemented by means of different secure enclaves operated on some or all of the same physical hardware).
In embodiments the functionality of the matching service 106 (e.g. the server(s)) is implemented in the form of software stored in memory and arranged for execution on a processor (the memory on which the software is stored comprising one or more memory units employing one or more storage media, e.g. EEPROM or a magnetic drive, and the processor on which the software is run comprising one or more processing units). Alternatively it is not excluded that some or all of the functionality of the matching service 106 could be implemented in dedicated hardware circuitry, or configurable or reconfigurable hardware circuitry such as an ASIC or a PGA or FPGA. The matching service 106 may be connected to the network via a local area network such as a WLAN or a wide area network, such as the internet. Alternatively, the matching service 106 may comprise a wired connection to the network, e.g. an Ethernet or DMX connection. The user device 102 and/or cancelling service 104 may communicate with the matching service 106 via a wired or wireless connection (as described above for the cancelling service 104).
The speech recognition service 108 may be implemented by the controller of the user device 102. Alternatively, the speech recognition service may be implemented by a third party service. That is, the speech recognition service 108 may be performed by computing equipment comprising some or all of the resources of a server connected to a network, the server comprising one or more server units at one or more geographic sites for storing data provided by or collected from the individual. Note that the speech recognition service 108 may be separate from the cancelling service 104 (i.e. different entities) and/or matching service 106. It is also not excluded that the servers may be virtual servers (servers implemented by means of different secure enclaves operated on some or all of the same physical hardware).
In embodiments the functionality of the speech recognition service 108 (e.g. the server(s)) is implemented in the form of software stored in memory and arranged for execution on a processor (the memory on which the software is stored comprising one or more memory units employing one or more storage media, e.g. EEPROM or a magnetic drive, and the processor on which the software is run comprising one or more processing units). Alternatively it is not excluded that some or all of the functionality of the speech recognition service 108 could be implemented in dedicated hardware circuitry, or configurable or reconfigurable hardware circuitry such as an ASIC or a PGA or FPGA. The speech recognition service 108 may be connected to the network via a local area network such as a WLAN or a wide area network, such as the internet. Alternatively, the speech recognition service 108 may comprise a wired connection to the network, e.g. an Ethernet or DMX connection. The user device 102 and/or cancelling service 104 may communicate with the speech recognition service 108 via a wired or wireless connection.
Embodiments of the invention are performed by the cancelling service 104 to remove unknown sounds from a source audio signal. A source signal is detected by one or more microphones of the user device 102. The source audio signal comprises a user voice signal from a user of the user device 102. The user voice signal may be, for example, a voice command, speech as part of a phone call, speech as part of a dictation to be recorded by the user device 102, etc. The source audio signal also includes one or more unknown audio signals that are produced neither by the user nor by the user device 102. That is, an audio source other than the user or the user device 102 produces the unknown audio signals. For example, the unknown audio signals may be produced by an external speaker such as, for example, a hi-fi, a television set, a radio, etc.
An example of an unknown audio signal is a music track played out by an audio system in the same environment as the user device 102. In general, the unknown audio signals may comprise one or more of: audio signals from a music track, audio signals from a video, audio signals from a podcast, audio signals from a television programme, audio signals from a radio channel, etc. The unknown audio signals may include pre-recorded audio signals (e.g. music tracks), or audio signals of a live stream (e.g. a live sports match, a live news broadcast). The audio signals of a live stream are produced in real-time.
The cancelling service 104 is configured to access the matching service 106. The matching service 106 is configured to receive audio signals and match the received audio signals with predetermined audio signals. The matching may be performed using any known audio matching technique such as pattern matching. The predetermined audio signals (i.e. the candidates for matching to the received signal) are predetermined in the sense that they have been produced before the matching service 106 receives the audio signals. The predetermined audio signals include prerecorded audio signals (e.g. music tracks) and previous moments of a live stream (e.g. television programmes, sports matches, etc.). The cancelling service 104 accesses the matching service 106 to obtain a matched audio signal. The matched audio signal is an audio signal from amongst the predetermined audio signals (in other words, the matched audio signal is one of the predetermined audio signals) that matches with at least one of the unknown audio signals present in the source audio signal (and that has been transmitted to the matching service 106). The matched audio signal may be an exact match to an unknown audio signal, or the matched audio signal may differ from the audio signal by less than a threshold amount. Here, a matched audio signal is one that has the same audio content as the unknown audio signal.
The cancelling service 104 is configured to generate a modified audio signal by removing the matched audio signal from the source audio signal. That is, the matched audio signal is subtracted from the source audio signal, thereby removing the unknown audio signal. For example, if the source audio signal consists of the user’s speech and a music track (an unknown audio signal), the music track is obtained from the matching service 106 and removed from the source audio signal, leaving only the user’s speech. As another example, if the source audio signal comprises the user’s speech and a plurality of unknown audio sounds (e.g. a music track and a television programme), a plurality of matched audio signals may be obtained from the matching service 106, each matching with a respective one of the unknown audio signals. Each matched audio signal is removed from the source audio signal, leaving only the user’s speech (along with any remaining audio signals, e.g. background noise such as, for example, a microwave pinging or a bird chirping).
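By way of illustration, the subtraction step described above may be sketched as follows. This is an illustrative Python/NumPy sketch, not the claimed implementation; it assumes the matched signal is already time-aligned, and the `gain` factor modelling the pickup level is a hypothetical addition.

```python
import numpy as np

def remove_matched(source, matched, gain=1.0):
    # `gain` is a hypothetical scale factor modelling the level at which
    # the unknown sound was picked up by the microphone(s).
    n = min(len(source), len(matched))
    return source[:n] - gain * matched[:n]

# Toy example: a "speech" tone plus a scaled "music" tone.
t = np.arange(0, 1, 1 / 8000.0)
speech = np.sin(2 * np.pi * 300 * t)        # stand-in for the user voice signal
music = 0.5 * np.sin(2 * np.pi * 440 * t)   # stand-in for the unknown audio signal
modified = remove_matched(speech + music, np.sin(2 * np.pi * 440 * t), gain=0.5)
print(np.allclose(modified, speech))  # True: only the speech remains
```

In practice the matched signal must first be synchronised and rate-corrected, as discussed later in this description.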
As shown in Figure 1, the cancelling service 104 may receive the source audio signal from the user device 102. That is, the user device 102 may transmit the source audio signal detected by the microphone(s) to the cancelling service 104 via a wired or wireless connection. The source audio signal may be transmitted directly (e.g. internally) from the microphone(s) to the cancelling service 104 if the cancelling service 104 is performed by a controller of the user device 102, or externally over a network if the cancelling service 104 is implemented on a server. In this example, when accessing the matching service 106, the cancelling service 104 transmits at least a portion of the source audio signal to the matching service 106 and, in response, receives back the matched audio signal. The matching service 106 will identify a matched audio signal that matches at least one unknown audio signal in the portion of the source audio signal.
The cancelling service 104 may transmit a (different) portion of the source audio to the matching service 106 at regular intervals (i.e. periodically) or irregular intervals. For example, the cancelling service 104 may transmit portions of the source audio signal to the matching service 106 e.g. every second, every few seconds, etc. Each portion transmitted to the matching service 106 may be a most recently received portion. A most recently received portion may correspond to the portion of source audio signal most recently detected by the microphone(s).
Alternatively, the cancelling service 104 may continuously transmit a portion of the source audio signal to the matching service 106. For example, the cancelling service 104 may forward the source audio signal to the matching service 106 as and when it receives the source audio signal. For instance, the whole of the source audio signal may be transmitted to the matching service 106.
As an optional feature, the cancelling service 104 may determine that a portion of the source audio signal comprises one or more unknown audio signals and send the source audio to the matching service 106 in response to this. For example, the cancelling service 104 may determine that the source audio signal comprises audio signals other than the user's speech or audio signals produced by the user device 102. The source audio signal may be passed through a voice activity detection (VAD) stage, also known as speech activity detection or speech detection, that processes the source audio signal to determine the presence or absence of human speech. The cancelling service 104 may receive the result of the VAD and, based on said result, determine whether to transmit the portion of the source audio signal (e.g. if the source audio signal comprises non-user voice signals or non-user device signals) or not to transmit the portion of the source audio signal (e.g. if the source audio signal consists solely of user-voice or user device 102 audio signals). As another example, the presence of unknown audio signals may be detected based on properties of the source audio signal, e.g. amplitude, frequency, etc. For example, the source audio signal having a frequency outside of that of the human speech range may indicate the presence of an unknown audio signal. As another example, a music detector, e.g. a machine learning classifier trained to recognise music, may be used to identify when the unknown sounds are music tracks.
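The frequency-based detection mentioned above can be sketched with a crude out-of-band energy test. This is purely illustrative: the frame length, band edges (300-3400 Hz) and threshold are assumptions, and a real system would use a proper VAD or a trained music classifier as described above.

```python
import numpy as np

def contains_unknown_audio(frame, sample_rate=16000, energy_threshold=0.01):
    # Flag the frame when significant energy lies outside a rough
    # speech band (300-3400 Hz). Band edges and threshold are illustrative.
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    out_of_band = (freqs < 300) | (freqs > 3400)
    return spectrum[out_of_band].sum() / len(frame) > energy_threshold

t = np.arange(0, 0.032, 1 / 16000.0)  # one 32 ms frame at 16 kHz
print(contains_unknown_audio(np.sin(2 * np.pi * 5000 * t)))  # True: outside speech band
print(contains_unknown_audio(np.sin(2 * np.pi * 1000 * t)))  # False: inside speech band
```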
As is also shown in Figure 1, in variants of the above the user device 102 may be configured to transmit the source audio signal directly to the matching service 106 instead of via the cancelling service 104. The matching service 106 may identify a matching audio signal in response to receiving the source audio signal. The matching service 106 may then transmit the source audio signal and the matched audio signal to the cancelling service 104. In this example, the cancelling service 104 is configured to access the matching service 106 to receive the source audio signal, obtain the matched audio signal and remove said matched audio signal from the received source audio signal.
In either arrangement, the cancelling service 104 may be configured to cause the modified audio signal to be played out through one or more speakers. For example, the cancelling service 104 may output the modified audio signal to be played out by one or more speakers of the user device 102. Additionally or alternatively, the cancelling service 104 may output the modified audio signal to be played out by one or more speakers of a different user device (e.g. a voice-controllable device, television set, audio system, etc.) that has a wired or wireless connection to the cancelling service 104.
Additionally or alternatively, the cancelling service 104 may be configured to output the modified audio signal to a speech recognition service 108 (as shown in Figure 1). The speech recognition service 108 may be configured to receive the modified audio signal and process the modified audio signal to recognise one or more speech commands based on speech of the user in the user voice signal. The speech recognition service 108 may be implemented on the user device 102, on a connected user device, or across one or more servers connected to the cancelling service 104 by a wired or wireless connection. The speech recognition service 108 may be implemented by a voice-controllable device. The speech commands may control the voice-controllable device to perform a variety of tasks such as, for example, home appliance control, online shopping, the setting of alarms, voice calling, etc. The speech recognition service may be, for example, a language translation service that translates the user's speech commands (e.g. conversational speech) into a different language.
If the source audio signal is part of an audio or video call (i.e. the user of the user device 102 is engaged in a phone or video call to a recipient user) the cancelling service 104 may output the modified audio signal as part of that audio or video call to the recipient user’s user device. The audio or video call may be, for example, a voice over Internet Protocol (VoIP) call or a cellular call.
In some embodiments the matching service 106 may be configured to perform the matching by matching received feature vectors to predetermined feature vectors. In this case the version of the source audio signal received by the matching service 106 comprises one or more feature vectors generated from the source audio. The predetermined feature vectors are ones which have been generated based on the predetermined audio signals. The matching service 106 may store a respective one or more feature vectors for each stored predetermined audio signal. A feature vector may be a compressed form of a respective audio signal. For example, the feature vector may be a set of spectral features of the respective audio signal, i.e. a set of features describing signal properties of the audio signal. E.g. as shown in Figure 5, the feature vector may comprise a set of power spectral density components from only low resolution frequency bins, and/or a non-exhaustive set of bins that cover some but not all of the part of the power spectrum of the source signal where the information content lies. The feature vector of an audio signal may be unique to that audio signal.
The cancelling service 104 may be configured to obtain a matched audio signal from the matching service 106 that has a feature vector matching to a feature vector of the unknown audio signal. That is, two audio signals may match if they have the same feature vector. Alternatively, two audio signals may match if their feature vectors are sufficiently similar (i.e. they differ by less than a predetermined threshold). Therefore if a matched audio signal having the same feature vector as the unknown audio signal is removed from the source audio signal, the unknown source signal will not be present in the modified audio signal.
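The notion of feature vectors "differing by less than a predetermined threshold" can be illustrated as follows. The distance metric (relative Euclidean distance) and the threshold value of 0.05 are assumptions for illustration only; the claims do not fix a particular metric.

```python
import numpy as np

def vectors_match(fv_a, fv_b, threshold=0.05):
    # Two feature vectors "match" when their relative Euclidean
    # distance is below a predetermined threshold (illustrative values).
    dist = np.linalg.norm(fv_a - fv_b)
    return dist / (np.linalg.norm(fv_a) + 1e-12) < threshold

fv = np.array([3.0, 1.0, 0.2, 0.0])
print(vectors_match(fv, fv * 1.01))  # True: nearly identical vectors
print(vectors_match(fv, fv[::-1]))   # False: vectors from different signals
```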
The cancelling service 104 may generate a feature vector based on at least a portion of the source audio signal received from the user device 102. The feature vector generated by the cancelling service 104 may comprise the same features of an audio signal as the feature vectors generated by the matching service 106. Note that the feature vectors generated by the cancelling service 104 and the matching service 106 may comprise the same features, but they will only comprise the same values for those features for matching audio signals. The cancelling service 104 may also be configured to transmit the generated feature vectors to the matching service 106. The generated feature vectors may be transmitted to the matching service 106 instead of the portion of the source audio. An advantage of this is that less data may be required to be sent to the matching service 106. A further advantage of this is that by transmitting compressed feature vectors instead of the full source audio signal, the user's privacy may be maintained. This is because the user's voice (and what the user is saying) is not transmitted and thus cannot be listened to by third parties.
In some embodiments, the feature vectors (those accessible by the matching service 106 and/or generated by the cancelling service 104) may comprise a spectral fingerprint, sometimes referred to in the art as an audio fingerprint. The spectral fingerprint comprises a plurality of spectral elements from a spectrogram of at least the portion of the source audio signal. Each predetermined audio signal may have a respective spectrogram, and each predetermined audio signal may have a respective spectral fingerprint comprising spectral elements of the respective spectrogram.
A spectrogram may be generated from an audio signal by taking a frequency domain transform such as a (Fast) Fourier transform of the audio signal. A spectrogram is a time-frequency representation of the audio signal. An audio signal is divided into portions, which may overlap in time, and Fourier transformed to calculate the magnitude of the frequency spectrum for each portion. A Discrete Cosine Transform (DCT) may optionally be performed on the resulting transformed portion. Each portion then corresponds to a vertical line in the image; a measurement of magnitude versus frequency for a specific moment in time (e.g. the midpoint of the portion). These spectra (or time plots) are then laid side by side to form the image representing three dimensions: frequency vs. amplitude vs. time. A spectral fingerprint is made up of a selection of points from the spectrogram. For example, the spectral fingerprint may include only the points that represent peaks in the spectrogram image, e.g. notes that contain "higher energy content" than all the other notes around them. Alternatively, the spectral fingerprint may be a lower resolution (or condensed) version of the spectrogram that contains fewer spectral elements than the total number of spectral elements in the spectrogram.
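A minimal sketch of such a peak-based fingerprint might look like the following. The frame length and the number of peaks kept per frame are illustrative assumptions; production fingerprinting systems use more robust features (windowing, overlapping frames, peak pairing, etc.).

```python
import numpy as np

def spectral_fingerprint(signal, frame_len=256, peaks_per_frame=2):
    # Split into non-overlapping frames, transform each frame, and keep
    # only the indices of the strongest frequency bins ("peaks").
    fingerprint = []
    for i in range(len(signal) // frame_len):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        magnitudes = np.abs(np.fft.rfft(frame))
        top_bins = np.argsort(magnitudes)[-peaks_per_frame:]
        fingerprint.append(tuple(sorted(top_bins)))
    return fingerprint

# Two renditions of the same tone mix, at different levels, share a fingerprint.
t = np.arange(4096) / 16000.0
sig = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 2500 * t)
print(spectral_fingerprint(sig) == spectral_fingerprint(0.8 * sig))  # True
```

Because the fingerprint keeps only bin positions and not magnitudes, it is insensitive to overall playback level, which is one reason peak-based fingerprints are used for matching.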
The cancelling service 104 may be configured to generate a spectrogram of a portion of the source audio signal, and from that spectrogram, generate a spectral fingerprint for the portion of source audio signal. The cancelling service 104 may then transmit the spectral fingerprint to the matching service 106 instead of the original version of the source audio signal, and in response, obtain a matched audio signal having a spectral fingerprint matching that of the spectral fingerprint transmitted to the matching service 106.
For example, the matching service 106 may store spectral fingerprints in a database. The cancelling service 104 may generate a spectral fingerprint from a 10 second portion of the source audio signal, and then transmit the spectral fingerprint to the matching service 106. The matching service 106 can then analyse the received spectral fingerprint and seek a match based on the spectral fingerprints stored in its database. If the matching service 106 identifies a match, it sends the matched audio signal back to the cancelling service 104.
In some embodiments, the cancelling service 104 may first transmit a feature vector (or spectral fingerprint) generated based on a portion of the source audio signal, and in response, receive an indication from the matching service 106 that the matching service 106 requires the portion (or at least some of the portion) of the original source audio signal. For example, the matching service 106 may indicate that it requires additional information to identify a matching audio signal, e.g. from amongst a plurality of possible matches. Here, the user’s privacy may still be maintained to a large extent since only a small portion of the source audio signal may be required to be transmitted to the matching service 106.
In some cases, the matched audio signal obtained from the matching service 106 may not be synchronised with the source audio signal. For example, if the matched audio signal is a music track, the unknown audio signal, when detected by the microphone, may be two minutes into the music track. Therefore if the matched audio signal is removed from the source audio signal, the modified audio signal will sound unnatural as the music track will not have been removed correctly. To avoid this, the cancelling service 104 may be configured to determine a time offset between the matched audio signal and the source audio signal (or the unknown audio signal within the source audio signal). The time offset may be determined by correlating points in the matched audio signal with points in the source audio signal. For instance, a portion of the source audio signal may be identified in the matched audio signal to determine the time difference between the identified portions. As another example, the matching service 106 may transmit the matched audio signal with timestamp information to ensure that the cancelling service 104 knows at what point in the unknown audio signal (e.g. video) the matched audio signal has been matched to.
To address this phenomenon, the cancelling service 104 may be configured to synchronise the matched audio signal with the source audio signal based on the determined time offset between the two signals. The audio signals may be synchronised exactly (e.g. all of their peaks are aligned). The cancelling service 104 then generates the modified audio signal by removing the synchronised matched audio signal from the source audio signal.
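The offset estimation and synchronised removal described above can be sketched with a cross-correlation. This is an illustrative sketch consistent with "correlating points" as described; the toy signals and lengths are assumptions.

```python
import numpy as np

def align_and_remove(source, matched):
    # Estimate the lag maximising the cross-correlation, shift the
    # matched signal by that lag, then subtract it from the source.
    corr = np.correlate(source, matched, mode="full")
    lag = int(np.argmax(corr)) - (len(matched) - 1)
    aligned = np.zeros_like(source)
    if lag >= 0:
        n = min(len(matched), len(source) - lag)
        aligned[lag:lag + n] = matched[:n]
    else:
        n = min(len(matched) + lag, len(source))
        aligned[:n] = matched[-lag:-lag + n]
    return source - aligned, lag

track = np.sin(2 * np.pi * 0.05 * np.arange(1000))  # toy "music track"
source = np.zeros(1500)
source[200:1200] = track                            # track starts 200 samples in
modified, lag = align_and_remove(source, track)
print(lag, np.allclose(modified, 0))  # 200 True
```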
Furthermore, in some cases the matched audio signal obtained from the matching service 106 may be skewed relative to the source audio signal, e.g. there may be clock drift between the source signal and the matched audio signal. That is, the matched audio signal and the source audio signal may have different playback rates. For example, a matched audio signal may be played at 48000 Hz, whilst the source audio signal (or the unknown audio signal) may be played at 48002 Hz. Similarly, the matched audio signal (e.g. a podcast) may have a duration of 3 minutes, whereas the unknown audio signal within the source audio signal may have a duration of 3 minutes 5 seconds. If this is the case, no two consecutive points between the two signals will match exactly in time, i.e. one audio signal will always be slightly ahead of the other. Note that this is only applicable for pre-recorded audio signals since live streams (received by the user device 102 and accessed by the matching service 106) will both be synchronised to the same clock source.
To address this, the cancelling service 104 may be configured to determine a playback rate offset between the matched audio signal and the portion of the audio signal and correlate the matched audio signal with the source audio signal based on the determined playback rate offset. To do this, the two audio signals (matched and source) may be run through an asynchronous sample rate converter (ASRC) to perform a correlation between the source audio signal and the matched audio signal. If one audio signal is one or more samples ahead of the other audio signal, the ASRC may remove those samples. Conversely, samples may be added if one audio signal is one or more samples behind the other audio signal. The two audio signals should have corresponding peaks in their signals. When the peaks start drifting in the positive direction, the ASRC is instructed to run a bit slower; when the peaks start drifting in the negative direction, the ASRC is instructed to run a bit faster. The cancelling service 104 then generates the modified audio signal by removing the correlated matched audio signal from the source audio signal.
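A minimal stand-in for the ASRC step is a resampler that stretches or compresses the matched signal by the estimated rate ratio. The linear interpolation and the 48002/48000 example ratio below are illustrative; a real ASRC would use higher-order interpolation and adapt the ratio continuously as described above.

```python
import numpy as np

def correct_drift(matched, rate_ratio):
    # Linear-interpolation resampler: stretches or compresses the
    # matched signal by `rate_ratio` (e.g. 48002/48000) to undo clock drift.
    n_out = int(round(len(matched) / rate_ratio))
    positions = np.arange(n_out) * rate_ratio  # fractional sample positions
    return np.interp(positions, np.arange(len(matched)), matched)

# One second of a 440 Hz tone captured at 48002 Hz vs a 48000 Hz reference.
reference = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000.0)
drifted = np.sin(2 * np.pi * 440 * np.arange(48002) / 48002.0)
corrected = correct_drift(drifted, 48002 / 48000)
print(np.max(np.abs(corrected - reference[:len(corrected)])) < 0.01)  # True
```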
If the unknown audio signal is, for example, a music track, video, podcast, radio channel, etc., the unknown audio signal will at some point in time change, e.g. to a different music track. If the source audio signal detected by the microphone(s) includes a different unknown audio signal, removing the current matched audio signal would not result in an improved modified audio signal. Therefore, the cancelling service 104 may be configured to determine an error value representing the difference between the (current) matched audio signal and the source audio signal. If the error value is above a predetermined maximum threshold value, the cancelling service 104 may obtain a new matched audio signal from the matching service 106. The cancelling service 104 may then remove the new matched audio signal from the source audio signal to generate a new modified audio signal.
The cancelling service 104 may determine the error as the difference between the source audio signal and the matched audio signal. The error value may represent the ratio of the matched audio signal to the source audio signal. If the cancelling service 104 is performing well, i.e. removing a correctly matched audio signal from the source audio signal, the error should be negative as measured in dB. Therefore if the determined error becomes positive (i.e. error is being added to the source audio signal), the cancelling service 104 can infer that a new matched audio signal is required. For example, the source audio signal may change suddenly if the user changes a TV channel. At this point the error would increase substantially. The increase in error is detected and in response a new matched audio signal (matching the new TV channel) is obtained.
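The error measured in dB can be sketched as a residual-to-source power ratio. The exact metric is a design choice; this sketch simply illustrates why a correct match yields a negative value and a wrong match a positive one.

```python
import numpy as np

def cancellation_error_db(source, matched):
    # Residual-to-source power ratio in dB: negative when cancellation
    # removes energy, positive when a wrong match adds energy.
    residual = source - matched
    return 10 * np.log10(np.sum(residual ** 2) / np.sum(source ** 2))

tone = np.sin(2 * np.pi * 440 * np.arange(4800) / 48000.0)
good = cancellation_error_db(tone, 0.99 * tone)  # close match: about -40 dB
bad = cancellation_error_db(tone, -tone)         # wrong match: about +6 dB
print(good < 0 < bad)  # True
```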
The determined error may also be used to detect that a matched audio signal has been incorrectly synchronised with the source audio signal. For example, if the unknown audio signal is an audio signal that comprises two or more repeated portions, the matched audio signal may be synchronised with the source audio signal at the wrong portion. For example, if the unknown audio signal is a music track comprising the same chorus repeated twice or more throughout the music track, the first chorus of the matched audio signal may be synchronised with the second chorus of the unknown audio signal. The determined error between the matched audio signal and the source audio signal will increase when the chorus of each signal changes to the following, different verses. In response to detecting the error, the cancelling service 104 may determine that a new matched audio signal is not required, and instead may resynchronise the two audio signals, e.g. by skipping forward or backward through the matched audio signal. The cancelling service 104 may report this to the matching service 106 so that the matching service 106 can learn which audio signals have matched with the unknown audio signals.
As an optional feature, the cancelling service 104 may be configured to obtain several candidate audio signals from the matching service 106. The matching service 106 may have identified the candidate audio signals as audio signals that may match the unknown audio signal. However, the matching service 106 may not have been able to identify a match for the unknown audio signal, e.g. because the unknown audio signal has not been detected clearly by the microphone(s) or because the unknown audio signal is similar to two or more of the predetermined audio signals. Therefore the matching service 106 may transmit the candidate audio signals to the cancelling service 104, thereby allowing the cancelling service 104 to identify which of the candidate audio signals matches the unknown audio signal. For example, error analysis similar to that described above may be performed for each of the audio signals, i.e. the error between each candidate audio signal and the unknown audio signal may be determined, with the candidate audio signal having the smallest error being identified as the matched audio signal. Alternatively, the cancelling service 104 may have access to a playlist of audio signals that is currently being played out by a speaker in the environment of the user device 102. The cancelling service 104 may compare the candidate audio signals with those in the playlist of audio signals to identify which candidate audio signal is most likely to match the unknown audio signal (e.g. the one that appears next in the playlist of audio signals).
The cancelling service 104 may retrieve the playlist of audio signals (e.g. songs), e.g. from the user device 102, and transmit the playlist of audio signals to the matching service 106 to allow the matching service 106 to identify candidate audio signals from the playlist. Note that although the playlist is retrieved from the user device 102, the unknown audio signals are not played out by the user device 102. For instance, the playlist may indicate that song A is followed by song B. If song A is a song by Elvis, the matching service 106 may assume that song B is not a song by Madonna. Therefore the matching service 106 can rule out candidate audio signals and/or predict candidate audio signals. The predicted candidate audio signals can be transmitted to the cancelling service 104 ahead of the next unknown audio signal so that the cancelling service 104 can immediately (or at least more quickly) remove the matched audio signal (one of the candidate audio signals) from the source audio signal when generating the modified audio signal. The cancelling service 104 may not be able to immediately remove the matched audio signal since two or more playlists containing song A may be playing. However, if the cancelling service 104 has access to the next song from each playlist, when the next song does begin to play (and is detected by the microphone(s)), that song can be cancelled from the source audio signal.
The cancelling service 104 may report the best matching audio signal (i.e. the candidate audio signal most likely to match the unknown audio signal) to the matching service 106. That way, the matching service 106 may learn which candidate best matched the unknown signal, thus reducing the number of candidates the matching service 106 sends to the cancelling service 104 in the future. In other words, the cancelling service 104 may transmit a message to the matching service 106 that tells the matching service 106 which one of the candidate audio signals was the matching audio signal and/or which ones of the candidate audio signals was not the matching audio signal. The message may, for instance, indicate which candidate had the smallest error value. The message may include the determined error value for each of the candidate audio signals.
In some examples, the matching service 106 may learn the playlist of audio signals without actually being sent the playlist. For example, the matching service 106 may learn and remember that the user device (or one or more of the user's other user devices) has in the past played song C e.g. 45% of the time straight after the end of an audio signal matched as song D has finished playing.
In some examples, the cancelling service 104 may run two or more echo cancellers in parallel, each removing a different one of the candidate audio signals from the source audio signal to generate a respective modified audio signal. The modified audio signal having the smallest error value may be output e.g. to the speakers of a user device 102, or to the speech recognition service etc.
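The parallel-canceller selection described above can be sketched as follows (sequentially, for simplicity). The candidate names and toy tone signals are assumptions for illustration; only the "remove each candidate and keep the smallest-error result" logic reflects the description above.

```python
import numpy as np

def best_candidate(source, candidates):
    # Remove each candidate in turn and keep the result with the
    # smallest residual energy (the "error value" described above).
    best = None
    for name, cand in candidates.items():
        residual = source - cand[:len(source)]
        err = float(np.sum(residual ** 2))
        if best is None or err < best[2]:
            best = (name, residual, err)
    return best[0], best[1]

t = np.arange(1000) / 8000.0
speech = np.sin(2 * np.pi * 200 * t)
track_a = np.sin(2 * np.pi * 440 * t)   # the track actually playing
track_b = np.sin(2 * np.pi * 550 * t)   # a wrong candidate
name, modified = best_candidate(speech + track_a, {"A": track_a, "B": track_b})
print(name, np.allclose(modified, speech))  # A True
```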
Figure 2 illustrates an example system for performing some embodiments of the invention. The user device 102 comprises microphones 202 which detect the source audio signal and pass the source audio signal to a traditional echo canceller 204. The echo canceller 204 then forwards the source audio signal (minus any cancelled echoes) to the cancelling service 104 (labelled "unknown sound canceller" in Figure 2). The cancelling service 104 transmits a portion of the source audio signal comprising unknown audio signals to a matching service 106 (labelled "sound recognition" in Figure 2) via a cloud interface 206. Whilst the cloud interface 206 is shown after the keyword spotter 208 in the example of Figure 2, the cloud interface 206 could be positioned elsewhere (e.g. anywhere in Figure 2). For example, the cloud interface 206 may be positioned between the unknown sound canceller 104 and the keyword spotter 208. As another example, the cloud interface 206 may be positioned between the speech recognition service 108 and the music and sound streaming service 210. The matching service 106 streams a matched audio signal back to the cancelling service 104, which can then generate the modified audio signal by removing the matched audio signal from the source audio signal. The modified audio signal may be passed to a keyword spotter 208 or other speech recognition algorithm 108 (e.g. via the cloud interface 206). The speech recognition algorithm 108 may detect speech inputs or commands in the user voice signal and pass the speech to a cloud service such as, for example, a music streaming service 210, or an online shopping service, etc. The modified audio signal may be passed to the speakers 212 of the device for play out. Alternatively, an action resulting from the user's speech commands may be played out by the speakers 212. For example, a requested song may be played out. One example of an action that might be possible as a result of this is a request for a small device (e.g. a device with an intelligent assistant) to "join in", or similar, which would cause the user device 102 to output on its speakers a time-matched output that boosts the audio output of another device (e.g. the device outputting the unknown audio signal). The cancelling service 104 may synchronise the output of the modified audio signal with the output of the unknown audio signal based on, for example, timestamps in the matched audio signal, a determined playback rate of each of the matched and unknown audio signals, and/or the offset and/or skew of those signals. Hence a smart TV or similar user device 102 could be used to improve the audio (e.g. boost the audio) provided by a traditional radio simply by asking it to "join in".
Although not shown in Figure 2, the source audio signal may be transmitted to the matching service (or "sound recognition") 106 from one or more echo cancellers 204 rather than the unknown sound canceller 104. That is, the cancelling service may comprise the one or more echo cancellers 204. In this example, the matched audio signal may be removed from the source audio signal output from the one or more echo cancellers 204. That is, the matched audio signal may be removed from a version of the source audio signal that has been passed through the echo cancellers 204.
Where the unknown audio signal source is not just mono, a trade-off can be made between modelling and removing multiple sound sources (e.g. left and right) and removing a single pre-mixed sound source (e.g. L+R). The latter will not perform as well as the former, but it requires less processing and less bandwidth.
Figure 3 illustrates an example of how the cancelling service 104 may remove unknown audio signals from a source audio signal. Optionally, in this example, the cancelling service 104 comprises a beam steering or beam forming component (e.g. algorithm) 302 for directional signal transmission or reception (i.e. spatial selectivity) of audio signals. This may be achieved by combining elements in the microphone 202 and/or speaker 212 array in such a way that signals at particular angles experience constructive interference while others experience destructive interference. However, beam steering is not essential in all embodiments.
The cancelling service 104 may have one or more unknown signal cancellers 304a, 304b. These may be located ahead of the traditional beam steering algorithm in the case where beam steering is used. The LMS algorithm described above may be used for each of the unknown signal cancellers 304. This is the same algorithm as may be used for traditional echo cancelling, except that the cancelling service 104 will use the matched audio signal obtained from the matching service 106 as its input x, the output from the traditional echo canceller e as its input y, and learn an FIR H that models the acoustic path from the unknown audio signal source to each microphone 202. Once the unknown audio signal is recognised, the cancelling service 104 will continue to cancel the matched audio signal until a change in circumstances (e.g. a change in music track). The H matrices (which describe the physics of the environment in which the user device 102 is situated) can be retained to quickly initialise the cancelling service 104 for the next unknown audio signal (e.g. the next music track).
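The adaptive cancellation just described can be sketched as a short filter loop. This is a minimal normalised-LMS (NLMS) sketch under assumed names that follow the text (x is the matched reference signal, y the observed signal, h the learned FIR H); it is an illustration, not the implementation of the embodiments.

```python
import numpy as np

def nlms_cancel(x, y, taps=4, mu=0.5, eps=1e-8):
    """Cancel the reference signal x from the observation y.

    Learns an FIR filter h modelling the acoustic path from the unknown
    audio source to the microphone, and returns the residual
    e[t] = y[t] - sum_i h[i] * x[t - i] together with the learned filter.
    """
    h = np.zeros(taps)
    e = np.zeros(len(y))
    for t in range(len(y)):
        # Most-recent-first window of the reference signal
        xt = np.zeros(taps)
        recent = x[max(0, t - taps + 1):t + 1][::-1]
        xt[:len(recent)] = recent
        y_hat = h @ xt                          # predicted unknown-audio component
        e[t] = y[t] - y_hat                     # residual after cancellation
        h += mu * e[t] * xt / (xt @ xt + eps)   # normalised LMS update
    return e, h
```

After convergence the residual e contains what is left of y once the reference has been removed, which is the role of the modified audio signal in the text.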
Figure 4 illustrates an example state diagram for learning the FIR filters. Initially the H matrices are set to zero, and portions of the source audio signal are transmitted to the matching service 106. The cancelling service 104 obtains a matched signal from the matching service 106 and uses it to learn the FIR for the environment. This allows the cancelling service 104 to remove the unknown audio signals (Σ_{i=0}^{N−1} H_i·x_{t−i}). If the error resulting from the LMS equation is too large, the cancelling service 104 obtains a new matched audio signal from the matching service 106.
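The state behaviour of Figure 4 (obtain a match, cancel, re-match when the error grows, retain H between tracks) could be sketched as follows. The class and method names are illustrative assumptions, not taken from the patent.

```python
from enum import Enum, auto

class State(Enum):
    MATCHING = auto()     # waiting for a matched signal from the matching service
    CANCELLING = auto()   # removing the matched signal using the learned FIR

class CancellingService:
    def __init__(self, error_threshold=0.1, taps=16):
        self.state = State.MATCHING
        self.h = [0.0] * taps          # H initialised to zero
        self.threshold = error_threshold

    def on_matched_signal(self, matched):
        """A matched audio signal has arrived: start (or resume) cancelling."""
        self.matched = matched
        self.state = State.CANCELLING

    def on_lms_error(self, error):
        """If the LMS error grows too large (e.g. on a track change), request
        a new match. H is retained to warm-start the next unknown signal."""
        if self.state is State.CANCELLING and error > self.threshold:
            self.state = State.MATCHING
            return "request_new_match"
        return None
```

A small error keeps the service cancelling; a large one transitions it back to the matching state while keeping the learned environment model.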
As an optional feature, a fee may be charged for the service of supplying a matching sound to be removed. For example, the user of the cancelling service may have to pay a fee in order to access the matching service to obtain a matched audio signal. The fee may, for example, be a fee per matched audio signal, a fee per audio stream, or a time-based fee (e.g. a weekly, monthly, or yearly fee, etc.).
It will be appreciated that the above embodiments have been described by way of example only. Other applications or variants of the disclosed techniques may become apparent to a person skilled in the art once given the disclosure herein. The scope of the present disclosure is not limited by the above-described embodiments but only by the accompanying claims.

Claims (30)

1. A method of processing a source audio signal detected by one or more microphones of a user device, wherein the source audio signal comprises a user voice signal from a user of the user device and one or more unknown audio signals, the unknown audio signals being audio signals that are not produced by the user nor the user device, wherein the method comprises:
accessing a matching service for matching received audio to predetermined audio signals;
wherein said accessing of the matching service comprises obtaining, from the matching service, a matched audio signal matched with at least one of the unknown signals from amongst said predetermined audio signals; and generating a modified audio signal, wherein generating the modified audio signal comprises removing the matched audio signal from the source audio signal.
2. A method according to claim 1, comprising:
receiving the source audio signal from the user device; wherein said accessing comprises transmitting at least a portion of the source audio signal to the matching service and receiving back the matched audio signal in response.
3. A method according to claim 1, wherein said accessing also comprises receiving the source audio signal from the matching service, and wherein said removing comprises removing the matched audio signal from the received source audio signal.
4. A method according to claim 2, wherein said transmitting comprises: periodically transmitting at least a portion of the source audio signal to the matching service; or continuously transmitting at least a portion of the source audio signal to the matching service.
5. A method according to claim 2, comprising:
determining that at least the portion of the source audio signal comprises one or more unknown audio signals, wherein said transmitting comprises transmitting at least said portion of the source audio signal to the matching service in response to determining that said portion of the source audio signal comprises one or more unknown audio signals.
6. A method according to any preceding claim, comprising:
outputting the modified audio signal to be played out through one or more speakers, wherein said one or more speakers are part of said user device.
7. A method according to claim 6, comprising:
outputting, in response to a request from the user, the modified audio signal to be played out through the one or more speakers in synchronisation with the source of the unknown audio signals.
8. A method according to any preceding claim, comprising:
outputting the modified audio signal to a speech recognition service to cause one or more speech commands to be recognised based on speech of the user in the user voice signal.
9. A method according to any preceding claim, wherein the source audio signal is part of an audio or video call, wherein the method comprises outputting the modified audio signal as part of said audio or video call to a user device of a recipient user.
10. A method according to any preceding claim, wherein the unknown audio signals comprise one or more of: (i) audio signals from a music track, (ii) audio signals from a video, (iii) audio signals from a podcast, (iv) audio signals from a television programme, and/or (v) audio signals from a radio channel.
11. A method according to any preceding claim, wherein the at least one unknown audio signal is a pre-recorded audio signal.
12. A method according to any of claims 1 to 10, wherein the at least one unknown audio signal is a live stream.
13. A method according to any preceding claim, wherein the matching service is for matching received feature vectors to predetermined feature vectors generated based on predetermined audio signals; and wherein said obtaining the matched audio signal comprises obtaining, from the
matching service, the matched audio signal having a feature vector, matched with a feature vector generated based on said at least one unknown audio signal, from amongst said predetermined feature vectors.
14. A method according to claim 2 and claim 13, comprising:
generating the generated feature vector based on at least the portion of the source audio signal; and wherein said transmitting at least the portion of the source audio signal to the matching service comprises transmitting the generated feature vector to the matching service.
15. A method according to claim 13 or claim 14, wherein the generated feature vector comprises a spectral fingerprint comprising a plurality of spectral elements from a spectrogram of at least the portion of the source audio signal.
16. A method according to claim 15, wherein the spectral fingerprint is a condensed form of the spectrogram, wherein the plurality of spectral elements is less than a total number of spectral elements in the spectrogram.
17. A method according to claim 15 or claim 16, comprising:
generating the spectrogram of the at least the portion of the source audio signal; and generating the spectral fingerprint based on the generated spectrogram.
18. A method according to any preceding claim, wherein said generating of the
modified audio signal comprises:
determining a time offset between the matched audio signal and the portion of the source audio signal; and based on the determined time offset, synchronising the matched audio signal with the source audio signal, wherein said removing of the matched audio signal from the source audio signal comprises removing the synchronised matched audio signal from the source audio signal.
19. A method according to any preceding claim, comprising:
determining a playback rate offset between the matched audio signal and the portion of the source audio signal;
correlating the matched audio signal with the source audio signal by adjusting the playback rate of the matched audio signal based on the determined playback rate offset;
wherein said removing of the matched audio signal from the source audio signal comprises removing the correlated matched audio signal from the source audio signal.
20. A method according to any preceding claim when dependent on claim 2 or claim 3, comprising:
determining an error value representing the difference between the matched audio signal and the source audio signal; and in response to determining that the error value is greater than a threshold value, obtaining, from the matching service, a new matched audio signal matched with a different one of the unknown signals from amongst said predetermined audio signals; and generating a new modified audio signal, wherein generating the new modified audio signal comprises removing the new matched audio signal from the source audio signal.
21. A method according to any preceding claim, wherein the matching service is configured to access a database of predetermined audio signals to identify the matched audio signal.
22. A method according to claim 21, wherein said accessing of the matching service comprises obtaining, from the matching service, a plurality of candidate audio signals, wherein the candidate audio signals have been identified as candidates from amongst the database of predetermined audio signals for matching with the at least one unknown audio signal; and wherein generating the modified audio signal comprises: identifying one of the candidate audio signals as the matched audio signal; and removing the matched audio signal from the source audio signal.
23. A method according to claim 22, comprising:
transmitting a playlist of audio signals to the matching service, wherein the candidate audio signals are identified from the playlist of audio signals, and wherein said identifying of one of the candidate audio signals as the matched audio signal is based on the candidate audio signals identified from the playlist of audio signals.
24. A method according to claim 22 or claim 23, comprising:
determining a respective error value for each of the candidate audio signals, wherein the respective error value represents a difference between a respective candidate audio signal and the source audio signal;
wherein said identifying of one of the candidate audio signals as the matched audio signal comprises identifying the candidate audio signal having the smallest respective error value.
25. A method according to any of claims 22 to 24, comprising:
transmitting a message to the matching service, wherein the message indicates, for one or more of the candidate audio signals, whether that candidate audio signal was identified as the matched audio signal.
26. A method according to any of claims 22 to 23, comprising:
the matching service learning, based on the candidate audio signals obtained by the cancelling service, a sequence list of audio signals.
27. A method according to any preceding claim, wherein said accessing of the matching service for matching received audio to predetermined audio signals requires at least one payment criterion to have been met.
28. A controller for processing a source audio signal detected by one or more microphones of a user device, wherein the source audio signal comprises a user voice signal from a user of the user device and one or more unknown audio signals, the unknown audio signals being audio signals that are not produced by the user nor the user device, wherein the controller is configured to:
access a matching service for matching received audio to predetermined audio signals to obtain, from the matching service, a matched audio signal matched with at least one of the unknown signals from amongst said predetermined audio signals; and generate a modified audio signal, wherein generating the modified audio signal comprises removing the matched audio signal from the source audio signal.
29. A system for processing a source audio signal detected by one or more microphones of a user device, wherein the source audio signal comprises a user voice signal from a user of the user device and one or more unknown audio signals, the unknown audio signals being audio signals that are not produced by the user nor the
user device, wherein the system comprises:
a matching service configured to match received audio signals to predetermined audio signals; and a cancelling service configured to:
access a matching service for matching received audio to predetermined audio
signals to obtain, from the matching service, a matched audio signal matched with at least one of the unknown signals from amongst said predetermined audio signals; and generate a modified audio signal, wherein generating the modified audio signal comprises removing the matched audio signal from the source audio signal.
30. A computer program for processing a source audio signal detected by one or more microphones of a user device, wherein the source audio signal comprises a user voice signal from a user of the user device and one or more unknown audio signals, the unknown audio signals being audio signals that are not produced by the user nor the user device, wherein the computer program comprises instructions embodied on computer-readable storage and configured so as, when the program is executed by a computer of a cancelling service, to cause the computer to perform operations of: accessing a matching service for matching received audio to predetermined
audio signals;
wherein said accessing of the matching service comprises obtaining, from the matching service, a matched audio signal matched with at least one of the unknown signals from amongst said predetermined audio signals; and generating a modified audio signal, wherein generating the modified audio signal comprises removing the matched audio signal from the source audio signal.
GB1812289.5A 2018-07-27 2018-07-27 Processing audio signals Withdrawn GB2575873A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1812289.5A GB2575873A (en) 2018-07-27 2018-07-27 Processing audio signals


Publications (2)

Publication Number Publication Date
GB201812289D0 GB201812289D0 (en) 2018-09-12
GB2575873A true GB2575873A (en) 2020-01-29

Family

ID=63518176

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1812289.5A Withdrawn GB2575873A (en) 2018-07-27 2018-07-27 Processing audio signals

Country Status (1)

Country Link
GB (1) GB2575873A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11741933B1 (en) * 2022-03-14 2023-08-29 Dazn Media Israel Ltd. Acoustic signal cancelling
US11823514B1 (en) 2022-08-09 2023-11-21 Motorola Solutions, Inc. Method and apparatus for allowing access through a controlled-access point

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140067385A1 (en) * 2012-09-05 2014-03-06 Honda Motor Co., Ltd. Sound processing device, sound processing method, and sound processing program
US20150112672A1 (en) * 2013-10-18 2015-04-23 Apple Inc. Voice quality enhancement techniques, speech recognition techniques, and related systems
WO2015124211A1 (en) * 2014-02-24 2015-08-27 Widex A/S Hearing aid with assisted noise suppression





Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)