WO2022106691A1 - Audio signal processing systems and methods
- Publication number
- WO2022106691A1 (PCT/EP2021/082520)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- signal input
- audio
- audio signal
- output
- frequency
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
Definitions
- The AI model may receive as input a down-sampled frequency-domain input. The input may be down-sampled using known methods such as Mel-frequency binning to a fixed number of bins.
- The AI model outputs the attenuation value (i.e. how loud each frequency band should be) in the reconstructed signal.
- Step 300 is split into two steps, 300a and 300b.
- At step 300a, which represents a pre-processing step, the STFT magnitude is resampled into Mel space using 20 bins. The resulting down-sampled spectrum is shown in the plot of Figure 1B.
- At step 300b, the neural network acts to provide the AI model output.
- The phase is not processed by the AI model, but is used unaltered at step 500 (via the second signal path) to reconstruct the audio signal, e.g. using an inverse short-time Fourier transform (ISTFT).
- Figure 1D illustrates an example model used in step 300 to process the Mel-binned (resampled) STFT signal and produce the confidence values for 20 frequency bands. Each output corresponds to a frequency band on the STFT magnitude shown in Figure 1D.
- An input layer with 20 neurons, one neuron for each Mel-binned value or frequency band, is provided. The layer may be a recursive layer.
- Hidden layers, usually 2-3 layers deep, are provided with 3-4 times the width of the input. The first 1-2 layers are a type of recursive layer.
- The output layer also has 20 neurons, one neuron for each Mel-binned value or frequency band. The output of step 300 is then sent to steps 400 and 600, respectively.
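The layer sizes described above can be sketched as a plain forward pass. This is only a shape-level illustration with random, untrained weights: a real implementation would use recurrent first layers as stated above and load trained parameters, and all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    # Hypothetical randomly-initialised weights; a deployed model
    # would load trained parameters instead.
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

# 20 inputs (one per Mel bin) -> hidden layers at 4x width -> 20 outputs.
layers = [dense(20, 80), dense(80, 80), dense(80, 20)]

def forward(mel_bins):
    h = mel_bins
    for i, (w, b) in enumerate(layers):
        h = h @ w + b
        # ReLU on hidden layers; sigmoid on the output layer so each of
        # the 20 values is a per-band confidence/gain in (0, 1).
        h = np.maximum(h, 0) if i < len(layers) - 1 else 1 / (1 + np.exp(-h))
    return h

gains = forward(rng.standard_normal(20))
```

Each of the 20 outputs maps to one frequency band, matching the per-band attenuation values described above.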
- Figure 1E shows an example of the frequency bands that each neuron's output is applied to. For example, all values within the B1 frequency band envelope are multiplied by the first output neuron, all within B2 are multiplied by the second neuron's output, and so on. These envelopes may be used for down-sampling the signal from the 129 magnitude values that result from a 256-point STFT down to the 20 bins required by the neural network input.
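The band envelopes can be illustrated with an overlapping triangular filterbank that reduces the 129 STFT magnitude values to 20 bins. For simplicity this sketch spaces the bands linearly; true Mel-frequency binning would warp the band edges toward low frequencies, and the function name is hypothetical.

```python
import numpy as np

def band_envelopes(n_bins=129, n_bands=20):
    """Overlapping triangular envelopes B1..B20 over the 129 STFT bins.
    Linearly spaced here for simplicity; Mel spacing would warp the
    band edges toward low frequencies."""
    edges = np.linspace(0, n_bins - 1, n_bands + 2)
    env = np.zeros((n_bands, n_bins))
    k = np.arange(n_bins)
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        up = (k - lo) / (mid - lo)        # rising edge of the triangle
        down = (hi - k) / (hi - mid)      # falling edge of the triangle
        env[b] = np.clip(np.minimum(up, down), 0, 1)
    return env

env = band_envelopes()
mag = np.abs(np.random.default_rng(1).standard_normal(129))
mel = env @ mag   # 20 down-sampled values for the network input
```

The same envelopes can then be reused when applying the 20 network outputs back onto the 129 STFT bins.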
- At step 400, the frequency gains are applied to the frequency-domain representation of the original signal via simple multiplication of each respective frequency band by its respective gain.
- The plot in Figure 1B corresponding to step 400 shows an example of STFT magnitude values multiplied by the AI model outputs received from step 300 (or 300a and 300b).
- Each gain may correspond to a range of frequencies.
- The frequency ranges are chosen at system implementation time depending on the requirements and usually match up with the down-sampling at step 200.
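The multiplication at step 400 can be sketched as follows, with each of the 129 STFT bins assigned to one of 20 contiguous frequency ranges. The band layout here is an assumption made for illustration; in practice it should match the down-sampling envelopes used at step 200.

```python
import numpy as np

def apply_band_gains(mag, gains):
    """Multiply each STFT magnitude bin by the gain of the frequency
    band it falls in. Bands are contiguous ranges here; they should
    match the down-sampling used at step 200."""
    n_bins, n_bands = len(mag), len(gains)
    band_of_bin = np.minimum((np.arange(n_bins) * n_bands) // n_bins,
                             n_bands - 1)
    return mag * gains[band_of_bin]

mag = np.ones(129)
gains = np.zeros(20)
gains[0] = 1.0   # keep only the lowest frequency band
out = apply_band_gains(mag, gains)
```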
- At step 500, the signal is reconstructed into the time domain, in this example using an inverse short-time Fourier transform (ISTFT).
- The AI model output is used to identify desired signal components in the output signal.
- A finite or infinite impulse response (FIR/IIR) filter is provided, the filter being dynamically updated to remove any remaining noise in the signal to enhance the overall output.
- The processed signal (output) is provided to an output system at step 900.
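The reconstruction at step 500 can be sketched as an inverse STFT: each frame's magnitude is recombined with the unaltered phase, inverse-transformed, and overlap-added back into a time signal. This is a bare editorial sketch; a practical ISTFT would also apply a synthesis window satisfying the constant-overlap-add condition.

```python
import numpy as np

def istft(mag, phase, n_fft=256, hop=128):
    """Inverse STFT: recombine magnitude with the (unaltered) phase,
    inverse-FFT each frame, and overlap-add into a time signal."""
    spec = mag * np.exp(1j * phase)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros((len(frames) - 1) * hop + n_fft)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + n_fft] += f
    return out

# Round trip: forward rFFT per frame, then reconstruct.
rng = np.random.default_rng(2)
frames = rng.standard_normal((4, 256))
spec = np.fft.rfft(frames, axis=1)
y = istft(np.abs(spec), np.angle(spec))
```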
- Figure 1C shows an example implementation of secondary filtering methods, spanning steps 600 to 900. It will be appreciated that there are several possible implementations of the processing path 600-900.
- The neural network outputs and a simple thresholding mechanism are used to determine whether each band contains useful signals or not (at step 600).
- It is determined whether a frequency band has a confidence level of less than or equal to 0.3; if so, the frequency band is identified as one to be filtered out (an undesired frequency band).
- At step 700, a filter matching the desired frequency response is provided, using any desired method such as Wiener filter design, least-squares filter design, etc. For example, FIR filter parameters are calculated that match the frequency response required to filter out all of the frequency bands identified to be filtered out.
- The constructed filter from step 700 is then applied at step 800 to the reconstructed signal from step 500, and the filtered signal is output at step 900.
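One lightweight way to obtain FIR parameters matching such a response is the frequency-sampling method: set each bin's desired gain to 0 or 1 from the confidence threshold, inverse-FFT, then shift and window the result. This is an editorial sketch (the Wiener and least-squares designs mentioned above are alternatives); the function name and the 129-bin layout are assumptions, while the 0.3 threshold follows the example in the text.

```python
import numpy as np

def fir_from_band_mask(confidence, n_bins=129, threshold=0.3):
    """Frequency-sampling FIR design: bands whose confidence is at or
    below the threshold get desired gain 0, the rest gain 1; the taps
    are the shifted, windowed inverse real FFT of that response."""
    keep = (confidence > threshold).astype(float)
    band_of_bin = np.minimum(
        (np.arange(n_bins) * len(confidence)) // n_bins,
        len(confidence) - 1)
    desired = keep[band_of_bin]            # per-bin target response
    taps = np.fft.irfft(desired, n=2 * (n_bins - 1))
    # Centre the impulse response and window it to reduce ripple.
    taps = np.roll(taps, n_bins - 1) * np.hanning(2 * (n_bins - 1))
    return taps

conf = np.full(20, 0.9)
conf[10:] = 0.2           # top ten bands identified as undesired
taps = fir_from_band_mask(conf)
```

Applying `taps` with ordinary convolution then attenuates the undesired bands in the reconstructed time-domain signal.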
- The process 1000 of Figure 1A can advantageously identify bands in the frequency spectrum of the input signal 100 which contain useful audio information, such as voice or any desired audio signal.
- The identified frequency bands which are deemed to be useful are then kept, whilst the frequency bands containing information deemed to be noise are discarded by attenuating their respective frequency bands in the frequency-domain representation.
- This approach is advantageous because it allows for live filtering of the input audio signal with very little knowledge of the exact spectral densities of the desired signal.
- The approach is also lightweight enough to work on embedded hardware such as microcontrollers or Field-Programmable Gate Arrays (FPGAs).
- The output of the neural network 300 can also be used as a reliable voice activity detection (VAD) signal, which can then be used in conjunction with more traditional filtering, such as a Wiener filter, to enhance the quality of the processed sound.
- Figure 2 illustrates an audio input storage and data collection process 2000 for the device.
- A storage unit 21 is present on the device and may be in communication with a computer when the device is charging. While the device is charging, the encrypted audio data is uploaded to a server while software updates are downloaded to the device. The recorded audio data can be accessed for monitoring purposes and to improve and retrain the models.
- Audio samples are recorded onto the device storage unit 21 during daily operation.
- The device may be in communication with an on-site server 22 and upload the audio data to the server.
- The on-site server 22 is configured in this example to select the most distinct audio samples and upload them to an off-site server or cloud server 23.
- The off-site/cloud server 23 integrates the new data into a training dataset and may employ additional AI models to train with the new dataset 24.
- The retrained model is then dispatched back to the deployed devices (i.e. via devices 23, 22 and 21 in sequence).
- Mass storage facilitates the collection of data which can be uploaded to a server location and used to train and improve the algorithms.
- Figure 3 illustrates an example embodiment of a system 3000 representing a hearing protection headset.
- The system comprises two microphones 1 for audio sampling, although it will be appreciated that the number of microphones may vary.
- The system can further comprise a sound pressure level measurement device (not shown).
- Speakers 2 are provided for replaying the processed audio in real time. The number of speakers may vary.
- The system further comprises a controller unit 3 comprising a controller (i.e. a processor) and other electronic peripherals required to operate the system, such as a memory storage unit.
- The mass storage may take the form of a Micro SD card included within the device.
- An LED indicator 4 is also provided to indicate, for example, when the headset 3000 is gathering audio samples.
Abstract
An audio signal detection system comprising a signal processing unit, the system comprising a receiver for receiving an audio signal input having a plurality of audio frequency bands; the signal processing unit comprising at least one signal processing module configured to perform the steps of: processing the audio signal input to provide a frequency domain converted signal input; providing a first signal path comprising the step of employing a neural network for running a machine-learned model to receive the converted signal input and provide a first output representing a primary filter; providing a second signal path to reconstruct the audio signal input; and applying the primary filter to the frequency-domain representation of the audio signal input to provide a signal output.
Description
AUDIO SIGNAL PROCESSING SYSTEMS AND METHODS
Technical Field
Embodiments of the present invention generally relate to audio processing systems and methods, in particular in relation to hearing protection devices.
Background
According to the World Health Organisation, noise pollution is the second biggest environmental problem affecting health. Prolonged exposure to noise pollution can have detrimental effects on health, such as cardiovascular disease, cognitive impairment, tinnitus and hearing loss. Noise pollution is particularly evident in mining, manufacturing and construction industries.
Noise protective devices exist and include earplugs, earmuffs, radio-integrated headsets and noise-proof panels. However, existing devices offer limited communication and no selective control. For example, in the manufacturing industry, noise levels are particularly high and require the use of common ear protective devices, which may hinder someone's ability to hear a co-worker shouting or asking for help, or a safety alarm.
Aspects of the present invention aim to overcome problems with the existing devices.
Summary
According to a first, independent aspect of the invention, there is provided an audio signal detection system comprising a signal processing unit, the system comprising a receiver for receiving an audio signal input having a plurality of audio frequency bands; the signal processing unit comprising at least one signal processing module configured to perform the steps of: processing the audio signal input to provide a frequency domain converted signal input; providing a first signal path comprising the step of employing a neural network for running a machine-learned model to receive the converted signal input and provide a first output representing a primary filter; applying the primary filter to the frequency domain converted signal output; and providing a second signal path to reconstruct the audio signal input into time domain and to provide a signal output.
In a dependent aspect, the converted signal input comprises temporal information and the machine-learned model is configured to detect and process temporal information.
In a dependent aspect, employing a neural network comprises using at least one recursive neural layer.
In a dependent aspect, processing the audio signal input to provide a frequency domain converted signal input comprises calculating, for each one of the plurality of audio frequency bands, respective magnitude and phase values of a plurality of short-time Fourier transforms, STFT, for each audio frequency band.
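As a rough illustration of this step, the per-frame magnitude and phase values of a 256-point STFT can be computed as follows. This sketch is not the claimed implementation: the frame length, hop size, Hann window, the 16 kHz sampling rate and the function name `stft_frames` are all assumptions made for the example.

```python
import numpy as np

def stft_frames(x, n_fft=256, hop=128):
    """Split x into windowed frames and take the real FFT of each,
    giving magnitude and phase values for every frequency band."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)   # shape (n_frames, n_fft // 2 + 1)
    return np.abs(spec), np.angle(spec)

# A 256-point real FFT yields 129 frequency bins per frame.
fs = 16000
x = np.sin(2 * np.pi * 1000 * np.arange(4096) / fs)  # 1 kHz test tone
mag, phase = stft_frames(x)
```

The phase array is kept aside unaltered, consistent with the second signal path used for reconstruction.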
In a further dependent aspect, the step of processing the audio signal input to provide a frequency domain converted signal input further comprises sampling the converted signal input, for example using Mel-frequency binning. This represents a pre-processing step where the STFT magnitude values are resampled into Mel space.
In a dependent aspect, the first output comprises at least one frequency band identified by the machine-learned model and at least one attenuation value.
In a further dependent aspect, the magnitude values of a plurality of short-time Fourier transforms are multiplied by a value of the first output.
In a dependent aspect, reconstructing the audio signal input comprises applying an inverse short-time Fourier transform calculation.
In a dependent aspect, running a machine-learned model to receive the converted signal input further provides a second output comprising a confidence value.
In a further dependent aspect, the signal processing module is configured to use the confidence value to identify undesirable audio frequency bands and provide a plurality of filter parameters representing a secondary filter.
In a further dependent aspect, the step of applying the primary filter to the frequency-domain audio signal input comprises applying the secondary filter to the reconstructed audio signal input.
In a dependent aspect, the signal processing unit comprises a filter bank module comprising a plurality of filters for receiving the output of the machine-learned model, wherein the at least one audio frequency band is identified based on the stacked output of a power of each one of the plurality of filters.
In a dependent aspect, the system further comprises an output device configured to receive the audio signal output.
Advantageously, the system can recognise and identify sounds (e.g. machinery noise) received from two or more microphones for example. This represents a solution for detecting frequency bands of an input audio signal which contain particular audio information.
Accordingly, a smart selective noise control solution is enabled. In particular, when employed as part of a noise control or selective control device, the system allows users to select the sounds they want to hear or remove. This advantageously leads to a more enjoyable user experience as well as improved safety in the working environment.
For example, devices according to embodiments of the invention can guarantee that the user is never exposed to noise levels above the safety limits: they can automatically lower any sound higher than the prescribed level.
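One very simple way to realise this behaviour is a block-wise limiter that scales the signal down whenever its peak exceeds a prescribed safe level. This is an editorial sketch rather than the patented mechanism; the function name and the 0.25 full-scale threshold are illustrative assumptions.

```python
import numpy as np

def limit_to_safe_level(x, safe_peak=0.25):
    """Attenuate the whole block so that its peak never exceeds the
    prescribed safe level; quieter blocks pass through unchanged."""
    peak = np.max(np.abs(x))
    return x * (safe_peak / peak) if peak > safe_peak else x

loud = np.array([0.0, 0.5, -1.0, 0.8])   # peaks above the safe level
quiet = np.array([0.0, 0.1, -0.2])       # already below the safe level
limited = limit_to_safe_level(loud)
```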
Advantageously, the system may be implemented on computational devices including mobile devices such as mobile phones. It will be appreciated that the system may be deployed using any existing framework that already supports a processing unit such as an embedded/lightweight microcontroller, or by employing any suitable AI method that can be ported to the chosen device.
Advantageously, all input audio signals may be attenuated to a safe threshold when returned to the users of the system.
In a dependent aspect, there is provided an audio processing system for real-time noise control and selection for use on portable or wearable devices, the system comprising an audio signal detection system described above.
In a further dependent aspect, the audio processing system comprises a controller. The controller may be a processor of the type known in the art at hardware level.
In a further dependent aspect, the audio processing system comprises a user interface for receiving a user input. The input may include an audio signal threshold level. In preferred embodiments, the user input represents a selection to attenuate or/and amplify noise detected in the audio signal input. The user interface may be a hardware or a software interface which allows the user of the system to select the desired functionality of the system. Users can selectively attenuate or/and amplify any noise identified by the software.
In further dependent aspects, the audio processing system further comprises one or more of: at least one microphone, a sound pressure level measurement device, one or more speakers, and a data storage device. In preferred embodiments, mass storage facilitates the collection of data which can then be uploaded to a server location and used to train and improve the algorithms. The mass storage may take the form of a Micro SD peripheral (i.e. a memory storage card) included within the device.
In a dependent aspect, the audio processing system comprises at least one indicator device; for example, the indicator devices may comprise a plurality of light-emitting diodes. For example, a hardware version of the user interface including an LED indicator can indicate the functionality of the system currently enabled and/or the battery status.
In a dependent aspect, there is provided a noise protective device comprising an audio processing system as described above. The noise protective device may be integrated in a hearing protection or communication device e.g. headset or earphone etc.
According to a second, independent aspect of the invention, there is provided an audio signal processing method comprising the steps of: providing a signal processing unit, the unit comprising at least one signal processing module and a receiver for receiving an audio signal input having a plurality of audio frequency bands; processing the audio signal input to provide a frequency domain converted signal input; providing a first signal path comprising the step of employing a neural network for running a machine-learned model to receive the converted signal input and provide a first output representing a primary filter; providing a second signal path to reconstruct the audio signal input; and applying the primary filter to the frequency-domain representation of the audio signal input to provide a signal output.
Aspects of the present invention are now described with reference to the examples shown in the accompanying Figures.
Brief Description of the Drawings
Figure 1A is a block diagram of a signal filtering process;
Figure 1B shows an example of signal values corresponding to the process of Figure 1A;
Figure 1C is a block diagram of a secondary filtering method example;
Figure 1D is an example model used at step 300 of Figure 1A;
Figure 1 E shows an example of frequency bands used for applying neural network outputs;
Figure 2 illustrates an audio input storage and data collection process;
Figure 3 illustrates hardware components of a system according to a preferred embodiment.
Detailed Description
With reference to Figure 1A, an audio signal filtering process 1000 is described.
In a preferred embodiment, an input audio signal 100 representing a raw audio signal comprising a plurality of frequency bands is processed at step 200 using a series of short-time Fourier transforms (STFT) for each frequency band. For example, the frequency bands of the input audio signal 100 can be identified based on a stacked output of the power of each filter bank or the stacked magnitude of a series of STFTs for each frequency band.
The system is provided with the input audio signal 100 from a microphone 1, an example of which is shown in Figure 3. At step 200, the signal is converted to the frequency domain using an STFT and sent via two signal paths to processing steps 300 and 500.
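By way of a non-limiting illustration (an editor's sketch, not part of the original disclosure), the conversion at step 200 and the split into the two signal paths may look as follows; the 16 kHz sampling rate, the test tone, and the use of SciPy's `stft` are assumptions:

```python
import numpy as np
from scipy.signal import stft

# Sketch of step 200: convert the raw microphone signal to the frequency
# domain with a 256-point STFT and split it into the two paths described
# above (magnitude towards step 300, phase towards step 500).
fs = 16_000                                # assumed sampling rate
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)            # stand-in for input audio signal 100

f, frames, Z = stft(x, fs=fs, nperseg=256)
magnitude = np.abs(Z)                      # first signal path (to step 300)
phase = np.angle(Z)                        # second signal path (to step 500)
# A 256-point STFT yields 129 one-sided frequency bins per frame.
```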
Figure 1B shows an example of signal values corresponding to the process of Figure 1A. In this example, the input audio signal is sampled with 256 audio signal input values corresponding to 256 frequency bands. The STFT magnitude and phase values are also shown on the signal value plot corresponding to step 200 in Figure 1B.
Referring back to Figure 1A, at step 300 a neural network is employed to run an AI model for detecting frequency bands of the input audio signal which contain desired information. It will be appreciated that a number of AI models are suitable. In a preferred embodiment, the AI model has the capability to detect and maintain temporal information about a series of inputs, as long as these inputs can be provided and processed by the AI model fast enough to provide the output in real-time.
Accordingly, the AI model may receive as input a down-sampled frequency domain input. The input may be down-sampled using known methods such as Mel-frequency binning to a fixed number of bins. The AI model outputs, for each frequency band, an attenuation value indicating how loud that band should be in the reconstructed signal.
As shown in Figure 1B, in an example, step 300 is split into two steps, 300a and 300b. At step 300a, which represents a pre-processing step, the STFT magnitude is resampled into Mel-space using 20 bins. The resulting down-sampled spectrum after step 300a is shown in the plot of Figure 1B. At step 300b, the neural network acts to provide the AI model output. In this example the phase is not processed by the AI model, but is used unaltered at step 500 (via the second signal path) to reconstruct the audio signal, e.g. using an inverse short-time Fourier transform (ISTFT).
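The resampling into Mel-space at step 300a can be sketched as below; the triangular envelope shapes and the particular Mel formula are illustrative assumptions by the editor, not taken from the disclosure:

```python
import numpy as np

def mel_envelopes(n_fft_bins: int = 129, n_mels: int = 20, fs: int = 16_000):
    """Triangular band envelopes mapping 129 STFT bins onto 20 Mel bins.

    A simplified sketch of step 300a; exact envelope shapes are assumed.
    """
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)

    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    # Band edges equally spaced on the Mel scale, then mapped to FFT bins.
    edges_mel = np.linspace(0.0, hz_to_mel(fs / 2), n_mels + 2)
    edges_bin = np.floor(mel_to_hz(edges_mel) / (fs / 2) * (n_fft_bins - 1)).astype(int)

    env = np.zeros((n_mels, n_fft_bins))
    for m in range(n_mels):
        lo, mid, hi = edges_bin[m], edges_bin[m + 1], edges_bin[m + 2]
        if mid > lo:                       # rising edge of the triangle
            env[m, lo:mid] = np.linspace(0.0, 1.0, mid - lo, endpoint=False)
        env[m, mid:hi + 1] = np.linspace(1.0, 0.0, hi - mid + 1)  # falling edge
    return env

env = mel_envelopes()
magnitude = np.random.default_rng(0).random(129)  # stand-in STFT magnitude frame
mel_frame = env @ magnitude                       # 20-value input to step 300b
```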
Figure 1D illustrates an example model used in step 300 to process the Mel-binned (resampled) STFT signal and produce the confidence values for 20 frequency bands. Each output corresponds to a frequency band of the STFT magnitude shown in Figure 1D.
In this example, an input layer with 20 neurons, one neuron for each Mel-binned value or frequency band, is provided. For optimum performance, the layer may be a recursive layer. Hidden layers, usually 2-3 layers deep, are provided with 3-4 times the width of the input. Typically, the first 1-2 layers are a type of recursive layer. The output layer also has 20 neurons, one neuron for each Mel-binned value or frequency band. The output of step 300 is then sent to steps 400 and 600, respectively.
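The shape of such a model can be illustrated with a minimal forward pass. The plain Elman-style recurrence, the 64-unit hidden width, the random weights, and the softplus output activation below are editor's assumptions for illustration only; the disclosure specifies only the 20-in/20-out shape and recursive early layers:

```python
import numpy as np

# Editor's sketch of the Figure 1D model: 20 inputs (one per Mel bin), a
# recurrent hidden layer roughly 3x the input width, and 20 outputs.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 20, 64, 20

W_xh = rng.standard_normal((n_hidden, n_in)) * 0.1
W_hh = rng.standard_normal((n_hidden, n_hidden)) * 0.1
W_hy = rng.standard_normal((n_out, n_hidden)) * 0.1
h = np.zeros(n_hidden)                    # state carrying temporal information

def step(x, h):
    """One time step: update the hidden state, emit 20 per-band gains."""
    h = np.tanh(W_xh @ x + W_hh @ h)
    gains = np.log1p(np.exp(W_hy @ h))    # softplus: gains >= 0, may exceed 1.0
    return gains, h

for _ in range(5):                        # a short series of Mel-binned frames
    gains, h = step(rng.random(n_in), h)
```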
Figure 1E shows an example of the frequency bands that each neuron's output is applied to. For example, all values within the B1 frequency band envelope are multiplied by the first output neuron, all within B2 are multiplied by the second neuron's output, and so on. These envelopes may also be used for down-sampling the signal from the 129 magnitude values that result from a 256-point STFT down to the 20 bins required by the neural network input.
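The per-band multiplication described here (and applied at step 400) can be sketched as follows; the non-overlapping rectangular envelopes are an editor's simplification of the envelopes B1, B2, ... in Figure 1E, which may overlap:

```python
import numpy as np

# Editor's sketch: each of the 20 model outputs scales every STFT magnitude
# value inside its band envelope.
n_bins, n_bands = 129, 20                 # 129 magnitudes from a 256-point STFT
edges = np.linspace(0, n_bins, n_bands + 1).astype(int)  # band boundaries

magnitude = np.ones(n_bins)               # stand-in STFT magnitude frame
gains = np.full(n_bands, 0.5)             # stand-in neural network outputs

scaled = magnitude.copy()
for b in range(n_bands):                  # all values within band b x output b
    scaled[edges[b]:edges[b + 1]] *= gains[b]
```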
At step 400, the frequency gains are applied to the frequency domain representation of the original signal via simple multiplication of each respective frequency band by its respective gain. The plot in Figure 1B corresponding to step 400 shows an example of STFT magnitude values multiplied by the AI model outputs received from step 300 (or 300a and 300b).
It will be appreciated that each gain may correspond to a range of frequencies. The frequency ranges are chosen at system implementation time depending on the requirements and usually match up with the down-sampling at step 200.
At step 500, the signal is reconstructed into the time domain, in this example using an inverse short-time Fourier transform (ISTFT).
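The reconstruction at step 500, recombining the (gain-scaled) magnitude with the unaltered phase from the second signal path, can be sketched as below; SciPy's `istft` and unit gains are assumed so that the round trip recovers the input, checking the reconstruction path in isolation:

```python
import numpy as np
from scipy.signal import stft, istft

# Editor's sketch of step 500: invert magnitude * exp(j * phase) with an ISTFT.
fs = 16_000                                # assumed sampling rate
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)

_, _, Z = stft(x, fs=fs, nperseg=256)
magnitude, phase = np.abs(Z), np.angle(Z)

gains = np.ones_like(magnitude)            # identity filter for this check
Z_out = gains * magnitude * np.exp(1j * phase)
_, x_rec = istft(Z_out, fs=fs, nperseg=256)
# With matching STFT/ISTFT parameters the input is recovered (up to padding).
```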
Accordingly, the AI model output is used to identify desired signal components in the output signal. At step 600, it is decided whether such activity has been detected with high confidence by measuring the magnitudes of the outputs (for example, if all outputs are close to or above 1.0, it is decided that the model detects useful information in all bands). At steps 700 and 800, a finite or infinite impulse response (FIR/IIR) filter is provided, the filter being dynamically updated to remove any remaining noise in the signal to enhance the overall output. The processed signal (output) is provided to an output system at step 900.
Figure 1C shows an example implementation of secondary filtering methods, spanning steps 600 to 900. It will be appreciated that there are several possible implementations of the processing path 600-900. In this example, the neural network outputs and a simple thresholding mechanism are used to determine whether each band contains useful signals (at step 600). In this example, at step 600 it is determined whether a frequency band has a confidence level of less than or equal to 0.3; if so, the frequency band is identified as one to be filtered out (an undesired frequency band). At step 700, a filter matching the desired frequency response is provided, using any desired method such as Wiener filter design, least-squares filter design, etc. For example, FIR filter parameters are calculated that match the frequency response required to filter out all of the frequency bands identified to be filtered out.
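Steps 600-700 can be sketched with the 0.3 confidence threshold and a frequency-sampling FIR design; the normalized band grid, the stand-in confidence values, and the 65-tap length are editor's assumptions for illustration:

```python
import numpy as np
from scipy.signal import firwin2

# Editor's sketch of steps 600-700: threshold the per-band confidences at 0.3
# (bands at or below it are filtered out) and design an FIR filter whose
# frequency response suppresses those bands.
n_bands = 20
conf = np.linspace(0.0, 1.0, n_bands)      # stand-in confidences from step 300
keep = conf > 0.3                          # step 600: <= 0.3 -> undesired band

# Desired response sampled on a normalized grid (0.0 = DC, 1.0 = Nyquist).
freq = np.linspace(0.0, 1.0, n_bands)
gain = np.where(keep, 1.0, 0.0)

taps = firwin2(65, freq, gain)             # step 700: FIR matching the response
```

At step 800 this filter could then be applied to the reconstructed signal, e.g. with `scipy.signal.lfilter(taps, 1.0, x_rec)`.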
The constructed filter from step 700 is then applied at step 800 to the reconstructed signal from step 500 and the filtered signal is output at step 900.
The process 1000 of Figure 1A can advantageously identify bands in the frequency spectrum of the input signal 100 which contain useful audio information such as voice or any desired audio signal. The identified frequency bands which are deemed to be useful are then kept, whilst the frequency bands containing information that is deemed to be noise are discarded by attenuating their respective frequency bands in the frequency domain representation.
This approach is advantageous because it allows for live filtering of the input audio signal with very little knowledge of the exact spectral densities of the desired signal. The approach is also lightweight enough to work on embedded hardware such as microcontrollers or Field Programmable Gate Arrays (FPGAs). The process allows for attenuation of sporadic noises, while most common modern approaches can only reliably cancel out continuous noises.
The output of the neural network 300 can also be used as a reliable voice activity detector (VAD), which can then be used in conjunction with more traditional filtering, such as a Wiener filter, to enhance the quality of the processed sound.
Advantages of the filtering process 1000 of Figure 1 A over the prior art include but are not limited to:
• Explicit detection of frequency bands containing voice or desired signal.
• Voice activity detection and/or desired signal detection.
• Ability to employ any filtering method in tandem with the Al application.
• Small network capable of running on mobile devices in real-time.
• Ability to attenuate sporadic noises.
• Separation of the desired signals from the source signal.
Figure 2 illustrates an audio input storage and data collection process 2000 for the device. A storage unit 21 is present on the device and may be in communication with a computer when the device is charging. While the device is charging, the encrypted audio data is uploaded to a server while software updates are downloaded to the device. The recorded audio data can be accessed for monitoring purposes and to improve and retrain the models.
For example, audio samples are recorded onto the device storage unit 21 during daily operation. When not in use, the device may be in communication with an on-site server 22 and upload the audio data to the server. The on-site server 22 is configured in this example to select the most distinct audio samples and upload them to an off-site server or cloud server 23. The off-site/cloud server 23 integrates the new data into a training dataset and may employ additional AI models to train with the new dataset 24. The retrained model is then dispatched back to the deployed devices (i.e. via devices 23, 22, and 21 in sequence).
Accordingly, mass storage facilitates the collection of data which can be uploaded to a server location and used to train and improve the algorithms.
Figure 3 illustrates an example embodiment of a system 3000 representing a hearing protection headset. In this example, the system comprises two microphones 1 for audio sampling, although it will be appreciated that the number of microphones may vary. Preferably, the system further comprises a sound pressure level measurement device (not shown). Speakers 2 are provided for replaying the processed audio in real-time. The number of speakers may vary.
The system further comprises a controller unit 3 comprising a controller (i.e. a processor) and other electronic peripherals required to operate the system, such as a memory storage unit. For example, the mass storage may take the form of a Micro SD card included within the device. An LED indicator 4 is also provided to indicate, for example, when the headset 3000 is gathering audio samples.
Claims
1. An audio signal detection system comprising a signal processing unit, the system comprising a receiver for receiving an audio signal input having a plurality of audio frequency bands; the signal processing unit comprising at least one signal processing module configured to perform the steps of: processing the audio signal input to provide a frequency domain converted signal input; providing a first signal path comprising the step of employing a neural network for running a machine-learned model to receive the converted signal input and provide a first output representing a primary filter; providing a second signal path to reconstruct the audio signal input; and applying the primary filter to the frequency-domain representation of the audio signal input to provide a signal output.
2. A system according to claim 1, wherein the converted signal input comprises temporal information and the machine-learned model is configured to detect and process temporal information.
3. A system according to claim 1 or claim 2, wherein employing a neural network comprises using at least one recursive neural layer.
4. A system according to any one of the preceding claims, wherein processing the audio signal input to provide a frequency domain converted signal input comprises calculating, for each one of the plurality of audio frequency bands, respective magnitude and phase values of a plurality of short-time Fourier transforms, STFT, for each audio frequency band.
5. A system according to any one of the preceding claims, wherein the step of processing the audio signal input to provide a frequency domain converted signal input further comprises sampling the converted signal input.
6. A system according to any one of the preceding claims, wherein the first output comprises at least one frequency band identified by the machine-learned model and at least one corresponding attenuation value.
7. A system according to claim 6, wherein the magnitude values of a plurality of short-time Fourier transforms are multiplied by the respective attenuation values.
8. A system according to any one of the preceding claims, wherein reconstructing the audio signal input comprises making an inverse short-time Fourier transform calculation.
9. A system according to any one of the preceding claims, wherein running the machine-learned model further provides a second output comprising a confidence value.
10. A system according to claim 9, wherein the signal processing module is configured to use the confidence value to identify undesirable audio frequency bands and provide a plurality of filter parameters representing a secondary filter.
11. A system according to any one of the preceding claims, wherein the step of applying the primary filter to the frequency-domain representation of the audio signal input comprises applying the secondary filter to the reconstructed audio signal input.
12. A system according to any one of the preceding claims, wherein the signal processing unit comprises a filter bank module comprising a plurality of filters for receiving the output of the machine-learned model, wherein the at least one audio frequency band is identified based on the stacked output of a power of each one of the plurality of filters.
13. An audio processing system for real-time noise control and selection for use on portable or wearable devices, the system comprising an audio signal detection system according to any one of the preceding claims.
14. A noise protective device comprising an audio processing system according to any one of claims 1 to 12.
15. An audio signal processing method comprising the steps of: providing a signal processing unit comprising at least one signal processing module and a receiver for receiving an audio signal input having a plurality of audio frequency bands;
processing the audio signal input to provide a frequency domain converted signal input; providing a first signal path comprising the step of employing a neural network for running a machine-learned model to receive the converted signal input and provide a first output representing a primary filter; providing a second signal path to reconstruct the audio signal input; and applying the primary filter to the frequency-domain representation of the audio signal input to provide a signal output.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2018375.2 | 2020-11-23 | ||
GBGB2018375.2A GB202018375D0 (en) | 2020-11-23 | 2020-11-23 | Audio signal processing systems and methods |
EP21190382.8 | 2021-08-09 | ||
EP21190382.8A EP4002361A1 (en) | 2020-11-23 | 2021-08-09 | Audio signal processing systems and methods |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022106691A1 (en) | 2022-05-27 |
Family
ID=74046902
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2021/082520 WO2022106691A1 (en) | 2020-11-23 | 2021-11-22 | Audio signal processing systems and methods |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP4002361A1 (en) |
GB (1) | GB202018375D0 (en) |
WO (1) | WO2022106691A1 (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2151822A1 (en) * | 2008-08-05 | 2010-02-10 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus and method for processing and audio signal for speech enhancement using a feature extraction |
US20190080710A1 (en) * | 2017-09-12 | 2019-03-14 | Board Of Trustees Of Michigan State University | System and apparatus for real-time speech enhancement in noisy environments |
2020
- 2020-11-23: GB application GBGB2018375.2A filed (GB202018375D0) — not active, Ceased
2021
- 2021-08-09: EP application EP21190382.8A filed (EP4002361A1) — active, Pending
- 2021-11-22: PCT application PCT/EP2021/082520 filed (WO2022106691A1) — active, Application Filing
Non-Patent Citations (4)
Title |
---|
BORGSTROM BENGT J ET AL: "Improving Statistical Model-Based Speech Enhancement with Deep Neural Networks", 2018 16TH INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT (IWAENC), IEEE, 17 September 2018 (2018-09-17), pages 471 - 475, XP033439100, DOI: 10.1109/IWAENC.2018.8521382 * |
JAMAL NOREZMI ET AL: "A Hybrid Approach for Single Channel Speech Enhancement using Deep Neural Network and Harmonic Regeneration Noise Reduction", vol. 11, 1 January 2020 (2020-01-01), XP055880063, Retrieved from the Internet <URL:https://thesai.org/Downloads/Volume11No10/Paper_33-A_Hybrid_Approach_for_Single_Channel_Speech_Enhancement.pdf> * |
SALEEM N ET AL: "Spectral Phase Estimation Based on Deep Neural Networks for Single Channel Speech Enhancement", JOURNAL OF COMMUNICATIONS TECHNOLOGY AND ELECTRONICS, NAUKA/INTERPERIODICA PUBLISHING, MOSCOW, RU, vol. 64, no. 12, 1 December 2019 (2019-12-01), pages 1372 - 1382, XP037029872, ISSN: 1064-2269, [retrieved on 20200221], DOI: 10.1134/S1064226919120155 * |
VALIN JEAN-MARC: "A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement", 31 May 2018 (2018-05-31), pages 1 - 5, XP055783657, ISBN: 978-1-5386-6070-6, Retrieved from the Internet <URL:https://arxiv.org/pdf/1709.08243.pdf> DOI: 10.1109/MMSP.2018.8547084 * |
Also Published As
Publication number | Publication date |
---|---|
GB202018375D0 (en) | 2021-01-06 |
EP4002361A1 (en) | 2022-05-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21810381 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
122 | Ep: pct application non-entry in european phase |
Ref document number: 21810381 Country of ref document: EP Kind code of ref document: A1 |