EP4002361A1 - Audio signal processing systems and methods - Google Patents

Audio signal processing systems and methods Download PDF

Info

Publication number
EP4002361A1
EP4002361A1
Authority
EP
European Patent Office
Prior art keywords
signal input
audio
audio signal
output
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP21190382.8A
Other languages
German (de)
French (fr)
Inventor
Lisa Rossi
Mateusz Leputa
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to PCT/EP2021/082520 (published as WO2022106691A1)
Publication of EP4002361A1
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An audio signal detection system comprising a signal processing unit, the system comprising a receiver for receiving an audio signal input having a plurality of audio frequency bands; the signal processing unit comprising at least one signal processing module configured to perform the steps of: processing the audio signal input to provide a frequency domain converted signal input; providing a first signal path comprising the step of employing a neural network for running a machine-learned model to receive the converted signal input and provide a first output representing a primary filter; providing a second signal path to reconstruct the audio signal input; and applying the primary filter to the frequency-domain representation of the audio signal input to provide a signal output.

Description

    Technical Field
  • Embodiments of the present invention generally relate to audio processing systems and methods, in particular in relation to hearing protection devices.
  • Background
  • According to the World Health Organisation, noise pollution is the second biggest environmental problem affecting health. Prolonged exposure to noise pollution can have detrimental effects on health, such as cardiovascular disease, cognitive impairment, tinnitus and hearing loss. Noise pollution is particularly evident in mining, manufacturing and construction industries.
  • Noise protective devices exist and include earplugs, earmuffs, radio-integrated headsets and noise proof panels. However, existing devices offer limited communication and no selective control. For example, in the manufacturing industry, noise levels are particularly high and require the use of common ear protective devices, which may hinder a worker's ability to hear a co-worker shouting or asking for help, or a safety alarm.
  • Aspects of the present invention aim to overcome problems with the existing devices.
  • Summary
  • According to a first, independent aspect of the invention, there is provided an audio signal detection system comprising a signal processing unit, the system comprising a receiver for receiving an audio signal input having a plurality of audio frequency bands;
    the signal processing unit comprising at least one signal processing module configured to perform the steps of:
    • processing the audio signal input to provide a frequency domain converted signal input;
    • providing a first signal path comprising the step of employing a neural network for running a machine-learned model to receive the converted signal input and provide a first output representing a primary filter;
    • applying the primary filter to the frequency domain converted signal input; and
    • providing a second signal path to reconstruct the audio signal input into time domain and to provide a signal output.
  • In a dependent aspect, the converted signal input comprises temporal information and the machine-learned model is configured to detect and process temporal information.
  • In a dependent aspect, employing a neural network comprises using at least one recursive neural layer.
  • In a dependent aspect, processing the audio signal input to provide a frequency domain converted signal input comprises calculating, for each one of the plurality of audio frequency bands, respective magnitude and phase values of a plurality of short-time Fourier transforms, STFT, for each audio frequency band.
  • In a further dependent aspect, the step of processing the audio signal input to provide a frequency domain converted signal input further comprises sampling the converted signal input, for example using Mel-frequency binning. This represents a pre-processing step where the STFT magnitude values are resampled into Mel space.
  • In a dependent aspect, the first output comprises at least one frequency band identified by the machine-learned model and at least one attenuation value.
  • In a further dependent aspect, the magnitude values of a plurality of short-time Fourier transforms are multiplied by a value of the first output.
  • In a dependent aspect, reconstructing the audio signal input comprises applying an inverse short-time Fourier transform calculation.
  • In a dependent aspect, running a machine-learned model to receive the converted signal input further provides a second output comprising a confidence value.
  • In a further dependent aspect, the signal processing module is configured to use the confidence value to identify undesirable audio frequency bands and provide a plurality of filter parameters representing a secondary filter.
  • In a further dependent aspect, the step of applying the primary filter to the frequency-domain audio signal input comprises applying the secondary filter to the reconstructed audio signal input.
  • In a dependent aspect, the signal processing unit comprises a filter bank module comprising a plurality of filters for receiving the output of the machine-learned model, wherein the at least one audio frequency band is identified based on the stacked output of a power of each one of the plurality of filters.
  • In a dependent aspect, the system further comprises an output device configured to receive the audio signal output.
  • Advantageously, the system can recognise and identify sounds (e.g. machinery noise) received from two or more microphones for example. This represents a solution for detecting frequency bands of an input audio signal which contain particular audio information.
  • Accordingly, a smart selective noise control solution is enabled. In particular, when employed as part of a noise control or selective control device, the system allows users to select the sounds they want to hear or remove. This advantageously leads to a more enjoyable user experience as well as improved safety in the working environment.
  • For example, devices according to embodiments of the invention can guarantee that the user is never exposed to noise levels above the safety limits, since they automatically lower any sound above the prescribed level.
  • Advantageously, the system may be implemented on computational devices including mobile devices such as mobile phones. It will be appreciated that the system may be deployed using any existing framework that already supports a processing unit such as an embedded/lightweight microcontroller, or by porting any suitable AI method to the chosen device.
  • Advantageously, the entire input audio signal may be attenuated to a safe threshold before being returned to the users of the system.
  • In a dependent aspect, there is provided an audio processing system for real-time noise control and selection for use on portable or wearable devices, the system comprising an audio signal detection system described above.
  • In a further dependent aspect, the audio processing system comprises a controller. The controller may be a hardware processor of a type known in the art.
  • In a further dependent aspect, the audio processing system comprises a user interface for receiving a user input. The input may include an audio signal threshold level. In preferred embodiments, the user input represents a selection to attenuate and/or amplify noise detected in the audio signal input. The user interface may be a hardware or a software interface which allows the user of the system to select the desired functionality of the system. Users can selectively attenuate and/or amplify any noise identified by the software.
  • In further dependent aspects, the audio processing system further comprises one or more of: at least one microphone, a sound pressure level measurement device, one or more speakers, and a data storage device. In preferred embodiments, mass storage facilitates the collection of data which can then be uploaded to a server location and used to train and improve the algorithms. The mass storage may take the form of a Micro SD peripheral (i.e. a memory storage card) included within the device.
  • In a dependent aspect, the audio processing system comprises at least one indicator device; for example, the indicator devices may comprise a plurality of light-emitting diodes. For example, a hardware version of the user interface including an LED indicator can indicate the currently enabled functionality of the system and/or battery status.
  • In a dependent aspect, there is provided a noise protective device comprising an audio processing system as described above. The noise protective device may be integrated in a hearing protection or communication device, e.g. a headset or earphones.
  • According to a second, independent aspect of the invention, there is provided an audio signal processing method comprising the steps of:
    • providing a signal processing unit comprising at least one signal processing module and a receiver for receiving an audio signal input having a plurality of audio frequency bands;
    • processing the audio signal input to provide a frequency domain converted signal input;
    • providing a first signal path comprising the step of employing a neural network for running a machine-learned model to receive the converted signal input and provide a first output representing a primary filter;
    • providing a second signal path to reconstruct the audio signal input; and
    • applying the primary filter to the frequency-domain representation of the audio signal input to provide a signal output.
  • Aspects of the present invention are now described with reference to the examples shown in the accompanying Figures.
  • Brief Description of the Drawings
    • Figure 1A is a block diagram of a signal filtering process;
    • Figure 1B shows an example of signal values corresponding to the process of Figure 1A;
    • Figure 1C is a block diagram of a secondary filtering method example;
    • Figure 1D is an example model used at step 300 of Figure 1A;
    • Figure 1E shows an example of frequency bands used for applying neural network outputs;
    • Figure 2 illustrates an audio input storage and data collection process;
    • Figure 3 illustrates hardware components of a system according to a preferred embodiment.
    Detailed Description
  • With reference to Figure 1A, an audio signal filtering process 1000 is described.
  • In a preferred embodiment, an input audio signal 100 representing a raw audio signal comprising a plurality of frequency bands is processed at step 200 using a series of short-time Fourier transforms (STFT) for each frequency band. For example, the frequency bands of an input audio signal 100 can be identified based on a stacked output of the power of each filter bank or the stacked magnitude of a series of short-time Fourier transforms (STFT) for each frequency band.
  • The system is provided with the input audio signal 100 from a microphone 1, an example of which is shown in Figure 3. At step 200, the signal is converted to the frequency domain using an STFT and sent via two signal paths to processing steps 300 and 500.
  • Figure 1B shows an example of signal values corresponding to the process of Figure 1A. In this example, the input audio signal is sampled with 256 audio signal input values corresponding to 256 frequency bands. The STFT magnitude and phase values are also shown on the signal value plot corresponding to step 200 in Figure 1B.
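  • As an illustration of the step 200 conversion, the following minimal sketch (an example under stated assumptions, not the patent's implementation; the sample rate, window and use of SciPy are choices made here) computes a 256-point STFT and separates the magnitude values, which feed the first signal path, from the phase values, which feed the second:

```python
import numpy as np
from scipy.signal import stft

fs = 16000                      # assumed sample rate; the patent does not specify one
audio = np.random.randn(fs)     # stand-in for the raw input audio signal 100

# 256-point STFT -> 256/2 + 1 = 129 frequency bins per frame
f, t, Z = stft(audio, fs=fs, nperseg=256)
magnitude = np.abs(Z)           # first signal path: processed by the AI model
phase = np.angle(Z)             # second signal path: kept unaltered for reconstruction
```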
  • Referring back to Figure 1A, at step 300 a neural network is employed to run an AI model for detecting frequency bands of the input audio signal which contain desired information. It will be appreciated that a number of AI models are suitable. In a preferred embodiment, the AI model has the capability to detect and maintain temporal information about a series of inputs, as long as these inputs can be provided and processed by the AI model fast enough to provide the output in real time.
  • Accordingly, the AI model may receive as input a down-sampled frequency domain input. The input may be down-sampled using known methods such as Mel-frequency binning to a fixed number of bins. The AI model outputs an attenuation value for each frequency band, i.e. how loud that band should be in the reconstructed signal.
  • As shown in Figure 1B, in an example, step 300 is split into two steps, 300a and 300b. At step 300a, which represents a pre-processing step, the STFT magnitude is resampled into Mel-space, using 20 bins. The result of the down-sampled spectrum after step 300a is shown in the plot of Figure 1B. At step 300b, the neural network acts to provide the AI model output. In this example, the phase is not processed by the AI model, but it is used unaltered at step 500 (via the second signal path) to reconstruct the audio signal, e.g. using an inverse short-time Fourier transform (ISTFT).
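  • A minimal sketch of the step 300a pre-processing, continuing the STFT example above, is given below; the triangular envelope shapes and their frequency placement are assumptions, since the patent only specifies resampling into Mel space with 20 bins:

```python
def mel_filterbank(n_bins=20, n_fft_bins=129, fs=16000):
    """Build triangular Mel-spaced envelopes mapping 129 STFT bins to 20 Mel bins."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_bins + 2))
    bin_freqs = np.linspace(0.0, fs / 2.0, n_fft_bins)
    fb = np.zeros((n_bins, n_fft_bins))
    for i in range(n_bins):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rise = (bin_freqs - lo) / (mid - lo)    # rising edge of the triangle
        fall = (hi - bin_freqs) / (hi - mid)    # falling edge of the triangle
        fb[i] = np.clip(np.minimum(rise, fall), 0.0, None)
    return fb

fb = mel_filterbank()
mel_frames = fb @ magnitude     # (20, n_frames): down-sampled spectrum after step 300a
```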
  • Figure 1D illustrates an example model used in step 300 to process the Mel-binned (resampled) STFT signal and produce the confidence values for 20 frequency bands. Each output corresponds to a frequency band on the STFT magnitude shown in Figure 1E.
  • In this example, an input layer with 20 neurons, one neuron for each Mel-binned value or frequency band, is provided. For optimum performance, the layer may be a recursive layer. Hidden layers, usually 2-3 layers deep, are provided with 3-4 times the width of the input. Typically, the first 1-2 layers are a type of recursive layer. The output layer also has 20 neurons, one neuron for each Mel-binned value or frequency band. The output of step 300 is then sent to steps 400 and 600. One possible realisation is sketched below.
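  • The following sketch shows one possible realisation of such a model within the ranges described above; the use of PyTorch, a GRU as the recursive layer type, the hidden width of 64 (about 3x the input) and the sigmoid output are all assumptions made for illustration:

```python
import torch
import torch.nn as nn

class BandGainModel(nn.Module):
    def __init__(self, n_bands=20, hidden=64):
        super().__init__()
        # the first layers are recursive: a 2-layer GRU carries temporal state
        self.gru = nn.GRU(n_bands, hidden, num_layers=2, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bands), nn.Sigmoid()  # one value per Mel bin
        )

    def forward(self, mel_frames, state=None):
        # mel_frames: (batch, time, 20); state preserves temporal information
        out, state = self.gru(mel_frames, state)
        return self.head(out), state

model = BandGainModel()
frame = torch.rand(1, 1, 20)    # one Mel-binned frame at a time, for real-time use
gains, state = model(frame)     # per-band outputs consumed by steps 400 and 600
```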
  • Figure 1E shows an example of the frequency bands that each neuron's output is applied to. For example, all values within the B1 frequency band envelope are multiplied by the first output neuron's value, all within B2 are multiplied by the second neuron's output, and so on. These envelopes may be used for down-sampling the signal from the 129 magnitude values that result from a 256-point STFT down to the 20 bins required by the neural network input.
  • At step 400, the frequency gains are applied to the frequency domain representation of the original signal via simple multiplication of each respective frequency band by its respective gain. The plot in Figure 1B corresponding to step 400 shows an example of STFT magnitude values multiplied by the AI model outputs received from step 300 (or 300a and 300b).
  • It will be appreciated that a gain may correspond to a range of frequencies. The frequency ranges are chosen at system implementation time depending on the requirements and usually match up with the down-sampling at step 200, as in the sketch below.
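  • A sketch of step 400, continuing the running example: the 20 per-band model outputs are spread back over the 129 STFT bins using Figure 1E-style envelopes and multiplied into the magnitudes. Reusing the transposed Mel filterbank as the envelope set is a choice made here, not something the patent mandates:

```python
gains_np = gains.detach().numpy().reshape(20)        # AI model outputs from step 300
envelopes = fb / np.maximum(fb.sum(axis=0), 1e-8)    # normalise overlapping envelopes
bin_gains = envelopes.T @ gains_np                   # 20 band gains -> 129 bin gains
filtered_magnitude = magnitude * bin_gains[:, None]  # step 400: per-band multiplication
```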
  • At step 500, the signal is reconstructed into the time domain, in this example using an inverse short-time Fourier transform (ISTFT).
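  • Continuing the sketch, step 500 recombines the gain-adjusted magnitudes with the unaltered phase from the second signal path and applies an ISTFT:

```python
from scipy.signal import istft

Z_filtered = filtered_magnitude * np.exp(1j * phase)      # magnitude with original phase
_, reconstructed = istft(Z_filtered, fs=fs, nperseg=256)  # time-domain output of step 500
```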
  • Accordingly, the AI model output is used to identify desired signal components in the output signal. At step 600, it is decided whether such activity has been detected with high confidence by measuring the magnitudes of the outputs (for example, if all outputs are close to or above 1.0, it is decided that the model detects useful information in all bands). At steps 700 and 800, a finite or infinite impulse response (FIR/IIR) filter is provided, the filter being dynamically updated to remove any remaining noise in the signal to enhance the overall output. The processed signal (output) is provided to an output system at step 900.
  • Figure 1C shows an example implementation of secondary filtering methods, spanning steps 600 to 900. It will be appreciated that there are several possible implementations of the processing path 600-900. In this example, the neural network outputs and a simple thresholding mechanism are used to determine whether each band contains useful signals (at step 600). In this example, at step 600 it is determined whether a frequency band has a confidence level of less than or equal to 0.3; if so, the frequency band is identified as one to be filtered out (an undesired frequency band). At step 700, a filter matching the desired frequency response is provided, using any desired method such as Wiener filter design, least-squares filter design, etc. For example, FIR filter parameters are calculated that match the frequency response required to filter out all of the identified undesired frequency bands, as in the sketch below.
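  • The following sketch illustrates one such implementation of steps 600-800, continuing the running example; the FIR length, the uniform placement of the 20 gain points from DC to Nyquist, and the use of SciPy's firwin2 window-method design (as a stand-in for the Wiener or least-squares designs mentioned) are assumptions:

```python
from scipy.signal import firwin2, lfilter

confidence = gains_np                          # per-band confidence values (step 600)
keep = np.where(confidence <= 0.3, 0.0, 1.0)   # 0 = filter out, 1 = keep

# step 700: FIR taps whose frequency response attenuates the undesired bands
taps = firwin2(numtaps=101, freq=np.linspace(0.0, 1.0, 20), gain=keep)

# step 800: apply the secondary filter to the reconstructed signal from step 500
secondary_output = lfilter(taps, [1.0], reconstructed)
```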
  • The constructed filter from step 700 is then applied at step 800 to the reconstructed signal from step 500 and the filtered signal is output at step 900.
  • The process 1000 of Figure 1A can advantageously identify bands in the frequency spectrum of the input signal 100 which contain useful audio information such as voice or any desired audio signal. The identified frequency bands which are deemed to be useful are then kept, whilst the frequency bands containing information that is deemed to be noise are discarded by attenuating their respective frequency bands in the frequency domain representation.
  • This approach is advantageous because it allows for live filtering of the input audio signal with very little knowledge of the exact spectral densities of the desired signal. The approach is also lightweight enough to work on embedded hardware such as microcontrollers or Field Programmable Gate Arrays (FPGAs). The process allows for attenuation of sporadic noises, while most common modern approaches can only reliably cancel out continuous noises.
  • The output of the neural network 300 can also be used as a reliable voice activity detector (VAD), which can then be used in conjunction with more traditional filtering, such as a Wiener filter, to enhance the quality of the processed sound.
  • Advantages of the filtering process 1000 of Figure 1A over the prior art include but are not limited to:
    • Explicit detection of frequency bands containing voice or desired signal.
    • Voice activity detection and/or desired signal detection.
    • Ability to employ any filtering method in tandem with the AI application.
    • Small network capable of running on mobile devices in real-time.
    • Ability to attenuate sporadic noises.
    • Separation of the desired signals from the source signal.
  • Figure 2 illustrates an audio input storage and data collection process 2000 for the device. A storage unit 21 is present on the device and may be in communication with a computer when the device is charging. While the device is charging, the encrypted audio data is uploaded to a server while software updates are downloaded to the device. The recorded audio data can be accessed for monitoring purposes and to improve and retrain the models.
  • For example, audio samples are recorded onto the device storage unit 21 during daily operation. When not in use, the device may be in communication with an on-site server 22 and upload the audio data to the server. The on-site server 22 is configured in this example to select the most distinct audio samples and upload them to an off-site server or cloud server 23. The off-site/cloud server 23 integrates the new data into a training dataset and may employ additional AI models to train with the new dataset 24. The retrained model is then dispatched back to the deployed devices (i.e. via devices 23, 22 and 21 in sequence).
  • Accordingly, mass storage facilitates the collection of data which can be uploaded to a server location and used to train and improve the algorithms.
  • Figure 3 illustrates an example embodiment of a system 3000 representing a hearing protection headset. In this example, the system comprises two microphones 1 for audio sampling, although it will be appreciated that the number of microphones may vary. Preferably, the system further comprises a sound pressure level measurement device (not shown). Speakers 2 are provided for replaying the processed audio in real-time. The number of speakers may vary.
  • The system further comprises a controller unit 3 comprising a controller (i.e. a processor) and other electronic peripherals required to operate the system, such as a memory storage unit. For example, the mass storage may take the form of a Micro SD card included within the device. An LED indicator 4 is also provided to indicate, for example, when the headset 3000 is gathering audio samples.

Claims (15)

  1. An audio signal detection system comprising a signal processing unit, the system comprising a receiver for receiving an audio signal input having a plurality of audio frequency bands;
    the signal processing unit comprising at least one signal processing module configured to perform the steps of:
    processing the audio signal input to provide a frequency domain converted signal input;
    providing a first signal path comprising the step of employing a neural network for running a machine-learned model to receive the converted signal input and provide a first output representing a primary filter;
    providing a second signal path to reconstruct the audio signal input; and
    applying the primary filter to the frequency-domain representation of the audio signal input to provide a signal output.
  2. A system according to claim 1, wherein the converted signal input comprises temporal information and the machine-learned model is configured to detect and process temporal information.
  3. A system according to claim 1 or claim 2, wherein employing a neural network comprises using at least one recursive neural layer.
  4. A system according to any one of the preceding claims, wherein processing the audio signal input to provide a frequency domain converted signal input comprises calculating, for each one of the plurality of audio frequency bands, respective magnitude and phase values of a plurality of short-time Fourier transforms, STFT, for each audio frequency band.
  5. A system according to any one of the preceding claims, wherein the step of processing the audio signal input to provide a frequency domain converted signal input further comprises sampling the converted signal input.
  6. A system according to any one of the preceding claims, wherein the first output comprises at least one frequency band identified by the machine-learned model and at least one corresponding attenuation value.
  7. A system according to claim 6, wherein the magnitude values of a plurality of short-time Fourier transforms are multiplied by the respective attenuation values.
  8. A system according to any one of the preceding claims, wherein reconstructing the audio signal input comprises making an inverse short-time Fourier transform calculation.
  9. A system according to any one of the preceding claims, wherein running the machine-learned model further provides a second output comprising a confidence value.
  10. A system according to claim 9, wherein the signal processing module is configured to use the confidence value to identify undesirable audio frequency bands and provide a plurality of filter parameters representing a secondary filter.
  11. A system according to any one of the preceding claims, wherein the step of applying the primary filter to the frequency-domain representation of the audio signal input comprises applying the secondary filter to the reconstructed audio signal input.
  12. A system according to any one of the preceding claims, wherein the signal processing unit comprises a filter bank module comprising a plurality of filters for receiving the output of the machine-learned model, wherein the at least one audio frequency band is identified based on the stacked output of a power of each one of the plurality of filters.
  13. An audio processing system for real-time noise control and selection for use on portable or wearable devices, the system comprising an audio signal detection system according to any one of the preceding claims.
  14. A noise protective device comprising an audio processing system according to any one of claims 1 to 12.
  15. An audio signal processing method comprising the steps of:
    providing a signal processing unit comprising at least one signal processing module and a receiver for receiving an audio signal input having a plurality of audio frequency bands;
    processing the audio signal input to provide a frequency domain converted signal input;
    providing a first signal path comprising the step of employing a neural network for running a machine-learned model to receive the converted signal input and provide a first output representing a primary filter;
    providing a second signal path to reconstruct the audio signal input; and
    applying the primary filter to the frequency-domain representation of the audio signal input to provide a signal output.
EP21190382.8A 2020-11-23 2021-08-09 Audio signal processing systems and methods Pending EP4002361A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/082520 WO2022106691A1 (en) 2020-11-23 2021-11-22 Audio signal processing systems and methods

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GBGB2018375.2A GB202018375D0 (en) 2020-11-23 2020-11-23 Audio signal processing systems and methods

Publications (1)

Publication Number Publication Date
EP4002361A1 true EP4002361A1 (en) 2022-05-25

Family

ID=74046902

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21190382.8A Pending EP4002361A1 (en) 2020-11-23 2021-08-09 Audio signal processing systems and methods

Country Status (3)

Country Link
EP (1) EP4002361A1 (en)
GB (1) GB202018375D0 (en)
WO (1) WO2022106691A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2151822A1 (en) * 2008-08-05 2010-02-10 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for processing and audio signal for speech enhancement using a feature extraction
US20190080710A1 (en) * 2017-09-12 2019-03-14 Board Of Trustees Of Michigan State University System and apparatus for real-time speech enhancement in noisy environments

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BORGSTROM BENGT J ET AL: "Improving Statistical Model-Based Speech Enhancement with Deep Neural Networks", 2018 16TH INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT (IWAENC), IEEE, 17 September 2018 (2018-09-17), pages 471 - 475, XP033439100, DOI: 10.1109/IWAENC.2018.8521382 *
JAMAL NOREZMI ET AL: "A Hybrid Approach for Single Channel Speech Enhancement using Deep Neural Network and Harmonic Regeneration Noise Reduction", IJACSA) INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, vol. 11, 1 January 2020 (2020-01-01), XP055880063, Retrieved from the Internet <URL:https://thesai.org/Downloads/Volume11No10/Paper_33-A_Hybrid_Approach_for_Single_Channel_Speech_Enhancement.pdf> *
SALEEM N ET AL: "Spectral Phase Estimation Based on Deep Neural Networks for Single Channel Speech Enhancement", JOURNAL OF COMMUNICATIONS TECHNOLOGY AND ELECTRONICS, NAUKA/INTERPERIODICA PUBLISHING, MOSCOW, RU, vol. 64, no. 12, 1 December 2019 (2019-12-01), pages 1372 - 1382, XP037029872, ISSN: 1064-2269, [retrieved on 20200221], DOI: 10.1134/S1064226919120155 *
VALIN JEAN-MARC: "A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement", 2018 IEEE 20TH INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP), 31 May 2018 (2018-05-31), pages 1 - 5, XP055783657, ISBN: 978-1-5386-6070-6, Retrieved from the Internet <URL:https://arxiv.org/pdf/1709.08243.pdf> DOI: 10.1109/MMSP.2018.8547084 *

Also Published As

Publication number Publication date
GB202018375D0 (en) 2021-01-06
WO2022106691A1 (en) 2022-05-27

Similar Documents

Publication Publication Date Title
CN104303227B (en) The apparatus and method for eliminating and perceiving noise by combining Active noise and compensate the perceived quality for improving sound reproduction
US9558755B1 (en) Noise suppression assisted automatic speech recognition
US20060206320A1 (en) Apparatus and method for noise reduction and speech enhancement with microphones and loudspeakers
US11736870B2 (en) Neural network-driven frequency translation
WO2017191249A1 (en) Speech enhancement and audio event detection for an environment with non-stationary noise
US10531178B2 (en) Annoyance noise suppression
US20170230765A1 (en) Monaural speech intelligibility predictor unit, a hearing aid and a binaural hearing system
US9886967B2 (en) Systems and methods for speech extraction
US11218796B2 (en) Annoyance noise suppression
CN106664473A (en) Information-processing device, information processing method, and program
EP3757993B1 (en) Pre-processing for automatic speech recognition
JP2020115206A (en) System and method
US10204637B2 (en) Noise reduction methodology for wearable devices employing multitude of sensors
KR20210149858A (en) Wind noise detection systems and methods
CN113241085A (en) Echo cancellation method, device, equipment and readable storage medium
WO2017045512A1 (en) Voice recognition method and apparatus, terminal, and voice recognition device
GB2526980A (en) Sensor input recognition
EP4002361A1 (en) Audio signal processing systems and methods
CN116312545B (en) Speech recognition system and method in a multi-noise environment
CN113132885B (en) Method for judging wearing state of earphone based on energy difference of double microphones
CN115293205A (en) Anomaly detection method, self-encoder model training method and electronic equipment
CN112118511A (en) Earphone noise reduction method and device, earphone and computer readable storage medium
CN113132880A (en) Impact noise suppression method and system based on dual-microphone architecture
CN202634674U (en) Denoising device under the state of listening to music via earphone
WO2023138252A1 (en) Audio signal processing method and apparatus, earphone device, and storage medium

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20220823

RBV Designated contracting states (corrected)

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20230922

17Q First examination report despatched

Effective date: 20230927