WO2022106691A1 - Audio signal processing systems and methods
- Publication number
- WO2022106691A1 (PCT/EP2021/082520)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- signal input
- audio
- audio signal
- output
- frequency
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
Definitions
- The AI model may receive as input a down-sampled frequency-domain input. The input may be down-sampled using known methods such as Mel-frequency binning to a fixed number of bins.
- The AI model outputs the attenuation value (i.e. how loud each frequency band should be) in the reconstructed signal.
- Step 300 is split into two steps, 300a and 300b.
- At step 300a, which represents a pre-processing step, the STFT magnitude is resampled into Mel space using 20 bins. The resulting down-sampled spectrum is shown in the plot of Figure 1B.
- At step 300b, the neural network acts to provide the AI model output.
- The phase is not processed by the AI model, but is used unaltered at step 500 (via the second signal path) to reconstruct the audio signal, e.g. using an inverse short-time Fourier transform (ISTFT).
- Figure 1D illustrates an example model used in step 300 to process the Mel-binned (resampled) STFT signal and produce the confidence values for 20 frequency bands. Each output corresponds to a frequency band on the STFT magnitude shown in Figure 1D.
- An input layer with 20 neurons, one neuron for each Mel-binned value or frequency band, is provided. The layer may be a recursive layer.
- Hidden layers, usually 2-3 layers deep, are provided with 3-4 times the width of the input. The first 1-2 layers are a type of recursive layer.
- The output layer also has 20 neurons, one neuron for each Mel-binned value or frequency band. The output of step 300 is then sent to steps 400 and 600, respectively.
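The layer sizes described above can be sketched as a plain forward pass. This is only a shape-level illustration with random, untrained weights: a real implementation would use recurrent first layers as stated above and load trained parameters, and all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    # Hypothetical randomly-initialised weights; a deployed model
    # would load trained parameters instead.
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

# 20 inputs (one per Mel bin) -> hidden layers at 4x width -> 20 outputs.
layers = [dense(20, 80), dense(80, 80), dense(80, 20)]

def forward(mel_bins):
    h = mel_bins
    for i, (w, b) in enumerate(layers):
        h = h @ w + b
        # ReLU on hidden layers; sigmoid on the output layer so each of
        # the 20 values is a per-band confidence/gain in (0, 1).
        h = np.maximum(h, 0) if i < len(layers) - 1 else 1 / (1 + np.exp(-h))
    return h

gains = forward(rng.standard_normal(20))
```

Each of the 20 outputs maps to one frequency band, matching the per-band attenuation values described above.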
- Figure 1E shows an example of the frequency bands that each neuron's output is applied to. For example, all values within the B1 frequency band envelope are multiplied by the first output neuron, all within B2 are multiplied by the second neuron's output, and so on. These envelopes may be used for down-sampling the signal from the 129 magnitude values that result from a 256-point STFT down to the 20 bins required by the neural network input.
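The band envelopes can be illustrated with an overlapping triangular filterbank that reduces the 129 STFT magnitude values to 20 bins. For simplicity this sketch spaces the bands linearly; true Mel-frequency binning would warp the band edges toward low frequencies, and the function name is hypothetical.

```python
import numpy as np

def band_envelopes(n_bins=129, n_bands=20):
    """Overlapping triangular envelopes B1..B20 over the 129 STFT bins.
    Linearly spaced here for simplicity; Mel spacing would warp the
    band edges toward low frequencies."""
    edges = np.linspace(0, n_bins - 1, n_bands + 2)
    env = np.zeros((n_bands, n_bins))
    k = np.arange(n_bins)
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        up = (k - lo) / (mid - lo)        # rising edge of the triangle
        down = (hi - k) / (hi - mid)      # falling edge of the triangle
        env[b] = np.clip(np.minimum(up, down), 0, 1)
    return env

env = band_envelopes()
mag = np.abs(np.random.default_rng(1).standard_normal(129))
mel = env @ mag   # 20 down-sampled values for the network input
```

The same envelopes can then be reused when applying the 20 network outputs back onto the 129 STFT bins.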
- At step 400, the frequency gains are applied to the frequency-domain representation of the original signal via simple multiplication of each respective frequency band by its respective gain.
- The plot in Figure 1B corresponding to step 400 shows an example of STFT magnitude values multiplied by the AI model outputs received from step 300 (or 300a and 300b).
- Each gain may correspond to a range of frequencies.
- The frequency ranges are chosen at system implementation time depending on the requirements and usually match up with the down-sampling at step 200.
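The multiplication at step 400 can be sketched as follows, with each of the 129 STFT bins assigned to one of 20 contiguous frequency ranges. The band layout here is an assumption made for illustration; in practice it should match the down-sampling envelopes used at step 200.

```python
import numpy as np

def apply_band_gains(mag, gains):
    """Multiply each STFT magnitude bin by the gain of the frequency
    band it falls in. Bands are contiguous ranges here; they should
    match the down-sampling used at step 200."""
    n_bins, n_bands = len(mag), len(gains)
    band_of_bin = np.minimum((np.arange(n_bins) * n_bands) // n_bins,
                             n_bands - 1)
    return mag * gains[band_of_bin]

mag = np.ones(129)
gains = np.zeros(20)
gains[0] = 1.0   # keep only the lowest frequency band
out = apply_band_gains(mag, gains)
```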
- At step 500, the signal is reconstructed into the time domain, in this example using an inverse short-time Fourier transform (ISTFT).
- The AI model output is used to identify desired signal components in the output signal.
- A finite or infinite impulse response (FIR/IIR) filter is provided, the filter being dynamically updated to remove any remaining noise in the signal to enhance the overall output.
- The processed signal (output) is provided to an output system at step 900.
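The reconstruction at step 500 can be sketched as an inverse STFT: each frame's magnitude is recombined with the unaltered phase, inverse-transformed, and overlap-added back into a time signal. This is a bare editorial sketch; a practical ISTFT would also apply a synthesis window satisfying the constant-overlap-add condition.

```python
import numpy as np

def istft(mag, phase, n_fft=256, hop=128):
    """Inverse STFT: recombine magnitude with the (unaltered) phase,
    inverse-FFT each frame, and overlap-add into a time signal."""
    spec = mag * np.exp(1j * phase)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros((len(frames) - 1) * hop + n_fft)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + n_fft] += f
    return out

# Round trip: forward rFFT per frame, then reconstruct.
rng = np.random.default_rng(2)
frames = rng.standard_normal((4, 256))
spec = np.fft.rfft(frames, axis=1)
y = istft(np.abs(spec), np.angle(spec))
```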
- Figure 1C shows an example implementation of secondary filtering methods, spanning steps 600 to 900. It will be appreciated that there are several possible implementations of the processing path 600-900.
- The neural network outputs and a simple thresholding mechanism are used to determine whether each band contains useful signals or not (at step 600).
- It is determined whether a frequency band has a confidence level of less than or equal to 0.3; if so, the frequency band is identified as one to be filtered out (an undesired frequency band).
- At step 700, a filter matching the desired frequency response is provided, using any desired method such as Wiener filter design, least-squares filter design, etc. For example, FIR filter parameters are calculated that match the frequency response required to filter out all of the frequency bands identified to be filtered out.
- The constructed filter from step 700 is then applied at step 800 to the reconstructed signal from step 500, and the filtered signal is output at step 900.
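One lightweight way to obtain FIR parameters matching such a response is the frequency-sampling method: set each bin's desired gain to 0 or 1 from the confidence threshold, inverse-FFT, then shift and window the result. This is an editorial sketch (the Wiener and least-squares designs mentioned above are alternatives); the function name and the 129-bin layout are assumptions, while the 0.3 threshold follows the example in the text.

```python
import numpy as np

def fir_from_band_mask(confidence, n_bins=129, threshold=0.3):
    """Frequency-sampling FIR design: bands whose confidence is at or
    below the threshold get desired gain 0, the rest gain 1; the taps
    are the shifted, windowed inverse real FFT of that response."""
    keep = (confidence > threshold).astype(float)
    band_of_bin = np.minimum(
        (np.arange(n_bins) * len(confidence)) // n_bins,
        len(confidence) - 1)
    desired = keep[band_of_bin]            # per-bin target response
    taps = np.fft.irfft(desired, n=2 * (n_bins - 1))
    # Centre the impulse response and window it to reduce ripple.
    taps = np.roll(taps, n_bins - 1) * np.hanning(2 * (n_bins - 1))
    return taps

conf = np.full(20, 0.9)
conf[10:] = 0.2           # top ten bands identified as undesired
taps = fir_from_band_mask(conf)
```

Applying `taps` with ordinary convolution then attenuates the undesired bands in the reconstructed time-domain signal.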
- The process 1000 of Figure 1A can advantageously identify bands in the frequency spectrum of the input signal 100 which contain useful audio information, such as voice or any desired audio signal.
- The identified frequency bands which are deemed to be useful are then kept, whilst the frequency bands containing information deemed to be noise are discarded by attenuating their respective frequency bands in the frequency-domain representation.
- This approach is advantageous because it allows for live filtering of the input audio signal with very little knowledge of the exact spectral densities of the desired signal.
- The approach is also lightweight enough to work on embedded hardware such as microcontrollers or Field-Programmable Gate Arrays (FPGAs).
- The output of the neural network 300 can also be used as a reliable voice activity detection (VAD) signal, which can then be used in conjunction with more traditional filtering, such as a Wiener filter, to enhance the quality of the processed sound.
- Figure 2 illustrates an audio input storage and data collection process 2000 for the device.
- A storage unit 21 is present on the device and may be in communication with a computer when the device is charging. While the device is charging, the encrypted audio data is uploaded to a server while software updates are downloaded to the device. The recorded audio data can be accessed for monitoring purposes and to improve and retrain the models.
- Audio samples are recorded onto the device storage unit 21 during daily operation.
- The device may be in communication with an on-site server 22 and upload the audio data to the server.
- The on-site server 22 is configured in this example to select the most distinct audio samples and upload them to an off-site server or cloud server 23.
- The off-site/cloud server 23 integrates the new data into a training dataset and may employ additional AI models to train with the new dataset 24.
- The retrained model is then dispatched back to the deployed devices (i.e. via devices 23, 22 and 21 in sequence).
- Mass storage facilitates the collection of data which can be uploaded to a server location and used to train and improve the algorithms.
- Figure 3 illustrates an example embodiment of a system 3000 representing a hearing protection headset.
- The system comprises two microphones 1 for audio sampling, although it will be appreciated that the number of microphones may vary.
- The system can further comprise a sound pressure level measurement device (not shown).
- Speakers 2 are provided for replaying the processed audio in real time. The number of speakers may vary.
- The system further comprises a controller unit 3 comprising a controller (i.e. a processor) and other electronic peripherals required to operate the system, such as a memory storage unit.
- The mass storage may take the form of a Micro SD card included within the device.
- An LED indicator 4 is also provided to indicate, for example, when the headset 3000 is gathering audio samples.
Abstract
An audio signal detection system comprising a signal processing unit, the system comprising a receiver for receiving an audio signal input having a plurality of audio frequency bands; the signal processing unit comprising at least one signal processing module configured to perform the steps of: processing the audio signal input to provide a frequency domain converted signal input; providing a first signal path comprising the step of employing a neural network for running a machine-learned model to receive the converted signal input and provide a first output representing a primary filter; providing a second signal path to reconstruct the audio signal input; and applying the primary filter to the frequency-domain representation of the audio signal input to provide a signal output.
Description
AUDIO SIGNAL PROCESSING SYSTEMS AND METHODS
Technical Field
Embodiments of the present invention generally relate to audio processing systems and methods, in particular in relation to hearing protection devices.
Background
According to the World Health Organisation, noise pollution is the second biggest environmental problem affecting health. Prolonged exposure to noise pollution can have detrimental effects on health, such as cardiovascular disease, cognitive impairment, tinnitus and hearing loss. Noise pollution is particularly evident in mining, manufacturing and construction industries.
Noise protective devices exist and include earplugs, earmuffs, radio-integrated headsets and noise-proof panels. However, existing devices offer limited communication and no selective control. For example, in the manufacturing industry, noise levels are particularly high and require the use of common ear protective devices, which may hinder someone's ability to hear a co-worker shouting or asking for help, or a safety alarm.
Aspects of the present invention aim to overcome problems with the existing devices.
Summary
According to a first, independent aspect of the invention, there is provided an audio signal detection system comprising a signal processing unit, the system comprising a receiver for receiving an audio signal input having a plurality of audio frequency bands; the signal processing unit comprising at least one signal processing module configured to perform the steps of: processing the audio signal input to provide a frequency domain converted signal input; providing a first signal path comprising the step of employing a neural network for running a machine-learned model to receive the converted signal input and provide a first output representing a primary filter; applying the primary filter to the frequency domain converted signal output; and providing a second signal path to reconstruct the audio signal input into time domain and to provide a signal output.
In a dependent aspect, the converted signal input comprises temporal information and the machine-learned model is configured to detect and process temporal information.
In a dependent aspect, employing a neural network comprises using at least one recursive neural layer.
In a dependent aspect, processing the audio signal input to provide a frequency domain converted signal input comprises calculating, for each one of the plurality of audio frequency bands, respective magnitude and phase values of a plurality of short-time Fourier transforms, STFT, for each audio frequency band.
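As a rough illustration of this step, the per-frame magnitude and phase values of a 256-point STFT can be computed as follows. This sketch is not the claimed implementation: the frame length, hop size, Hann window, the 16 kHz sampling rate and the function name `stft_frames` are all assumptions made for the example.

```python
import numpy as np

def stft_frames(x, n_fft=256, hop=128):
    """Split x into windowed frames and take the real FFT of each,
    giving magnitude and phase values for every frequency band."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)   # shape (n_frames, n_fft // 2 + 1)
    return np.abs(spec), np.angle(spec)

# A 256-point real FFT yields 129 frequency bins per frame.
fs = 16000
x = np.sin(2 * np.pi * 1000 * np.arange(4096) / fs)  # 1 kHz test tone
mag, phase = stft_frames(x)
```

The phase array is kept aside unaltered, consistent with the second signal path used for reconstruction.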
In a further dependent aspect, the step of processing the audio signal input to provide a frequency domain converted signal input further comprises sampling the converted signal input, for example using Mel-frequency binning. This represents a pre-processing step where the STFT magnitude values are resampled into Mel space.
In a dependent aspect, the first output comprises at least one frequency band identified by the machine-learned model and at least one attenuation value.
In a further dependent aspect, the magnitude values of a plurality of short-time Fourier transforms are multiplied by a value of the first output.
In a dependent aspect, reconstructing the audio signal input comprises applying an inverse short-time Fourier transform calculation.
In a dependent aspect, running a machine-learned model to receive the converted signal input further provides a second output comprising a confidence value.
In a further dependent aspect, the signal processing module is configured to use the confidence value to identify undesirable audio frequency bands and provide a plurality of filter parameters representing a secondary filter.
In a further dependent aspect, the step of applying the primary filter to the frequency-domain audio signal input comprises applying the secondary filter to the reconstructed audio signal input.
In a dependent aspect, the signal processing unit comprises a filter bank module comprising a plurality of filters for receiving the output of the machine-learned model, wherein the at least one audio frequency band is identified based on the stacked output of a power of each one of the plurality of filters.
In a dependent aspect, the system further comprises an output device configured to receive the audio signal output.
Advantageously, the system can recognise and identify sounds (e.g. machinery noise) received from two or more microphones for example. This represents a solution for detecting frequency bands of an input audio signal which contain particular audio information.
Accordingly, a smart selective noise control solution is enabled. In particular, when employed as part of a noise control or selective control device, the system allows users to select the sounds they want to hear or remove. This advantageously leads to a more enjoyable user experience as well as improved safety in the working environment.
For example, devices according to embodiments of the invention can guarantee that the user is never exposed to noise levels above the safety limits: they can automatically lower any sound higher than the prescribed level.
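One very simple way to realise this behaviour is a block-wise limiter that scales the signal down whenever its peak exceeds a prescribed safe level. This is an editorial sketch rather than the patented mechanism; the function name and the 0.25 full-scale threshold are illustrative assumptions.

```python
import numpy as np

def limit_to_safe_level(x, safe_peak=0.25):
    """Attenuate the whole block so that its peak never exceeds the
    prescribed safe level; quieter blocks pass through unchanged."""
    peak = np.max(np.abs(x))
    return x * (safe_peak / peak) if peak > safe_peak else x

loud = np.array([0.0, 0.5, -1.0, 0.8])   # peaks above the safe level
quiet = np.array([0.0, 0.1, -0.2])       # already below the safe level
limited = limit_to_safe_level(loud)
```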
Advantageously, the system may be implemented on computational devices including mobile devices such as mobile phones. It will be appreciated that the system may be deployed using any existing framework that already supports a processing unit such as an embedded/lightweight microcontroller, or by employing any suitable AI method that can be ported to the chosen device.
Advantageously, all input audio signals may be attenuated to a safe threshold when returned to the users of the system.
In a dependent aspect, there is provided an audio processing system for real-time noise control and selection for use on portable or wearable devices, the system comprising an audio signal detection system described above.
In a further dependent aspect, the audio processing system comprises a controller. The controller may be a processor of the type known in the art at hardware level.
In a further dependent aspect, the audio processing system comprises a user interface for receiving a user input. The input may include an audio signal threshold level. In preferred embodiments, the user input represents a selection to attenuate or/and amplify noise detected in the audio signal input. The user interface may be a hardware or a software interface which allows the user of the system to select the desired functionality of the system. Users can selectively attenuate or/and amplify any noise identified by the software.
In further dependent aspects, the audio processing system further comprises one or more of: at least one microphone, a sound pressure level measurement device, one or more speakers, and a data storage device. In preferred embodiments, mass storage facilitates the collection of data which can then be uploaded to a server location and used to train and improve the algorithms. The mass storage may take the form of a Micro SD peripheral (i.e. a memory storage card) included within the device.
In a dependent aspect, the audio processing system comprises at least one indicator device; for example, the indicator devices may comprise a plurality of light-emitting diodes. For example, a hardware version of the user interface including an LED indicator can indicate the functionality of the system currently enabled and/or the battery status.
In a dependent aspect, there is provided a noise protective device comprising an audio processing system as described above. The noise protective device may be integrated in a hearing protection or communication device e.g. headset or earphone etc.
According to a second, independent aspect of the invention, there is provided an audio signal processing method comprising the steps of: providing a signal processing unit, the unit comprising at least one signal processing module and a receiver for receiving an audio signal input having a plurality of audio frequency bands; processing the audio signal input to provide a frequency domain converted signal input; providing a first signal path comprising the step of employing a neural network for running a machine-learned model to receive the converted signal input and provide a first output representing a primary filter; providing a second signal path to reconstruct the audio signal input; and applying the primary filter to the frequency-domain representation of the audio signal input to provide a signal output.
Aspects of the present invention are now described with reference to the examples shown in the accompanying Figures.
Brief Description of the Drawings
Figure 1A is a block diagram of a signal filtering process;
Figure 1B shows an example of signal values corresponding to the process of Figure 1A;
Figure 1C is a block diagram of a secondary filtering method example;
Figure 1D is an example model used at step 300 of Figure 1A;
Figure 1 E shows an example of frequency bands used for applying neural network outputs;
Figure 2 illustrates an audio input storage and data collection process;
Figure 3 illustrates hardware components of a system according to a preferred embodiment.
Detailed Description
With reference to Figure 1A, an audio signal filtering process 1000 is described.
In a preferred embodiment, an input audio signal 100 representing a raw audio signal comprising a plurality of frequency bands is processed at step 200 using a series of short-time Fourier transforms (STFT) for each frequency band. For example, the frequency bands of the input audio signal 100 can be identified based on a stacked output of the power of each filter bank or the stacked magnitude of a series of STFTs for each frequency band.
The system is provided with the input audio signal 100 from a microphone 1, an example of which is shown in Figure 3. At step 200, the signal is converted to the frequency domain using an STFT and sent via two signal paths to processing steps 300 and 500.
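By way of a non-limiting illustration (an editor's sketch, not part of the original disclosure), the conversion at step 200 and the split into the two signal paths may look as follows; the 16 kHz sampling rate, the test tone, and the use of SciPy's `stft` are assumptions:

```python
import numpy as np
from scipy.signal import stft

# Sketch of step 200: convert the raw microphone signal to the frequency
# domain with a 256-point STFT and split it into the two paths described
# above (magnitude towards step 300, phase towards step 500).
fs = 16_000                                # assumed sampling rate
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)            # stand-in for input audio signal 100

f, frames, Z = stft(x, fs=fs, nperseg=256)
magnitude = np.abs(Z)                      # first signal path (to step 300)
phase = np.angle(Z)                        # second signal path (to step 500)
# A 256-point STFT yields 129 one-sided frequency bins per frame.
```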
Figure 1B shows an example of signal values corresponding to the process of Figure 1A. In this example, the input audio signal is sampled with 256 audio signal input values corresponding to 256 frequency bands. The STFT magnitude and phase values are also shown on the signal value plot corresponding to step 200 in Figure 1B.
Referring back to Figure 1A, at step 300 a neural network is employed to run an AI model for detecting frequency bands of the input audio signal which contain desired information. It will be appreciated that a number of AI models are suitable. In a preferred embodiment, the AI model has the capability to detect and maintain temporal information about a series of inputs, as long as these inputs can be provided and processed by the AI model fast enough to provide the output in real-time.
Accordingly, the AI model may receive as input a down-sampled frequency domain input. The input may be down-sampled using known methods such as Mel-frequency binning to a fixed number of bins. The AI model outputs, for each frequency band, an attenuation value indicating how loud that band should be in the reconstructed signal.
As shown in Figure 1B, in an example, step 300 is split into two steps, 300a and 300b. At step 300a, which represents a pre-processing step, the STFT magnitude is resampled into Mel-space using 20 bins. The resulting down-sampled spectrum after step 300a is shown in the plot of Figure 1B. At step 300b, the neural network acts to provide the AI model output. In this example the phase is not processed by the AI model, but is used unaltered at step 500 (via the second signal path) to reconstruct the audio signal, e.g. using an inverse short-time Fourier transform (ISTFT).
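The resampling into Mel-space at step 300a can be sketched as below; the triangular envelope shapes and the particular Mel formula are illustrative assumptions by the editor, not taken from the disclosure:

```python
import numpy as np

def mel_envelopes(n_fft_bins: int = 129, n_mels: int = 20, fs: int = 16_000):
    """Triangular band envelopes mapping 129 STFT bins onto 20 Mel bins.

    A simplified sketch of step 300a; exact envelope shapes are assumed.
    """
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)

    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    # Band edges equally spaced on the Mel scale, then mapped to FFT bins.
    edges_mel = np.linspace(0.0, hz_to_mel(fs / 2), n_mels + 2)
    edges_bin = np.floor(mel_to_hz(edges_mel) / (fs / 2) * (n_fft_bins - 1)).astype(int)

    env = np.zeros((n_mels, n_fft_bins))
    for m in range(n_mels):
        lo, mid, hi = edges_bin[m], edges_bin[m + 1], edges_bin[m + 2]
        if mid > lo:                       # rising edge of the triangle
            env[m, lo:mid] = np.linspace(0.0, 1.0, mid - lo, endpoint=False)
        env[m, mid:hi + 1] = np.linspace(1.0, 0.0, hi - mid + 1)  # falling edge
    return env

env = mel_envelopes()
magnitude = np.random.default_rng(0).random(129)  # stand-in STFT magnitude frame
mel_frame = env @ magnitude                       # 20-value input to step 300b
```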
Figure 1D illustrates an example model used in step 300 to process the Mel-binned (resampled) STFT signal and produce the confidence values for 20 frequency bands. Each output corresponds to a frequency band of the STFT magnitude shown in Figure 1D.
In this example, an input layer with 20 neurons, one neuron for each Mel-binned value or frequency band, is provided. For optimum performance, the layer may be a recursive layer. Hidden layers, usually 2-3 layers deep, are provided with 3-4 times the width of the input. Typically, the first 1-2 layers are a type of recursive layer. The output layer also has 20 neurons, one neuron for each Mel-binned value or frequency band. The output of step 300 is then sent to steps 400 and 600, respectively.
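The shape of such a model can be illustrated with a minimal forward pass. The plain Elman-style recurrence, the 64-unit hidden width, the random weights, and the softplus output activation below are editor's assumptions for illustration only; the disclosure specifies only the 20-in/20-out shape and recursive early layers:

```python
import numpy as np

# Editor's sketch of the Figure 1D model: 20 inputs (one per Mel bin), a
# recurrent hidden layer roughly 3x the input width, and 20 outputs.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 20, 64, 20

W_xh = rng.standard_normal((n_hidden, n_in)) * 0.1
W_hh = rng.standard_normal((n_hidden, n_hidden)) * 0.1
W_hy = rng.standard_normal((n_out, n_hidden)) * 0.1
h = np.zeros(n_hidden)                    # state carrying temporal information

def step(x, h):
    """One time step: update the hidden state, emit 20 per-band gains."""
    h = np.tanh(W_xh @ x + W_hh @ h)
    gains = np.log1p(np.exp(W_hy @ h))    # softplus: gains >= 0, may exceed 1.0
    return gains, h

for _ in range(5):                        # a short series of Mel-binned frames
    gains, h = step(rng.random(n_in), h)
```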
Figure 1E shows an example of the frequency bands that each neuron's output is applied to. For example, all values within the B1 frequency band envelope are multiplied by the first output neuron, all within B2 are multiplied by the second neuron's output, and so on. These envelopes may also be used for down-sampling the signal from the 129 magnitude values that result from a 256-point STFT down to the 20 bins required by the neural network input.
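The per-band multiplication described here (and applied at step 400) can be sketched as follows; the non-overlapping rectangular envelopes are an editor's simplification of the envelopes B1, B2, ... in Figure 1E, which may overlap:

```python
import numpy as np

# Editor's sketch: each of the 20 model outputs scales every STFT magnitude
# value inside its band envelope.
n_bins, n_bands = 129, 20                 # 129 magnitudes from a 256-point STFT
edges = np.linspace(0, n_bins, n_bands + 1).astype(int)  # band boundaries

magnitude = np.ones(n_bins)               # stand-in STFT magnitude frame
gains = np.full(n_bands, 0.5)             # stand-in neural network outputs

scaled = magnitude.copy()
for b in range(n_bands):                  # all values within band b x output b
    scaled[edges[b]:edges[b + 1]] *= gains[b]
```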
At step 400, the frequency gains are applied to the frequency domain representation of the original signal via simple multiplication of each respective frequency band by its respective gain. The plot in Figure 1B corresponding to step 400 shows an example of STFT magnitude values multiplied by the AI model outputs received from step 300 (or 300a and 300b).
It will be appreciated that each gain may correspond to a range of frequencies. The frequency ranges are chosen at system implementation time depending on the requirements and usually match up with the down-sampling at step 200.
At step 500, the signal is reconstructed into the time domain, in this example using an inverse short-time Fourier transform (ISTFT).
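The reconstruction at step 500, recombining the (gain-scaled) magnitude with the unaltered phase from the second signal path, can be sketched as below; SciPy's `istft` and unit gains are assumed so that the round trip recovers the input, checking the reconstruction path in isolation:

```python
import numpy as np
from scipy.signal import stft, istft

# Editor's sketch of step 500: invert magnitude * exp(j * phase) with an ISTFT.
fs = 16_000                                # assumed sampling rate
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)

_, _, Z = stft(x, fs=fs, nperseg=256)
magnitude, phase = np.abs(Z), np.angle(Z)

gains = np.ones_like(magnitude)            # identity filter for this check
Z_out = gains * magnitude * np.exp(1j * phase)
_, x_rec = istft(Z_out, fs=fs, nperseg=256)
# With matching STFT/ISTFT parameters the input is recovered (up to padding).
```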
Accordingly, the AI model output is used to identify desired signal components in the output signal. At step 600, it is decided whether such activity has been detected with high confidence by measuring the magnitudes of the outputs (for example, if all outputs are close to or above 1.0, it is decided that the model detects useful information in all bands). At steps 700 and 800, a finite or infinite impulse response (FIR/IIR) filter is provided, the filter being dynamically updated to remove any remaining noise in the signal to enhance the overall output. The processed signal (output) is provided to an output system at step 900.
Figure 1C shows an example implementation of secondary filtering methods, spanning steps 600 to 900. It will be appreciated that there are several possible implementations of the processing path 600-900. In this example, the neural network outputs and a simple thresholding mechanism are used to determine whether each band contains useful signals (at step 600). In this example, at step 600 it is determined whether a frequency band has a confidence level of less than or equal to 0.3; if so, the frequency band is identified as one to be filtered out (an undesired frequency band). At step 700, a filter matching the desired frequency response is provided, using any desired method such as Wiener filter design, least-squares filter design, etc. For example, FIR filter parameters are calculated that match the frequency response required to filter out all of the frequency bands identified to be filtered out.
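Steps 600-700 can be sketched with the 0.3 confidence threshold and a frequency-sampling FIR design; the normalized band grid, the stand-in confidence values, and the 65-tap length are editor's assumptions for illustration:

```python
import numpy as np
from scipy.signal import firwin2

# Editor's sketch of steps 600-700: threshold the per-band confidences at 0.3
# (bands at or below it are filtered out) and design an FIR filter whose
# frequency response suppresses those bands.
n_bands = 20
conf = np.linspace(0.0, 1.0, n_bands)      # stand-in confidences from step 300
keep = conf > 0.3                          # step 600: <= 0.3 -> undesired band

# Desired response sampled on a normalized grid (0.0 = DC, 1.0 = Nyquist).
freq = np.linspace(0.0, 1.0, n_bands)
gain = np.where(keep, 1.0, 0.0)

taps = firwin2(65, freq, gain)             # step 700: FIR matching the response
```

At step 800 this filter could then be applied to the reconstructed signal, e.g. with `scipy.signal.lfilter(taps, 1.0, x_rec)`.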
The constructed filter from step 700 is then applied at step 800 to the reconstructed signal from step 500 and the filtered signal is output at step 900.
The process 1000 of Figure 1A can advantageously identify bands in the frequency spectrum of the input signal 100 which contain useful audio information such as voice or any desired audio signal. The identified frequency bands which are deemed to be useful are then kept, whilst the frequency bands containing information that is deemed to be noise are discarded by attenuating their respective frequency bands in the frequency domain representation.
This approach is advantageous because it allows for live filtering of the input audio signal with very little knowledge of the exact spectral densities of the desired signal. The approach is also lightweight enough to work on embedded hardware such as microcontrollers or Field Programmable Gate Arrays (FPGAs). The process allows for attenuation of sporadic noises, while most common modern approaches can only reliably cancel out continuous noises.
The output of the neural network 300 can also be used as a reliable voice activity detector (VAD), which can then be used in conjunction with more traditional filtering, such as a Wiener filter, to enhance the quality of the processed sound.
Advantages of the filtering process 1000 of Figure 1 A over the prior art include but are not limited to:
• Explicit detection of frequency bands containing voice or desired signal.
• Voice activity detection and/or desired signal detection.
• Ability to employ any filtering method in tandem with the Al application.
• Small network capable of running on mobile devices in real-time.
• Ability to attenuate sporadic noises.
• Separation of the desired signals from the source signal.
Figure 2 illustrates an audio input storage and data collection process 2000 for the device. A storage unit 21 is present on the device and may be in communication with a computer when the device is charging. While the device is charging, the encrypted audio data is uploaded to a server while software updates are downloaded to the device. The recorded audio data can be accessed for monitoring purposes and to improve and retrain the models.
For example, audio samples are recorded onto the device storage unit 21 during daily operation. When not in use, the device may be in communication with an on-site server 22 and upload the audio data to the server. The on-site server 22 is configured in this example to select the most distinct audio samples and upload them to an off-site server or cloud server 23. The off-site/cloud server 23 integrates the new data into a training dataset and may employ additional AI models to train with the new dataset 24. The retrained model is then dispatched back to the deployed devices (i.e. via devices 23, 22, and 21 in sequence).
Accordingly, mass storage facilitates the collection of data which can be uploaded to a server location and used to train and improve the algorithms.
Figure 3 illustrates an example embodiment of a system 3000 representing a hearing protection headset. In this example, the system comprises two microphones 1 for audio sampling, although it will be appreciated that the number of microphones may vary. Preferably, the system further comprises a sound pressure level measurement device (not shown). Speakers 2 are provided for replaying the processed audio in real-time. The number of speakers may vary.
The system further comprises a controller unit 3 comprising a controller (i.e. a processor) and other electronic peripherals required to operate the system, such as a memory storage unit. For example, the mass storage may take the form of a Micro SD card included within the device. An LED indicator 4 is also provided to indicate, for example, when the headset 3000 is gathering audio samples.
Claims
1. An audio signal detection system comprising a signal processing unit, the system comprising a receiver for receiving an audio signal input having a plurality of audio frequency bands; the signal processing unit comprising at least one signal processing module configured to perform the steps of: processing the audio signal input to provide a frequency domain converted signal input; providing a first signal path comprising the step of employing a neural network for running a machine-learned model to receive the converted signal input and provide a first output representing a primary filter; providing a second signal path to reconstruct the audio signal input; and applying the primary filter to the frequency-domain representation of the audio signal input to provide a signal output.
2. A system according to claim 1, wherein the converted signal input comprises temporal information and the machine-learned model is configured to detect and process temporal information.
3. A system according to claim 1 or claim 2, wherein employing a neural network comprises using at least one recursive neural layer.
4. A system according to any one of the preceding claims, wherein processing the audio signal input to provide a frequency domain converted signal input comprises calculating, for each one of the plurality of audio frequency bands, respective magnitude and phase values of a plurality of short-time Fourier transforms, STFT, for each audio frequency band.
5. A system according to any one of the preceding claims, wherein the step of processing the audio signal input to provide a frequency domain converted signal input further comprises sampling the converted signal input.
6. A system according to any one of the preceding claims, wherein the first output comprises at least one frequency band identified by the machine-learned model and at least one corresponding attenuation value.
7. A system according to claim 6, wherein the magnitude values of a plurality of short-time Fourier transforms are multiplied by the respective attenuation values.
8. A system according to any one of the preceding claims, wherein reconstructing the audio signal input comprises making an inverse short-time Fourier transform calculation.
9. A system according to any one of the preceding claims, wherein running the machine-learned model further provides a second output comprising a confidence value.
10. A system according to claim 9, wherein the signal processing module is configured to use the confidence value to identify undesirable audio frequency bands and provide a plurality of filter parameters representing a secondary filter.
11. A system according to any one of the preceding claims, wherein the step of applying the primary filter to the frequency-domain representation of the audio signal input comprises applying the secondary filter to the reconstructed audio signal input.
12. A system according to any one of the preceding claims, wherein the signal processing unit comprises a filter bank module comprising a plurality of filters for receiving the output of the machine-learned model, wherein the at least one audio frequency band is identified based on the stacked output of a power of each one of the plurality of filters.
13. An audio processing system for real-time noise control and selection for use on portable or wearable devices, the system comprising an audio signal detection system according to any one of the preceding claims.
14. A noise protective device comprising an audio processing system according to any one of claims 1 to 12.
15. An audio signal processing method comprising the steps of: providing a signal processing unit comprising at least one signal processing module and a receiver for receiving an audio signal input having a plurality of audio frequency bands;
processing the audio signal input to provide a frequency domain converted signal input; providing a first signal path comprising the step of employing a neural network for running a machine-learned model to receive the converted signal input and provide a first output representing a primary filter; providing a second signal path to reconstruct the audio signal input; and applying the primary filter to the frequency-domain representation of the audio signal input to provide a signal output.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2018375.2 | 2020-11-23 | ||
GBGB2018375.2A GB202018375D0 (en) | 2020-11-23 | 2020-11-23 | Audio signal processing systems and methods |
EP21190382.8 | 2021-08-09 | ||
EP21190382.8A EP4002361A1 (en) | 2020-11-23 | 2021-08-09 | Audio signal processing systems and methods |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022106691A1 (en) | 2022-05-27 |
Family
ID=74046902
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2021/082520 WO2022106691A1 (en) | 2020-11-23 | 2021-11-22 | Audio signal processing systems and methods |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP4002361A1 (en) |
GB (1) | GB202018375D0 (en) |
WO (1) | WO2022106691A1 (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2151822A1 (en) * | 2008-08-05 | 2010-02-10 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus and method for processing and audio signal for speech enhancement using a feature extraction |
US20190080710A1 (en) * | 2017-09-12 | 2019-03-14 | Board Of Trustees Of Michigan State University | System and apparatus for real-time speech enhancement in noisy environments |
2020
- 2020-11-23: GB application GBGB2018375.2A filed (GB202018375D0) — not active, Ceased
2021
- 2021-08-09: EP application EP21190382.8A filed (EP4002361A1) — active, Pending
- 2021-11-22: PCT application PCT/EP2021/082520 filed (WO2022106691A1) — active, Application Filing
Non-Patent Citations (4)
Title |
---|
BORGSTROM BENGT J ET AL: "Improving Statistical Model-Based Speech Enhancement with Deep Neural Networks", 2018 16TH INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT (IWAENC), IEEE, 17 September 2018 (2018-09-17), pages 471 - 475, XP033439100, DOI: 10.1109/IWAENC.2018.8521382 * |
JAMAL NOREZMI ET AL: "A Hybrid Approach for Single Channel Speech Enhancement using Deep Neural Network and Harmonic Regeneration Noise Reduction", vol. 11, 1 January 2020 (2020-01-01), XP055880063, Retrieved from the Internet <URL:https://thesai.org/Downloads/Volume11No10/Paper_33-A_Hybrid_Approach_for_Single_Channel_Speech_Enhancement.pdf> * |
SALEEM N ET AL: "Spectral Phase Estimation Based on Deep Neural Networks for Single Channel Speech Enhancement", JOURNAL OF COMMUNICATIONS TECHNOLOGY AND ELECTRONICS, NAUKA/INTERPERIODICA PUBLISHING, MOSCOW, RU, vol. 64, no. 12, 1 December 2019 (2019-12-01), pages 1372 - 1382, XP037029872, ISSN: 1064-2269, [retrieved on 20200221], DOI: 10.1134/S1064226919120155 * |
VALIN JEAN-MARC: "A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement", 31 May 2018 (2018-05-31), pages 1 - 5, XP055783657, ISBN: 978-1-5386-6070-6, Retrieved from the Internet <URL:https://arxiv.org/pdf/1709.08243.pdf> DOI: 10.1109/MMSP.2018.8547084 * |
Also Published As
Publication number | Publication date |
---|---|
GB202018375D0 (en) | 2021-01-06 |
EP4002361A1 (en) | 2022-05-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21810381 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
122 | Ep: pct application non-entry in european phase |
Ref document number: 21810381 Country of ref document: EP Kind code of ref document: A1 |