WO2022107393A1 - Neural-network-based approach for speech denoising - Google Patents

Neural-network-based approach for speech denoising

Info

Publication number
WO2022107393A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
signal representation
noise
speech
Prior art date
Application number
PCT/JP2021/027243
Other languages
English (en)
Inventor
Changxi Zheng
Ruilin Xu
Rundi WU
Carl Vondrick
Yuko ISHIWAKA
Original Assignee
The Trustees Of Columbia University In The City Of New York
Softbank Corp.
Priority date
Filing date
Publication date
Application filed by The Trustees Of Columbia University In The City Of New York, Softbank Corp. filed Critical The Trustees Of Columbia University In The City Of New York
Priority to JP2023530195A priority Critical patent/JP2023552090A/ja
Publication of WO2022107393A1 publication Critical patent/WO2022107393A1/fr
Priority to US18/320,206 priority patent/US11894012B2/en

Classifications

    • G PHYSICS
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
            • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
              • G10L21/0208 Noise filtering
                • G10L21/0216 Noise filtering characterised by the method used for estimating noise
                  • G10L2021/02168 Noise filtering characterised by the method used for estimating noise, the estimation exclusively taking place during speech pauses
                  • G10L21/0232 Processing in the frequency domain
              • G10L21/0272 Voice signal separating
                • G10L21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
          • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
              • G10L25/18 Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
            • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
              • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
            • G10L25/78 Detection of presence or absence of voice signals
              • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise

Definitions

  • Audio recordings of human speech are often contaminated with noise from various sources. Some noise in recordings may be stationary, while other noise may fluctuate in frequency and amplitude throughout the recording. This latter noise, called nonstationary noise, is difficult to remove from audio recordings.
  • FIG. 1 A network structure.
  • FIG. 2 Silent intervals over time.
  • FIG. 3 Example of intermediate and final results
  • FIG. 4 Noise gallery
  • FIG. 5 Quantitative comparisons
  • FIG. 6 Denoise quality w.r.t input SNRs
  • FIG. 7 Constructed noisy audio based on different SNR levels
  • FIG. 8 Denoise quality under different input SNRs
  • FIG. 9 An example of Silent Interval Detection
  • The implementations described herein are based on a deep-neural-network speech denoising approach that tightly integrates silent intervals, and thereby overcomes many of the limitations of classical approaches.
  • The goal is not just to identify a single silent interval, but to find as many silent intervals as possible over time.
  • silent intervals in speech appear in abundance: psycholinguistic studies have shown that there is almost always a pause after each sentence and even after each word in speech. Each pause, however short, provides a silent interval revealing noise characteristics local in time.
  • These silent intervals assemble a time-varying picture of background noise, allowing a neural network to better denoise speech signals, even in the presence of nonstationary noise.
  • the technology described herein uses a neural network architecture based on long short-term memory (LSTM) structures to reliably denoise vocal recordings (other learning machine architectures / structures may also be used).
  • the LSTM is trained on noise obtained from intermittent gaps in speech called silent intervals, which it automatically identifies in the recording.
  • the silent intervals contain a combination of stationary and nonstationary noise, and thus the spectral distributions of noise during these silent intervals can be used in denoising.
  • The LSTM is capable of removing the stationary and nonstationary noise spectra from the vocal intervals to provide a robustly denoised, high-quality speech recording. This technology is also applicable to audio recording, filmmaking, and speech-to-text applications.
  • a network structure that includes three major components (illustrated in FIG. 1): i) a component dedicated to silent interval detection, ii) another component to estimate the full noise from those revealed in silent intervals, akin to an inpainting process in computer vision, and iii) another component to clean up the input signal.
  • the silent interval detection component is configured to detect silent intervals in the input signal.
  • the input to this component is the spectrogram of the input (noisy) signal x.
  • the spectrogram S x is first encoded by a 2D convolutional encoder into a 2D feature map, which, in turn, is processed by a bidirectional LSTM followed by two fully-connected (FC) layers.
  • the bidirectional LSTM is suitable for processing time-series features resulting from the spectrogram, and the FC layers are applied to the features of each time sample to accommodate variable length input.
  • the output from this network component is a vector D(S x ).
  • Each element of D(S_x) is a scalar in [0,1] (after applying the sigmoid function), indicating a confidence score of a small time segment being silent. In some examples, each time segment has a duration of 1/30 second, which is small enough to capture short speech pauses and large enough to allow robust prediction.
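  • For illustration only, the detection component just described could be sketched with Keras building blocks (a library mentioned later in this document); the convolutional filter counts, strides, and input dimensions below are assumptions, while the LSTM hidden size of 100 and the two FC layer sizes follow the architecture details given later in this document:

```python
# Hedged sketch of a silent-interval detector: 2D conv encoder -> bidirectional
# LSTM -> two FC layers -> per-segment sigmoid confidence. Filter counts, strides,
# and input dimensions are illustrative assumptions, not the patented configuration.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_silent_interval_detector(time_frames=178, freq_bins=256):
    spec = layers.Input(shape=(time_frames, freq_bins, 1), name="S_x")  # spectrogram of noisy input x

    # 2D convolutional encoder producing a 2D feature map; stride only along
    # frequency so the temporal resolution of the confidence scores is preserved.
    h = spec
    for filters in (32, 64, 64):
        h = layers.Conv2D(filters, kernel_size=3, strides=(1, 2),
                          padding="same", activation="relu")(h)

    # Flatten the frequency/channel axes so each time step carries one feature vector.
    h = layers.Reshape((time_frames, -1))(h)

    # Bidirectional LSTM over time, then two FC layers applied per time step,
    # ending in a sigmoid so each segment gets a confidence score in [0, 1].
    h = layers.Bidirectional(layers.LSTM(100, return_sequences=True))(h)
    h = layers.TimeDistributed(layers.Dense(100, activation="relu"))(h)
    d = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(h)  # D(S_x)

    return models.Model(spec, d, name="silent_interval_detector")
```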
  • The output vector D(S_x) is then expanded to a longer mask, denoted m(x). Each element of this mask indicates the confidence of classifying each sample of the input signal x as pure noise. With this mask, the noise exposed by silent intervals is estimated by an element-wise product, namely x ⊙ m(x).
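  • As a minimal sketch (the sampling rate, binarization threshold, and nearest-segment upsampling are assumptions), expanding the per-segment confidences to a sample-level mask and exposing the noise could look like:

```python
import numpy as np

def expose_noise(x, d, sr=16000, seg_dur=1/30, threshold=0.5):
    """Expand per-segment silence confidences d = D(S_x) into a sample-level
    mask m(x) and estimate the exposed noise as the element-wise product x * m(x)."""
    seg_len = int(round(sr * seg_dur))                  # samples per 1/30-second segment
    m = np.repeat(np.asarray(d) > threshold, seg_len)   # binary mask, one value per sample
    m = np.pad(m, (0, max(0, len(x) - len(m))))[:len(x)]
    return x * m                                        # noise exposed in the silent intervals
```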
  • Noise estimation component / module: the result of silent interval detection is a noise profile exposed only through a series of time windows, but not a complete picture of the noise.
  • Since the input signal is a superposition of the clean speech signal and noise, having a complete noise profile would ease the denoising process, especially in the presence of nonstationary noise. Therefore, the entire noise profile over time is estimated, which is achieved, in some implementations, using a neural network.
  • Inputs to this component include both the noisy audio signal representation x and the noise exposed by the silent intervals, x ⊙ m(x). Both are converted by STFT into spectrograms, which serve as the two inputs to the noise estimation network.
  • the spectrograms can be thought of as 2D images.
  • the goal here is conceptually akin to the image inpainting task in computer vision.
  • The feature maps are then concatenated in a channel-wise manner and further decoded by a convolutional decoder to estimate the full noise spectrogram.
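  • A rough sketch of such a noise estimation network is shown below; the text fixes only the two-encoder, channel-wise-concatenation, convolutional-decoder structure, so the filter counts, kernel sizes, and strides here are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_noise_estimator(time_frames=178, freq_bins=256):
    s_x = layers.Input(shape=(time_frames, freq_bins, 1), name="S_x")              # noisy spectrogram
    s_exposed = layers.Input(shape=(time_frames, freq_bins, 1), name="S_exposed")  # noise exposed in silent intervals

    def encode(t):
        # Each input gets its own encoder (same architecture, separate weights).
        for filters in (32, 64, 64):
            t = layers.Conv2D(filters, 3, strides=(1, 2), padding="same", activation="relu")(t)
        return t

    # Channel-wise concatenation of the two encoded feature maps.
    h = layers.Concatenate(axis=-1)([encode(s_x), encode(s_exposed)])

    # Convolutional decoder "inpaints" the full noise spectrogram.
    for filters in (64, 32):
        h = layers.Conv2DTranspose(filters, 3, strides=(1, 2), padding="same", activation="relu")(h)
    full_noise = layers.Conv2DTranspose(1, 3, strides=(1, 2), padding="same")(h)

    return models.Model([s_x, s_exposed], full_noise, name="noise_estimator")
```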
  • a neural network R receives as input both the input audio spectrogram S x and the estimated full noise spectrogram
  • the two input spectrograms are processed individually by their own 2D convolutional encoders.
  • the two encoded feature maps are then concatenated together before passing to a bidirectional LSTM, followed by three fully connected layers.
  • the output of this component is a vector with two channels which form the real and imaginary parts of a complex ratio mask in frequency-time domain.
  • the mask c has the same (temporal and frequency) dimensions as S x .
  • The denoised spectrogram is computed through element-wise multiplication of the input audio spectrogram S_x and the mask c.
  • The cleaned-up audio signal representation is obtained by applying the inverse STFT (ISTFT) to the denoised spectrogram.
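  • The last two steps can be illustrated with SciPy; the STFT parameters below are assumptions chosen only for the sketch, while the complex-mask multiplication and the inverse STFT follow the description above:

```python
import numpy as np
from scipy.signal import stft, istft

def apply_complex_ratio_mask(x, c_real, c_imag, sr=16000, nperseg=510, noverlap=330):
    """Apply a predicted complex ratio mask c to the STFT of the noisy signal x
    and recover the denoised waveform with the inverse STFT (ISTFT).
    c_real and c_imag must have the same time-frequency shape as the spectrogram."""
    _, _, S_x = stft(x, fs=sr, nperseg=nperseg, noverlap=noverlap)   # complex spectrogram S_x
    c = c_real + 1j * c_imag                 # two output channels -> one complex mask
    S_denoised = S_x * c                     # element-wise multiplication in the time-frequency domain
    _, x_clean = istft(S_denoised, fs=sr, nperseg=nperseg, noverlap=noverlap)
    return x_clean
```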
  • the network can be trained in an end-to-end fashion with a stochastic gradient descent approach.
  • The following loss function is optimized: L_0 = ‖N − N*‖ + ‖S_x ⊙ c − S*‖, where the notations are as defined above, N is the estimated full noise spectrogram, c is the predicted mask, and S* and N* denote the spectrograms of the ground-truth foreground signal and background noise, respectively.
  • the first term penalizes the discrepancy between estimated noise and the ground-truth noise, while the second term accounts for the estimation of foreground signal.
  • the end-to-end training process has no supervision on silent interval detection: the loss function only accounts for the recoveries of noise and clean speech signal.
  • the ability of detecting silent intervals automatically emerges as the output of the first network component. In other words, the network automatically learns to detect silent intervals for speech denoising without this supervision.
  • Although the model learns to detect silent intervals on its own, silent interval detection can also be directly supervised to further improve the denoising quality.
  • A term can be added to the above loss function that penalizes the discrepancy between detected silent intervals and their ground truth. Experiments showed that this form of supervision is not effective, so the model is instead trained in two sequential steps.
  • First, the silent interval detection component is trained using the following loss function: L_1 = l_BCE(m(x), m*(x)), where l_BCE is the binary cross-entropy loss, m(x) is the mask resulting from the silent interval detection component, and m*(x) is the ground-truth label of each signal sample being silent or not.
  • the noise estimation and removal components are trained through the loss function L 0 .
  • This training step starts by neglecting the silent interval detection component.
  • In the loss function L_0, instead of using the spectrogram of the noise exposed by the estimated silent intervals, the spectrogram of the noise exposed by the ground-truth silent intervals is used.
  • the network components are fine-tuned by incorporating the already trained silent interval detection component. With the silent interval detection component fixed, this fine-tuning step optimizes the original loss function L 0 and thereby updates the weights of the noise estimation and removal components.
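  • Putting the training recipe together, a hedged TensorFlow sketch of the two sequential steps might look as follows; the optimizer, learning rate, loss norms, and dataset layout are assumptions, while the ordering of the steps (BCE supervision of the detector first, then the L_0 terms for noise estimation and removal) follows the text:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def spectrogram_loss(est, target):
    # Generic spectrogram discrepancy; the exact norm used is an assumption.
    return tf.reduce_mean(tf.abs(est - target))

def train_two_steps(detector, noise_estimator, remover, dataset, epochs=1):
    """dataset yields (S_x, S_noise_gt_exposed, S_noise_gt, S_speech_gt, silence_labels)."""
    opt = tf.keras.optimizers.Adam(1e-4)

    # Step 1: supervise silent interval detection with binary cross-entropy.
    for _ in range(epochs):
        for S_x, _, _, _, labels in dataset:
            with tf.GradientTape() as tape:
                loss = bce(labels, detector(S_x, training=True))
            grads = tape.gradient(loss, detector.trainable_variables)
            opt.apply_gradients(zip(grads, detector.trainable_variables))

    # Step 2: train noise estimation + removal using noise exposed by the
    # ground-truth silent intervals; fine-tuning with the trained detector
    # fixed (as described above) is not shown here.
    for _ in range(epochs):
        for S_x, S_exposed_gt, S_noise_gt, S_speech_gt, _ in dataset:
            with tf.GradientTape() as tape:
                S_noise_est = noise_estimator([S_x, S_exposed_gt], training=True)
                mask = remover([S_x, S_noise_est], training=True)
                loss = (spectrogram_loss(S_noise_est, S_noise_gt)
                        + spectrogram_loss(S_x * mask, S_speech_gt))   # the two L_0 terms
            variables = (noise_estimator.trainable_variables
                         + remover.trainable_variables)
            grads = tape.gradient(loss, variables)
            opt.apply_gradients(zip(grads, variables))
```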
  • a system includes a receiver unit (e.g., a microphone, a communication module to receive electronic signal representations of audio / sound, etc.) to receive an audio signal representation, and a controller (e.g., a programmable device), implementing one or more learning engines, in communication with the receiver unit and a memory device to store programmable instructions, to detect in the received audio signal representation, using a first learning model, one or more silent intervals with reduced foreground sound levels, determine based on the detected one or more silent intervals an estimated full noise profile corresponding to the audio signal representation, and generate with a second learning model, based on the received audio signal representation and on the determined estimated full noise profile, a resultant audio signal representation with a reduced noise level.
  • a non-transitory computer readable media that stores a set of instructions, executable on at least one programmable device, to receive an audio signal representation, detect in the received audio signal representation, using a first learning model, one or more silent intervals with reduced foreground sound levels, determine based on the detected one or more silent intervals an estimated full noise profile corresponding to the audio signal representation, and generate with a second learning model, based on the received audio signal representation and on the determined estimated full noise profile, a resultant audio signal representation with a reduced noise level.
  • a method includes receiving an audio signal representation, detecting in the received audio signal representation, using a first learning model, one or more silent intervals with reduced foreground sound levels, determining based on the detected one or more silent intervals an estimated full noise profile corresponding to the audio signal representation, and generating with a second learning model, based on the received audio signal representation and on the determined estimated full noise profile, a resultant audio signal representation with a reduced noise level.
  • Detecting, using the first learning model, the one or more silent intervals may include segmenting the audio signal representation into multiple segments, each segment being shorter than an interval length of the received audio signal representation, transforming the multiple segments into a time-frequency representation, and processing the time-frequency representation of the multiple segments using a first learning machine, implementing the first learning model, to produce a silence vector that includes, for each of the multiple segments, a confidence value representative of a likelihood that the respective one of the multiple segments is a silent interval.
  • Processing the time-frequency representation may include encoding the time-frequency representation of the multiple segments with a 2D convolutional encoder to generate a 2D feature map, applying a learning network structure, comprising at least a bidirectional long short-term memory (LSTM) structure, to the 2D feature map to produce the silence vector, determining a noise mask from the silence vector, and generating, based on the audio signal representation and the noise mask, a partial noise profile for the audio signal representation.
  • determining the estimated full noise profile may include generating a partial noise profile representative of time-frequency characteristics of the detected one or more silent intervals, transforming the audio signal representation and the partial noise profile into respective time-frequency representations, applying convolutional encoding to the time-frequency representations of the audio signal representation and the partial noise profile to produce an encoded audio signal representation and encoded partial noise profile, and combining the encoded audio signal representation and the encoded partial noise profile to produce the estimated full noise profile.
  • generating the resultant audio signal representation with the reduced noise level may include generating time-frequency representations for the audio signal representation and the estimated full noise profile, and applying the second learning model to the time-frequency representations for the audio signal representation and the estimated full noise profile to generate the resultant audio signal representation.
  • the second learning model may be implemented with a bidirectional long short- term memory (LSTM) structure.
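  • For example, with the three trained components available as callables (their names, signatures, and the STFT parameters below are hypothetical), the method steps described above could be wired together roughly as:

```python
import numpy as np
from scipy.signal import stft, istft

def denoise(x, detector, noise_estimator, remover, sr=16000, nperseg=510, noverlap=330):
    """Sketch of the overall method: detect silent intervals, estimate the full
    noise profile, and generate a reduced-noise signal. Not the patented implementation."""
    _, _, S_x = stft(x, fs=sr, nperseg=nperseg, noverlap=noverlap)

    d = np.asarray(detector(np.abs(S_x)))                      # per-segment silence confidences
    m = np.repeat(d > 0.5, int(round(sr / 30)))                # expand to a sample-level mask
    m = np.pad(m, (0, max(0, len(x) - len(m))))[:len(x)]
    _, _, S_exposed = stft(x * m, fs=sr, nperseg=nperseg, noverlap=noverlap)

    S_noise = noise_estimator(np.abs(S_x), np.abs(S_exposed))  # estimated full noise profile
    c = remover(np.abs(S_x), S_noise)                          # complex ratio mask
    _, x_clean = istft(S_x * c, fs=sr, nperseg=nperseg, noverlap=noverlap)
    return x_clean
```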
  • implementation of the denoising processing described herein may be realized using one or more learning machines (such as neural networks).
  • Neural networks are in general composed of multiple layers of linear transformations (multiplications by a "weight" matrix), each followed by a nonlinear function (e.g., a rectified linear activation function, or ReLU, etc.)
  • the linear transformations are learned during training by making small changes to the weight matrices that progressively make the transformations more helpful to the final classification task (or some other type of desired output).
  • the layered network may include convolutional processes which are followed by pooling processes along with intermediate connections between the layers to enhance the sharing of information between the layers.
  • Learning engine approaches / architectures include generating an auto-encoder and using a dense layer of the network to correlate with the probability of a future event through a support vector machine, or constructing a regression or classification neural network model that predicts a specific output from input data (based on training reflective of the correlation between similar inputs and the output that is to be predicted).
  • Examples of neural networks include convolutional neural networks (CNN), feed-forward neural networks, and recurrent neural networks (RNN), e.g., implemented using long short-term memory (LSTM) structures, etc.
  • Feed-forward networks include one or more layers of learning nodes / elements with connections to one or more portions of the input data.
  • the connectivity of the inputs and layers of learning elements is such that input data and intermediate data propagate in a forward direction towards the network's output. There are typically no feedback loops or cycles in the configuration / structure of the feed-forward network.
  • Convolutional layers allow a network to efficiently learn features by applying the same learned transformation to subsections of the data.
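  • As a toy illustration of the "linear transformation followed by a nonlinear function" structure described above (the sizes and random weights are placeholders, not learned values):

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(64, 128)), np.zeros(64)   # weight matrix and bias (placeholders for learned values)

def layer(x):
    return np.maximum(W @ x + b, 0.0)             # linear transformation followed by a ReLU nonlinearity

x = rng.normal(size=128)                          # input feature vector
y = layer(x)                                      # 64-dimensional activation passed to the next layer
```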
  • the various learning processes implemented through use of the learning machines may be realized using keras (an open-source neural network library) building blocks and/or NumPy (an open-source programming library useful for realizing modules to process arrays) building blocks.
  • the various learning engine implementations may include a trained learning engine (e.g., a neural network) and a corresponding coupled learning engine controller / adapter configured to determine and/or adapt the parameters (e.g., neural network weights) of the learning engine that would produce desired output.
  • training data includes sets of input records along with corresponding data defining the ground truth for the input training records.
  • the adapter Upon completion of a training cycles by the adapter / controller coupled to a particular learning engine, the adapter provides data representative of updates / changes (e.g., in the form of parameter values / weights to be assigned to links of a neural-network-based learning engine) to the particular learning engine to cause the learning engine to be updated in accordance with the training cycle(s) completed.
  • The denoising processing may be implemented using a controller device (e.g., a processor-based computing device), which in some implementations may be incorporated into a verbal communication device such as a hearing aid device.
  • a controller device may include a processor-based device such as a computing device, and so forth, that typically includes a central processor unit or a processing core.
  • the device may also include one or more dedicated learning machines (e.g., neural networks) that may be part of the CPU or processing core.
  • the system includes main memory, cache memory and bus interface circuits.
  • the controller device may include a mass storage element, such as a hard drive (solid state hard drive, or other types of hard drive), or flash drive associated with the computer system.
  • the controller device may further include a keyboard, or keypad, or some other user input interface, and a monitor, e.g., an LCD (liquid crystal display) monitor, that may be placed where a user can access them.
  • The controller device is configured to facilitate, for example, the implementation of denoising processing.
  • the storage device may thus include a computer program product that when executed on the controller device (which, as noted, may be a programmable or processor-based device) causes the processor-based device to perform operations to facilitate the implementation of procedures and operations described herein.
  • the controller device may further include peripheral devices to enable input/output functionality.
  • peripheral devices may include, for example, flash drive (e.g., a removable flash drive), or a network connection (e.g., implemented using a USB port and/or a wireless transceiver), for downloading related content to the connected system.
  • Such peripheral devices may also be used for downloading software containing computer instructions to enable general operation of the respective system/device.
  • Some of the processing described herein may be performed, at least in part, by special purpose logic circuitry, e.g., an FPGA (field programmable gate array), an ASIC (application-specific integrated circuit), a DSP processor, a graphics processing unit (GPU), an accelerated processing unit (APU), an application processing unit, etc.
  • Other modules that may be included with the controller device may include a user interface to provide or receive input and output data.
  • sensor devices such as a microphone, a light-capture device (e.g., a CMOS-based or CCD-based camera device), other types of optical or electromagnetic sensors, sensors for measuring environmental conditions, etc., may be coupled to the controller device, and may be configured to observe or measure the signals or data to be processed.
  • the controller device may include an operating system.
  • Computer programs include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language.
  • machine-readable medium refers to any non-transitory computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a non-transitory machine-readable medium that receives machine instructions as a machine-readable signal.
  • any suitable computer readable media can be used for storing instructions for performing the processes / operations / procedures described herein.
  • computer readable media can be transitory or non-transitory.
  • Non-transitory computer readable media can include media such as magnetic media (such as hard disks, floppy disks, etc.), optical media (such as compact discs, digital video discs, Blu-ray discs, etc.), semiconductor media (such as flash memory, electrically programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), etc.), any suitable media that is not fleeting or not devoid of any semblance of permanence during transmission, and/or any suitable tangible media.
  • transitory computer readable media can include signals on networks, in wires, conductors, optical fibers, circuits, any suitable media that is fleeting and devoid of any semblance of permanence during transmission, and/or any suitable intangible media.
  • Noise is everywhere. When we listen to someone speak, the audio signals we receive are never pure and clean, but are always contaminated by all kinds of noise: cars passing by, spinning fans in an air conditioner, barking dogs, music from a loudspeaker, and so forth. To a large extent, people in a conversation can effortlessly filter out these noises (Ref. 40). In the same vein, numerous applications, ranging from cellular communications to human-robot interaction, rely on speech denoising algorithms as a fundamental building block.
  • speech denoising aims to separate the foreground (speech) signal from its additive background noise. This separation problem is inherently ill-posed.
  • Classic approaches such as spectral subtraction (Ref. 7, 91, 6, 66, 73) and Wiener filtering (Ref. 74, 38) conduct audio denoising in the spectral domain, and they are typically restricted to stationary or quasi-stationary noise.
  • FIG. 2 Silent intervals over time.
  • (Top) A speech signal has many natural pauses. Without any noise, these pauses are exhibited as silent intervals (highlighted in red).
  • (Bottom) However, most speech signals are contaminated by noise. Even with mild noise, silent intervals become overwhelmed and hard to detect. If robustly detected, silent intervals can help to reveal the noise profile over time.
  • The spectral subtraction method suffers from two major shortcomings: i) it requires user specification of a silent interval, i.e., it is not fully automatic; and ii) the single silent interval, although undemanding for the user, is insufficient in the presence of nonstationary noise, for example background music. Ubiquitous in daily life, nonstationary noise has time-varying spectral features. The single silent interval reveals the noise spectral features only in that particular time span, and is thus inadequate for denoising the entire input signal. The success of spectral subtraction pivots on the concept of a silent interval; so do its shortcomings.
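  • For reference, the classic spectral subtraction baseline discussed above can be sketched as follows; this is not the approach of the present disclosure, and the STFT parameters and spectral floor are assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, silent_slice, sr=16000, nperseg=510, floor=1e-3):
    """Classic spectral subtraction: estimate the noise magnitude spectrum from a
    user-specified silent interval and subtract it from every frame."""
    _, _, S = stft(x, fs=sr, nperseg=nperseg)
    _, _, S_noise = stft(x[silent_slice], fs=sr, nperseg=nperseg)

    noise_mag = np.abs(S_noise).mean(axis=1, keepdims=True)     # average noise spectrum from the silent interval
    clean_mag = np.maximum(np.abs(S) - noise_mag, floor)        # subtract, with a spectral floor
    _, x_clean = istft(clean_mag * np.exp(1j * np.angle(S)), fs=sr, nperseg=nperseg)
    return x_clean
```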
  • a network structure consisting of three major components (see Fig. 1): i) one dedicated to silent interval detection, ii) another that aims to estimate the full noise from those revealed in silent intervals, akin to an inpainting process in computer vision (Ref. 36), and iii) yet another for cleaning up the input signal.
  • Our neural-network-based denoising model accepts a single channel of audio signal and outputs the cleaned-up signal. Unlike some of the recent denoising methods that take as input audiovisual signals (i.e., both audio and video footage), our method can be applied in a wider range of scenarios (e.g., in cellular communication).
  • Speech denoising (Ref. 48) is a fundamental problem studied over several decades. Spectral subtraction (Ref. 7, 91, 6, 66, 73) estimates the clean signal spectrum by subtracting an estimate of the noise spectrum from the noisy speech spectrum. This classic method was followed by spectrogram factorization methods (Ref. 78). Wiener filtering (Ref. 74, 38) derives the enhanced signal by optimizing the mean-square error. Other methods exploit pauses in speech, forming segments of low acoustic energy where noise statistics can be more accurately measured (Ref. 13, 52, 79, 15, 69, 10, 11). Statistical model-based methods (Ref. 14, 32) and subspace algorithms (Ref. 12, 16) are also studied.
  • Audio signal processing methods operate on either the raw waveform or the spectrogram by Short-time Fourier Transform (STFT). Some work directly on waveform (Ref. 22, 62, 54, 50), and others use Wavenet (Ref. 84) for speech denoising (Ref. 68, 70, 28). Many other methods such as (Ref. 49, 87, 56, 92, 41, 100, 9) work on audio signal's spectrogram, which contains both magnitude and phase information. There are works discussing how to use the spectrogram to its best potential (Ref. 86, 61), while one of the disadvantages is that the inverse STFT needs to be applied. Meanwhile, there also exist works (Ref. 46, 27, 26, 88, 19, 94, 55) investigating how to overcome artifacts from time aliasing.
  • STFT Short-time Fourier Transform
  • Speech denoising has also been studied in conjunction with computer vision due to the relations between speech and facial features (Ref. 8).
  • Methods such as (Ref. 29, 24, 3, 34, 30) utilize different network structures to enhance the audio signal to the best of their ability.
  • Adeel et al. (Ref. 1) even utilize lip-reading to filter out the background noise of a speech.
  • Deep learning for other audio processing tasks: Deep learning is widely used for lip reading, speech recognition, speech separation, and many audio processing or audio-related tasks, with the help of computer vision (Ref. 58, 60, 5, 4). Methods such as (Ref. 45, 17, 59) are able to reconstruct speech from pure facial features. Methods such as (Ref. 2, 57) take advantage of facial features to improve speech recognition accuracy. Speech separation is one of the areas where computer vision is best leveraged. Methods such as (Ref. 23, 58, 18, 102) have achieved impressive results, making the previously impossible speech separation from a single audio signal possible. Recently, Zhang et al. (Ref. 101) proposed a new operation called Harmonic Convolution to help networks distill audio priors, which is shown to even further improve the quality of speech separation.
  • the first component is dedicated to detecting silent intervals in the input signal.
  • the input to this component is the spectrogram of the input (noisy) signal x.
  • the spectrogram S x is first encoded by a 2D convolutional encoder into a 2D feature map, which is in turn processed by a bidirectional LSTM (Ref. 33, 75) followed by two fully-connected (FC) layers (see network details in the following A).
  • the bidirectional LSTM is suitable for processing time-series features resulting from the spectrogram (Ref. 53, 39, 67, 18), and the FC layers are applied to the features of each time sample to accommodate variable length input.
  • the output from this network component is a vector D(S x ).
  • Each element of D(S x ) is a scalar in [0,1] (after applying the sigmoid function), indicating a confidence score of a small time segment being silent.
  • Figure 3 Example of intermediate and final results.
  • the black regions in (b) indicate ground-truth silent intervals.
  • the noise exposed by detected silent intervals i.e., the output of the silent interval detection component when the network is trained with silent interval supervision (recall Sec. 3.3).
  • the output vector D(S x ) is then expanded to a longer mask, which we denote as m(x).
  • Each element of this mask indicates the confidence of classifying each sample of the input signal x as pure noise (see Fig. 3-e).
  • With this mask, the noise exposed by silent intervals is estimated by an element-wise product, namely x ⊙ m(x).
  • Noise estimation: the result of silent interval detection is a noise profile exposed only through a series of time windows (see Fig. 3-e), but not a complete picture of the noise.
  • Since the input signal is a superposition of the clean speech signal and noise, having a complete noise profile would ease the denoising process, especially in the presence of nonstationary noise. Therefore, we also estimate the entire noise profile over time, which we do with a neural network.
  • Inputs to this component include both the noisy audio signal x and the noise exposed by the silent intervals, x ⊙ m(x). Both are converted by STFT into spectrograms.
  • The cleaned-up audio signal is obtained by applying the inverse STFT to the denoised spectrogram.
  • Noise gallery: We show four examples of noise from the noise datasets.
  • Noise 1) is a stationary (white) noise, and the other three are not.
  • Noise 2) is a monologue in a meeting.
  • Noise 3) is party noise from people speaking and laughing with background noise.
  • Noise 4) is street noise from people shouting and screaming with additional traffic noise such as vehicles driving and honking.
  • A signal-to-noise ratio (SNR) quantifies the level of the speech signal relative to the noise; a -10dB SNR means that the power of the noise is ten times that of the speech (see Fig. 7).
  • The SNR range in our evaluations, i.e., [-10dB, 10dB], is significantly larger than those tested in previous works.
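  • As an illustrative sketch (clip alignment and clipping handling are omitted), noisy test audio at a target SNR can be constructed by rescaling the noise relative to the clean speech:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals `snr_db`, then add
    it to the clean speech. At -10 dB the noise power is ten times the speech power."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```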
  • VSE is a learning-based method that takes both video and audio as input, and leverages both the audio signal and mouth motions (from video footage) for speech denoising.
  • Figure 5 Quantitative comparisons. We measure denoising quality under six metrics (corresponding to columns). The comparisons are conducted using noise from DEMAND and AudioSet separately. Ours-GTSI (in black) uses ground-truth silent intervals. Although not a practical approach, it serves as an upper-bound reference of all methods.
  • Figure 6 Denoise quality w.r.t. input SNRs. Denoise results measured in PESQ for each method w.r.t. different input SNRs. Results measured in other metrics are shown in Fig. 8.
  • Table 1 shows that, under all metrics, our method is consistently better than the alternatives. Between VAD and Baseline-thres, VAD has higher precision and lower recall, meaning that VAD is overly conservative and Baseline-thres is overly aggressive when detecting silent intervals (see Fig. 9). Our method reaches better balance and thus detects silent intervals more accurately.
  • Table 1 Results of silent interval detection. The metrics are measured using our test signals that have SNRs from -10dB to 10dB. Definitions of these metrics are summarized in the following C.1.
  • Speech denoising has been a long-standing challenge.
  • We present a new network structure that leverages the abundance of silent intervals in speech.
  • our network is able to denoise speech signals plausibly, and meanwhile, the ability to detect silent intervals automatically emerges. We reinforce this ability.
  • Our explicit supervision on silent intervals enables the network to detect them more accurately, thereby further improving the performance of speech denoising.
  • our method consistently outperforms several state-of-the-art audio denoising models.
  • the silent interval detection component of our model is composed of 2D convolutional layers, a bidirectional LSTM, and two FC layers.
  • the parameters of the convolutional layers are shown in Table 3. Each convolutional layer is followed by a batch normalization layer with a ReLU activation function.
  • the hidden size of bidirectional LSTM is 100.
  • the two FC layers, interleaved with a ReLU activation function, have hidden size of 100 and 1, respectively.
  • the noise estimation component of our model is fully convolutional, consisting of two encoders and one decoder.
  • the two encoders process the noisy signal and the incomplete noise profile, respectively; they have the same architecture (shown in Table 4) but different weights.
  • The two feature maps resulting from the two encoders are concatenated in a channel-wise manner before being fed into the decoder.
  • Every layer except the last one is followed by a batch normalization layer together with a ReLU activation function.
  • Table 4 Architecture of noise estimation component. 'C' indicates a convolutional layer, and 'TC' indicates a transposed convolutional layer.
  • the noise removal component of our model is composed of two 2D convolutional encoders, a bidirectional LSTM, and three FC layers.
  • the two convolutional encoders take as input the input audio spectrogram S x and the estimated full noise spectrogram respectively.
  • the first encoder has the network architecture listed in Table 5, and the second has the same architecture but with half of the number of filters at each convolutional layer.
  • The bidirectional LSTM has a hidden size of 200.
  • The three FC layers have hidden sizes of 600, 600, and 2F, respectively, where F is the number of frequency bins in the spectrogram.
  • ReLU is used after each layer except the last layer, which uses sigmoid.
  • Table 5 Convolutional encoder for the noise removal component of our model. Each convolutional layer is followed by a batch normalization layer with ReLU as the activation function.
  • Figure 7 Constructed noisy audio based on different SNR levels. The first row shows the waveform of the ground truth clean input.
  • Each 2-second clip yields a (complex-valued) spectrogram with a resolution of 256 × 178, where 256 is the number of frequency bins and 178 is the temporal resolution.
  • our model can still accept audio clips with arbitrary length.
  • To supervise our silent interval detection, we label the clean audio signals in the following way. We first normalize each audio clip so that its magnitude is in the range [-1,1], that is, ensuring the largest waveform magnitude is at -1 or 1. Then, the clean audio clip is divided into segments of length 1/30 seconds. We label a time segment as a "silent" segment (i.e., label 0) if its average waveform energy in that segment is below 0.08. Otherwise, it is labeled as a "non-silent" segment (i.e., label 1).
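  • This labeling rule translates into a few lines of NumPy (the sampling rate is an assumption, and mean absolute amplitude is used here as the "average waveform energy"):

```python
import numpy as np

def label_silent_segments(clean, sr=16000, seg_dur=1/30, threshold=0.08):
    """Label each 1/30-second segment of a clean clip as silent (0) or non-silent (1)."""
    clean = clean / (np.max(np.abs(clean)) + 1e-12)       # normalize magnitude to [-1, 1]
    seg_len = int(round(sr * seg_dur))
    n_segs = len(clean) // seg_len
    segs = clean[:n_segs * seg_len].reshape(n_segs, seg_len)
    energy = np.mean(np.abs(segs), axis=1)                # per-segment "average waveform energy" (assumed mean |x|)
    return (energy >= threshold).astype(int)              # 0 = silent, 1 = non-silent
```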
  • Detecting silent intervals is a binary classification task, one that classifies every time segment as being silent (i.e., a positive condition) or not (i.e., a negative condition). Recall that the confusion matrix in a binary classification task is as follows:
  • a true positive (TP) sample is a correctly predicted silent segment.
  • a true negative (TN) sample is a correctly predicted non-silent segment.
  • a false positive (FP) sample is a non-silent segment predicted as silent.
  • a false negative (FN) sample is a silent segment predicted as non-silent.
  • N_TP, N_TN, N_FP, and N_FN indicate the numbers of true positive, true negative, false positive, and false negative predictions among all tests.
  • Recall indicates the ability to correctly find all true silent intervals.
  • Precision measures the proportion of the intervals labeled as silent that are truly silent.
  • F1 score takes both precision and recall into account, and produces their harmonic mean.
  • accuracy is the ratio of correct predictions among all predictions.
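  • These four metrics follow directly from the confusion-matrix counts; a compact implementation:

```python
def detection_metrics(n_tp, n_tn, n_fp, n_fn):
    """Precision, recall, F1, and accuracy for silent interval detection,
    computed from the confusion-matrix counts defined above."""
    precision = n_tp / (n_tp + n_fp)
    recall = n_tp / (n_tp + n_fn)
    f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall
    accuracy = (n_tp + n_tn) / (n_tp + n_tn + n_fp + n_fn)
    return precision, recall, f1, accuracy
```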
  • Figure 9 An example of silent interval detection results. Provided an input signal whose SNR is 0dB (top-left), we show the silent intervals (in red) detected by three approaches: our method, Baseline-thres, and VAD. We also show ground-truth silent intervals in top-left.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Disclosed are methods, systems, devices, and other implementations, including a method that involves: receiving an audio signal representation; detecting in the received audio signal representation, using a first learning model, one or more silent intervals with reduced foreground sound levels; determining, based on the detected one or more silent intervals, an estimated full noise profile corresponding to the audio signal representation; and generating with a second learning model, based on the received audio signal representation and on the determined estimated full noise profile, a resultant audio signal representation with a reduced noise level.
PCT/JP2021/027243 2020-11-20 2021-07-20 Neural-network-based approach for speech denoising WO2022107393A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2023530195A JP2023552090A (ja) 2020-11-20 2021-07-20 Neural-network-based approach for speech denoising
US18/320,206 US11894012B2 (en) 2020-11-20 2023-05-19 Neural-network-based approach for speech denoising

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063116400P 2020-11-20 2020-11-20
US63/116,400 2020-11-20

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/320,206 Continuation US11894012B2 (en) 2020-11-20 2023-05-19 Neural-network-based approach for speech denoising

Publications (1)

Publication Number Publication Date
WO2022107393A1 true WO2022107393A1 (fr) 2022-05-27

Family

ID=81708760

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/027243 WO2022107393A1 (fr) 2020-11-20 2022-05-27 Neural-network-based approach for speech denoising

Country Status (3)

Country Link
US (1) US11894012B2 (fr)
JP (1) JP2023552090A (fr)
WO (1) WO2022107393A1 (fr)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230232171A1 (en) * 2022-01-14 2023-07-20 Chromatic Inc. Method, Apparatus and System for Neural Network Hearing Aid
US11818523B2 (en) 2022-01-14 2023-11-14 Chromatic Inc. System and method for enhancing speech of target speaker from audio signal in an ear-worn device using voice signatures
US11832061B2 (en) 2022-01-14 2023-11-28 Chromatic Inc. Method, apparatus and system for neural network hearing aid
US11849286B1 (en) 2021-10-25 2023-12-19 Chromatic Inc. Ear-worn device configured for over-the-counter and prescription use
US11877125B2 (en) 2022-01-14 2024-01-16 Chromatic Inc. Method, apparatus and system for neural network enabled hearing aid
US11894012B2 (en) 2020-11-20 2024-02-06 The Trustees Of Columbia University In The City Of New York Neural-network-based approach for speech denoising
WO2024029771A1 (fr) * 2022-08-05 2024-02-08 Samsung Electronics Co., Ltd. Procédé, appareil et support lisible par ordinateur pour générer un signal vocal filtré à l'aide de réseaux de débruitage de la parole sur la base de la modélisation de la parole et du bruit
US11902747B1 (en) 2022-08-09 2024-02-13 Chromatic Inc. Hearing loss amplification that amplifies speech and noise subsignals differently

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02253298A (ja) * 1989-03-28 1990-10-12 Sharp Corp 音声通過フィルタ
JPH06282297A (ja) * 1993-03-26 1994-10-07 Idou Tsushin Syst Kaihatsu Kk 音声符号化方式

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1494208A1 (fr) * 2003-06-30 2005-01-05 Harman Becker Automotive Systems GmbH Méthode pour controler un système de dialogue vocal et système de dialogue vocal
US7725314B2 (en) * 2004-02-16 2010-05-25 Microsoft Corporation Method and apparatus for constructing a speech filter using estimates of clean speech and noise
US7464029B2 (en) * 2005-07-22 2008-12-09 Qualcomm Incorporated Robust separation of speech signals in a noisy environment
US10134425B1 (en) * 2015-06-29 2018-11-20 Amazon Technologies, Inc. Direction-based speech endpointing
US9892731B2 (en) * 2015-09-28 2018-02-13 Trausti Thor Kristjansson Methods for speech enhancement and speech recognition using neural networks
CN108346428B (zh) * 2017-09-13 2020-10-02 腾讯科技(深圳)有限公司 语音活动检测及其模型建立方法、装置、设备及存储介质
US11527259B2 (en) * 2018-02-20 2022-12-13 Mitsubishi Electric Corporation Learning device, voice activity detector, and method for detecting voice activity
US10923139B2 (en) * 2018-05-02 2021-02-16 Melo Inc. Systems and methods for processing meeting information obtained from multiple sources
US20210174791A1 (en) * 2018-05-02 2021-06-10 Melo Inc. Systems and methods for processing meeting information obtained from multiple sources
US10714122B2 (en) * 2018-06-06 2020-07-14 Intel Corporation Speech classification of audio for wake on voice
WO2019246314A1 (fr) * 2018-06-20 2019-12-26 Knowles Electronics, Llc Interface utilisateur vocale sensible à l'acoustique
US10210860B1 (en) * 2018-07-27 2019-02-19 Deepgram, Inc. Augmented generalized deep learning with special vocabulary
WO2020041497A1 (fr) * 2018-08-21 2020-02-27 2Hz, Inc. Systèmes et procédés d'amélioration de la qualité vocale et de suppression de bruit
US20200074997A1 (en) * 2018-08-31 2020-03-05 CloudMinds Technology, Inc. Method and system for detecting voice activity in noisy conditions
US10937443B2 (en) * 2018-09-04 2021-03-02 Babblelabs Llc Data driven radio enhancement
US11011182B2 (en) * 2019-03-25 2021-05-18 Nxp B.V. Audio processing system for speech enhancement
US11127394B2 (en) * 2019-03-29 2021-09-21 Intel Corporation Method and system of high accuracy keyphrase detection for low resource devices
US20210020191A1 (en) * 2019-07-18 2021-01-21 DeepConvo Inc. Methods and systems for voice profiling as a service
KR20210078133A (ko) * 2019-12-18 2021-06-28 엘지전자 주식회사 간투어 검출 모델을 훈련시키기 위한 훈련 데이터 생성 방법 및 장치
KR20220120584A (ko) * 2019-12-30 2022-08-30 애리스 엔터프라이지즈 엘엘씨 주변 잡음 보상을 이용한 자동 볼륨 제어 장치 및 방법
US11741943B2 (en) * 2020-04-27 2023-08-29 SoundHound, Inc Method and system for acoustic model conditioning on non-phoneme information features
US11678120B2 (en) * 2020-05-14 2023-06-13 Nvidia Corporation Audio noise determination using one or more neural networks
US20220092389A1 (en) * 2020-09-21 2022-03-24 Aondevices, Inc. Low power multi-stage selectable neural network suppression
WO2022107393A1 (fr) 2020-11-20 2022-05-27 The Trustees Of Columbia University In The City Of New York Neural-network-based approach for speech denoising

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02253298A (ja) * 1989-03-28 1990-10-12 Sharp Corp 音声通過フィルタ
JPH06282297A (ja) * 1993-03-26 1994-10-07 Idou Tsushin Syst Kaihatsu Kk 音声符号化方式

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11894012B2 (en) 2020-11-20 2024-02-06 The Trustees Of Columbia University In The City Of New York Neural-network-based approach for speech denoising
US11849286B1 (en) 2021-10-25 2023-12-19 Chromatic Inc. Ear-worn device configured for over-the-counter and prescription use
US20230232171A1 (en) * 2022-01-14 2023-07-20 Chromatic Inc. Method, Apparatus and System for Neural Network Hearing Aid
US20230254651A1 (en) * 2022-01-14 2023-08-10 Chromatic Inc. Method, apparatus and system for neural network hearing aid
US11812225B2 (en) 2022-01-14 2023-11-07 Chromatic Inc. Method, apparatus and system for neural network hearing aid
US11818547B2 (en) * 2022-01-14 2023-11-14 Chromatic Inc. Method, apparatus and system for neural network hearing aid
US11818523B2 (en) 2022-01-14 2023-11-14 Chromatic Inc. System and method for enhancing speech of target speaker from audio signal in an ear-worn device using voice signatures
US11832061B2 (en) 2022-01-14 2023-11-28 Chromatic Inc. Method, apparatus and system for neural network hearing aid
US11877125B2 (en) 2022-01-14 2024-01-16 Chromatic Inc. Method, apparatus and system for neural network enabled hearing aid
US11950056B2 (en) 2022-01-14 2024-04-02 Chromatic Inc. Method, apparatus and system for neural network hearing aid
WO2024029771A1 (fr) * 2022-08-05 2024-02-08 Samsung Electronics Co., Ltd. Procédé, appareil et support lisible par ordinateur pour générer un signal vocal filtré à l'aide de réseaux de débruitage de la parole sur la base de la modélisation de la parole et du bruit
US11902747B1 (en) 2022-08-09 2024-02-13 Chromatic Inc. Hearing loss amplification that amplifies speech and noise subsignals differently

Also Published As

Publication number Publication date
JP2023552090A (ja) 2023-12-14
US20230306981A1 (en) 2023-09-28
US11894012B2 (en) 2024-02-06

Similar Documents

Publication Publication Date Title
US11894012B2 (en) Neural-network-based approach for speech denoising
Xu et al. Listening to sounds of silence for speech denoising
Gabbay et al. Visual speech enhancement
Triantafyllopoulos et al. Towards robust speech emotion recognition using deep residual networks for speech enhancement
Wang et al. A multiobjective learning and ensembling approach to high-performance speech enhancement with compact neural network architectures
Le Cornu et al. Generating intelligible audio speech from visual speech
Poorjam et al. Automatic quality control and enhancement for voice-based remote Parkinson’s disease detection
Shahin Novel third-order hidden Markov models for speaker identification in shouted talking environments
Ideli et al. Visually assisted time-domain speech enhancement
CN115881164A (zh) 一种语音情感识别方法及系统
Wang et al. Fusing bone-conduction and air-conduction sensors for complex-domain speech enhancement
Xian et al. Multi-scale residual convolutional encoder decoder with bidirectional long short-term memory for single channel speech enhancement
Xu et al. Improving visual speech enhancement network by learning audio-visual affinity with multi-head attention
Nguyen et al. Feature adaptation using linear spectro-temporal transform for robust speech recognition
Soni et al. State-of-the-art analysis of deep learning-based monaural speech source separation techniques
Jannu et al. Multi-stage progressive learning-based speech enhancement using time–frequency attentive squeezed temporal convolutional networks
Hepsiba et al. Enhancement of single channel speech quality and intelligibility in multiple noise conditions using wiener filter and deep CNN
Richter et al. Audio-visual speech enhancement with score-based generative models
Chen et al. CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application
Xu et al. Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement
Singh et al. Speech enhancement for Punjabi language using deep neural network
Xu et al. MFFCN: multi-layer feature fusion convolution network for audio-visual speech enhancement
Parvathala et al. Neural comb filtering using sliding window attention network for speech enhancement
Chhetri et al. Speech Enhancement: A Survey of Approaches and Applications
Samui et al. Deep Recurrent Neural Network Based Monaural Speech Separation Using Recurrent Temporal Restricted Boltzmann Machines.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21894268

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023530195

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21894268

Country of ref document: EP

Kind code of ref document: A1