CN109841226B - Single-channel real-time noise reduction method based on convolution recurrent neural network - Google Patents

Info

Publication number: CN109841226B (granted); earlier published as application CN109841226A
Application number: CN201811010712.6A
Authority: CN (China)
Legal status: Active
Inventor: not disclosed (不公告发明人)
Assignee: Elevoc Technology Co., Ltd.
Original language: Chinese (zh)
Related PCT application: PCT/CN2019/090530 (WO2020042707A1)
Keywords: neural network, recurrent neural network, acoustic features, convolutional, noise reduction

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis characterised by the analysis technique
    • G10L25/30: Speech or voice analysis characterised by the analysis technique using neural networks


Abstract

The disclosure describes a single-channel real-time noise reduction method and apparatus based on a convolutional recurrent neural network, together with an electronic device and a storage medium, belonging to the field of computer technology. The method comprises the following steps: extracting acoustic features from a received single-channel sound signal; performing iterative operations on the acoustic features in a pre-trained convolutional recurrent neural network model to calculate a ratio mask for the acoustic features; masking the acoustic features with the ratio mask; and synthesizing the masked acoustic features with the phase of the single-channel sound signal to obtain a speech signal. The method and apparatus reduce the number of neural network parameters, the data storage requirements, and the demands on system data bandwidth, greatly improving the real-time performance of single-channel speech noise reduction while achieving good noise reduction quality.

Description

Single-channel real-time noise reduction method based on convolution recurrent neural network
Technical Field
The present disclosure relates to the field of computer application technologies, and in particular, to a single-channel real-time noise reduction method and apparatus based on a convolutional recurrent neural network, an electronic device, and a storage medium.
Background
Voice noise reduction refers to separating a target speech signal from background noise in order to eliminate or suppress that noise. Single-channel speech is a signal recorded by a single microphone, and single-channel speech noise reduction can be applied to a wider range of acoustic scenes than beamforming-based noise reduction (i.e., spatial filtering through an appropriately configured microphone array). Single-channel noise reduction is not only cost-effective but also easier to deploy in practice. Furthermore, single-channel speech separation can be used to enhance beamforming and associated microphone-array techniques.
Single-channel speech noise reduction is particularly difficult because, unlike with a microphone array, no spatial information is available as a reference. Recently, single-channel noise reduction has been treated as supervised learning, recasting the signal processing problem as a supervised learning task. Traditional signal processing methods, typified by classical speech enhancement, rest on general statistical assumptions about background noise and speech, whereas supervised learning methods are data driven and learn automatically from specific training samples. The introduction of supervised learning has thus marked a leap for single-channel speech noise reduction technology. However, conventional supervised single-channel noise reduction methods use many network parameters and complex models, which hurts both the real-time performance and the noise reduction quality.
Disclosure of Invention
In order to solve the technical problems in the related art that single-channel speech noise reduction requires many network parameters and a complex model, the present disclosure provides a single-channel real-time noise reduction method, apparatus, and terminal based on a convolutional recurrent neural network.
In a first aspect, a single-channel real-time noise reduction method based on a convolutional recurrent neural network is provided, which includes:
extracting acoustic features from the received single-channel sound signal;
performing iterative operations on the acoustic features in a pre-trained convolutional recurrent neural network model, and calculating a ratio mask for the acoustic features;
masking the acoustic features with the ratio mask;
and synthesizing the masked acoustic features with the phase of the single-channel sound signal to obtain a noise-reduced sound signal.
Optionally, the step of extracting the acoustic features from the received single-channel sound signal includes:
dividing the received single-channel sound signal into time frames according to a preset time period;
extracting a spectral magnitude vector from the time frame;
and normalizing the spectral magnitude vectors to form acoustic features.
Optionally, the step of performing normalization processing on the spectral magnitude vector to form an acoustic feature includes:
and combining the frequency spectrum amplitude vectors of the current time frame and the past time frame for normalization processing to form the acoustic features.
Optionally, the step of performing normalization processing on the spectral magnitude vector to form an acoustic feature includes:
and combining the spectral amplitude vectors of the current time frame, the past time frame and the future time frame for normalization processing to form the acoustic features.
Optionally, before the step of performing iterative operations on the acoustic features in a pre-trained convolutional recurrent neural network model and calculating a ratio mask for the acoustic features, the method further includes:
combining a convolutional neural network with a long short-term memory (LSTM) recurrent neural network to obtain a convolutional recurrent neural network;
and training a pre-collected voice training set through the convolution recurrent neural network to construct the convolution recurrent neural network model.
Optionally, the convolutional neural network is a convolutional encoder-decoder structure, the encoder includes a set of convolutional layers and pooling layers, the decoder structure is the same as the encoder in reverse order, and the output of the encoder is connected to the input of the decoder.
Optionally, the recurrent neural network with long-short term memory comprises two stacked long-short term memory layers.
Optionally, the step of combining the convolutional neural network with a recurrent neural network with long-term and short-term memory to obtain a convolutional recurrent neural network includes:
incorporating two stacked long short-term memory layers between the encoder and the decoder of the convolutional neural network to construct the convolutional recurrent neural network.
Optionally, each convolutional or pooling layer in the convolutional neural network comprises a maximum of 16 kernels, and each long-short term memory layer of the recurrent neural network with long-short term memory comprises 64 neurons.
Optionally, the speech training set is formed by mixing background noise collected in everyday environments with a variety of male and female voices at specified signal-to-noise ratios.
In a second aspect, a single-channel real-time noise reduction apparatus is provided, including:
the acoustic feature extraction module is used for extracting acoustic features from the received single-channel sound signal;
the ratio mask calculation module is used for performing iterative operations on the acoustic features in a pre-constructed convolutional recurrent neural network model and calculating the ratio mask for the acoustic features;
a masking module for masking the acoustic features with the ratio mask;
and the speech synthesis module is used for synthesizing the masked acoustic features with the phase of the single-channel sound signal to obtain a noise-reduced speech signal.
Optionally, an ideal ratio mask is used as a training target of the convolutional recurrent neural network.
In a third aspect, an electronic device is provided, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
In a fourth aspect, there is provided a computer readable storage medium storing a program which, when executed, causes an electronic device to perform the method of the first aspect.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
When performing single-channel real-time noise reduction, acoustic features are extracted from the received single-channel sound signal; iterative operations in a pre-trained convolutional recurrent neural network model calculate a ratio mask for the acoustic features, the mask is applied to the features, and the masked acoustic features are then synthesized with the phase of the single-channel sound signal to obtain the speech signal.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the scope of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a flow diagram illustrating a single-channel real-time noise reduction method based on a convolutional recurrent neural network according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating an implementation of step S110 in the single-channel real-time noise reduction method based on the convolutional recurrent neural network according to the corresponding embodiment of fig. 1.
Fig. 3 is a flowchart illustrating an implementation of step S120 in the single-channel real-time noise reduction method based on the convolutional recurrent neural network in fig. 1.
Fig. 4 is a flow diagram illustrating single-channel real-time noise reduction according to an example embodiment.
Fig. 5 is a graphical representation of the predicted spectral magnitudes when the CRN model is not compressed.
FIG. 6 is a diagram illustrating the predicted spectral magnitudes after compression of a CRN model.
FIG. 7 is a diagram comparing STOI for LSTM-model training, CRN-model training, and the unprocessed condition in a multi-talker babble noise scenario, according to an exemplary embodiment.
FIG. 8 is a graph comparing STOI for LSTM-model training, CRN-model training, and the unprocessed condition in a coffee-shop noise scenario, according to an exemplary embodiment.
Fig. 9 is a spectrogram of an unprocessed (noisy) sound signal in a multi-talker babble noise scene at -5 dB SNR (signal-to-noise ratio), according to an example embodiment.
FIG. 10 is a clean speech spectral diagram corresponding to FIG. 9, according to an example embodiment.
FIG. 11 is a graph of a spectrum output after noise reduction using a CRN model according to an example embodiment.
Fig. 12 is a block diagram illustrating a single channel real-time noise reduction apparatus according to an exemplary embodiment.
Fig. 13 is a block diagram of the acoustic feature extraction module 110 in the single-channel real-time noise reduction apparatus according to the corresponding embodiment of fig. 12.
FIG. 14 is a block diagram of the ratio mask calculation module 120 shown in the corresponding embodiment of FIG. 12.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Fig. 1 is a flow diagram illustrating a single-channel real-time noise reduction method based on a convolutional recurrent neural network according to an exemplary embodiment. The single-channel real-time noise reduction method based on the convolutional recurrent neural network can be used for electronic equipment such as smart phones and computers. As shown in fig. 1, the single-channel real-time noise reduction method based on the convolutional recurrent neural network may include step S110, step S120, step S130, and step S140.
Step S110, extracting acoustic features from the received single-channel sound signal.
The single channel sound signal is a signal to be subjected to real-time noise reduction processing.
Typically, a single channel sound signal contains speech and non-human interference noise.
When performing real-time single-channel noise reduction, the electronic device may receive the single-channel sound signal from a recording device such as a microphone, from another electronic device, or through other means not enumerated here.
The acoustic features are data features that can characterize a single channel sound signal.
When extracting acoustic features from a received single-channel sound signal, a short-time Fourier transform (STFT), a wavelet transform, or another method may be used to extract the acoustic features from the signal.
Alternatively, as shown in fig. 2, step S110 may include step S111, step S112, and step S113.
Step S111, dividing the received single channel sound signal into time frames according to a preset time period.
The preset time period is a preset time interval period, and the single-channel sound signal is divided into a plurality of time frames according to the preset time period.
In a specific exemplary embodiment, the received single-channel sound signal is divided into time frames by 20 milliseconds per frame, with an overlap of 10 milliseconds between every two adjacent time frames.
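The framing just described can be sketched as follows. The 16 kHz sample rate (making 320 samples = 20 ms and a 160-sample hop the 10 ms overlap between adjacent frames) is an assumption for illustration; the patent does not state a sample rate.

```python
import numpy as np

def frame_signal(x, frame_len=320, hop=160):
    """Split a 1-D signal into overlapping time frames: 320-sample
    frames with a 160-sample hop give 20 ms frames overlapping by
    10 ms at an assumed 16 kHz sample rate."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

x = np.arange(16000, dtype=np.float64)  # 1 second of dummy samples
frames = frame_signal(x)
```

Each frame then feeds the STFT of step S112; adjacent frames share their second/first halves, which is what the 10 ms overlap means in samples.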
Step S112, a spectral magnitude vector is extracted from the time frame.
Step S113, normalizing the spectral magnitude vectors to form acoustic features.
In an exemplary embodiment, an STFT is applied to each time frame to extract spectral magnitude vectors, each of which, after normalization, forms acoustic features.
Optionally, since speech signals carry strong temporal context, context information is integrated by concatenating several consecutive frames centered on the current time frame into a larger vector, further improving single-channel speech noise reduction performance.
For example, when the spectral magnitude vectors are normalized, the vectors of the current time frame and past time frames are combined and normalized to form the acoustic features.
Future time frames cannot be used when a high level of real-time performance is required, as in noise reduction applications such as mobile communications and hearing aids. For such real-time applications, the spectral magnitude vectors of the current time frame and past time frames are merged during normalization. Specifically, the previous 5 frames and the current time frame are spliced into a unified feature vector as the input of the present invention. Fewer than 5 past frames may also be used, further saving computation time and improving real-time performance at the cost of some noise reduction quality.
For another example, when the spectral magnitude vector is normalized, the spectral magnitude vectors of the current time frame, the past time frame, and the future time frame are combined and normalized to form the acoustic feature.
For applications that do not require real-time processing, such as Automatic Speech Recognition (ASR), future time frames may be used as input. Specifically, splicing 7 time frames (the current time frame, 1 future time frame, and 5 past time frames) into a unified feature vector as the input of the present invention improves STOI (Short-Time Objective Intelligibility) by about one percentage point compared with the configuration without a future time frame. STOI is an important indicator for evaluating speech noise reduction performance; its typical values range between 0 and 1 and can be interpreted as the percentage of intelligible speech.
Therefore, when extracting acoustic features from a single-channel sound signal, the signal is first divided into time frames of a preset period; an appropriate period provides the input for the subsequent noise reduction processing based on the features extracted from each frame, and selectively combining the current frame's spectral magnitude vector with those of past (and, where permissible, future) frames further improves noise reduction performance.
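The two splicing configurations described above (real-time: 5 past frames plus the current frame; offline: additionally 1 future frame) can be sketched as follows. The edge-padding choice for the first/last frames and the 161-bin feature size are illustrative assumptions, not details the patent specifies.

```python
import numpy as np

def splice_frames(mags, past=5, future=0):
    """Concatenate each frame's spectral magnitude vector with its
    temporal context. past=5, future=0 matches the real-time setup;
    past=5, future=1 matches the 7-frame offline (ASR) setup.
    Edge frames are padded by repeating the first/last frame."""
    padded = np.pad(mags, ((past, future), (0, 0)), mode='edge')
    n = len(mags)
    # Block i holds frame t - past + i; the last block is the current frame.
    return np.hstack([padded[i:i + n] for i in range(past + future + 1)])

mags = np.random.rand(100, 161)              # 100 frames, 161 frequency bins
rt = splice_frames(mags, past=5, future=0)   # real-time: 6 frames spliced
off = splice_frames(mags, past=5, future=1)  # offline: 7 frames spliced
```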
Step S120, performing iterative operations on the acoustic features in a pre-trained convolutional recurrent neural network model, and calculating a ratio mask for the acoustic features.
The ratio mask characterizes the relationship between the noisy speech signal and the clean speech signal, expressing a trade-off between suppressing noise and retaining speech.
Ideally, after the noisy speech signal is masked by the ratio mask, the clean speech signal can be recovered from the noisy speech.
The convolutional recurrent neural network model is trained in advance.
The acoustic features obtained in step S110 are used as the input of the convolutional recurrent neural network model; iterative operations are performed in the model, and a ratio mask is calculated for the acoustic features.
In this step, an IRM (Ideal Ratio Mask) is the target of the iterative operation. The IRM for each time-frequency cell in a spectrogram can be expressed by the following equation:

IRM(t, f) = sqrt( S_FFT(t, f)^2 / ( S_FFT(t, f)^2 + N_FFT(t, f)^2 ) )

where S_FFT(t, f) and N_FFT(t, f) denote the clean-speech magnitude spectrum and the noise magnitude spectrum at time frame t and frequency bin f, respectively.
The ideal ratio mask is predicted during supervised training; the predicted mask is then applied to the acoustic features to obtain the noise-reduced speech signal.
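A minimal sketch of the IRM computation, assuming the standard square-root form of the ideal ratio mask over the clean and noise magnitude spectra (the small epsilon guarding empty time-frequency cells is an implementation choice, not from the patent):

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    """IRM(t, f) = sqrt(S^2 / (S^2 + N^2)) per time-frequency cell,
    computed from the clean-speech and noise magnitude spectra."""
    s2 = clean_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))

m = ideal_ratio_mask(np.array([1.0, 3.0]), np.array([1.0, 4.0]))
```

With equal speech and noise energy the mask is 1/sqrt(2); as noise vanishes it approaches 1, so masking leaves speech-dominated cells nearly untouched.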
Step S130, masking the acoustic features with the ratio mask.
And step S140, synthesizing the masked acoustic features and the phase of the single-channel sound signal to obtain a noise-reduction voice signal.
After training is complete, the trained CRN (Convolutional Recurrent Network) can be used for speech noise reduction. Using a trained neural network for a particular application is referred to as inference. In the inference phase, the layers of the CRN model process the noisy signal; the T-F (time-frequency) mask is derived from the inference results and is then used to weight (mask) the noisy speech magnitudes, producing an enhanced speech signal that is clearer than the original noisy input.
In an exemplary embodiment, the masked spectral magnitude vector is sent to an inverse Fourier transform along with the phase of the single channel sound signal to derive the speech signal in the corresponding time domain.
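The masking-and-resynthesis step can be sketched as an overlap-add round trip. The rectangular (window-free) analysis, the 16 kHz-based frame sizes, and the identity-mask check are illustrative assumptions, not the patent's exact implementation:

```python
import numpy as np

FRAME, HOP = 320, 160  # 20 ms frames, 10 ms hop, assuming a 16 kHz rate

def analyze(x):
    """Rectangular-window STFT of the framed signal (no analysis
    window, kept minimal for illustration)."""
    n = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[t * HOP:t * HOP + FRAME] for t in range(n)])
    return np.fft.rfft(frames, axis=1)

def synthesize(noisy_spec, mask):
    """Apply the ratio mask to the noisy magnitudes, keep the noisy
    phase, inverse-FFT each frame, and overlap-add; interior samples
    are covered by exactly two frames and are therefore halved."""
    masked = mask * np.abs(noisy_spec) * np.exp(1j * np.angle(noisy_spec))
    out = np.zeros(HOP * (len(masked) - 1) + FRAME)
    for t, spec in enumerate(masked):
        out[t * HOP:t * HOP + FRAME] += np.fft.irfft(spec, n=FRAME)
    out[HOP:-HOP] /= 2.0
    return out

x = np.sin(2 * np.pi * 440 * np.arange(3200) / 16000.0)
spec = analyze(x)
y = synthesize(spec, np.ones(spec.shape))  # identity mask: passthrough
```

With an all-ones mask the pipeline reconstructs the input exactly, which is a convenient sanity check before plugging in a predicted mask.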
With this method, during single-channel real-time noise reduction, acoustic features are extracted from the received single-channel sound signal, iterative operations in a pre-trained convolutional recurrent neural network model compute a ratio mask for the acoustic features, and the mask is then applied to the features. The masked acoustic features are synthesized with the phase of the single-channel sound signal to obtain the speech signal. Because a pre-trained convolutional recurrent neural network model is used, the number of neural network parameters is greatly reduced, lowering data storage and system bandwidth requirements, so good noise reduction performance is achieved while the real-time performance of single-channel speech noise reduction is greatly improved.
Fig. 3 is a flowchart illustrating a specific implementation of step S120 in the single-channel real-time noise reduction method based on the convolutional recurrent neural network according to the corresponding embodiment of fig. 1. As shown in fig. 3, the single-channel real-time noise reduction method based on the convolutional recurrent neural network may include steps S121 and S122.
And step S121, combining the convolutional neural network and the recurrent neural network with long-term and short-term memory to obtain the convolutional recurrent neural network.
A CNN (Convolutional Neural Network) is an efficient recognition method that has attracted wide attention in recent years. In the 1960s, Hubel and Wiesel, studying neurons responsible for local sensitivity and orientation selection in the cat's visual cortex, discovered a distinctive network structure that effectively reduces the complexity of feedback neural networks; network architectures similar to convolutional neural networks were subsequently proposed. CNNs have become a research hotspot in many scientific fields, especially pattern classification, and are widely applied because they avoid complex image preprocessing and can take raw images directly as input. The neocognitron proposed by Fukushima in 1980 was the first network to realize the convolutional neural network idea; among later designs, the version proposed by LeCun et al. is representative.
The convolution operation in a CNN is defined on a two-dimensional structure with local receptive fields, where each low-level feature is connected only to a subset of the input, such as a topological neighborhood. These local connectivity constraints make the weight matrix very sparse, so two layers connected by a convolution are only locally connected. Multiplying such sparse matrices is cheaper and more convenient than dense matrix multiplication, and the smaller number of free parameters is also statistically beneficial. In an image with two-dimensional topology, the same input pattern can appear at different positions, and nearby values tend to depend strongly on one another, which is very important for modeling the data.
A CNN reduces the number of parameters to be learned through weight sharing, lowering model complexity and greatly reducing the number of network weights. Compared with an ordinary feedforward network trained by error back-propagation (BP), training speed and accuracy improve substantially. As a deep learning algorithm, a CNN also minimizes the overhead of data preprocessing.
A Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) (hereinafter simply "LSTM") is a recurrent neural network for temporal data, first published in 1997. Owing to its distinctive design, LSTM is well suited to processing and predicting significant events separated by long and variable intervals in a time series.
LSTM generally outperforms plain recurrent neural networks and Hidden Markov Models (HMMs), for example in unsegmented continuous handwriting recognition; in 2009, an LSTM-based artificial neural network model won the ICDAR handwriting recognition competition. LSTM is also commonly used for automatic speech recognition, setting a record 17.7% error rate on the TIMIT natural speech database in 2013. As a nonlinear model, LSTM can serve as a complex nonlinear unit for constructing larger deep neural networks.
LSTM is a specific type of RNN that can effectively capture long-term dependencies in a sound signal. Compared with a traditional RNN, LSTM alleviates the vanishing and exploding gradient problems that arise during training. An LSTM memory cell has three gates: an input gate, a forget gate, and an output gate. The input gate controls how much of the current information is added to the memory cell, the forget gate controls how much previous information is retained, and the output gate controls whether information is output. Specifically, the LSTM can be described by the following equations.
i_t = σ(W_ii x_t + b_ii + W_hi h_(t-1) + b_hi)
f_t = σ(W_if x_t + b_if + W_hf h_(t-1) + b_hf)
g_t = tanh(W_ig x_t + b_ig + W_hg h_(t-1) + b_hg)
o_t = σ(W_io x_t + b_io + W_ho h_(t-1) + b_ho)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
where i_t, f_t, and o_t are the outputs of the input gate, the forget gate, and the output gate, respectively; x_t and h_t denote the input feature and the hidden activation at time t; and g_t and c_t denote the block input and the memory cell, respectively. σ denotes the sigmoid function, σ(x) = 1/(1 + e^(-x)); tanh denotes the hyperbolic tangent, tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)); and the symbol ⊙ denotes element-wise multiplication. The input and forget gates are computed from the previous hidden activation and the current input, and the memory cell is updated in a context-dependent way according to the input and forget gates.
When training for speech denoising, the LSTM saves the relevant context for mask prediction at the current time.
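The gate equations above can be sketched directly in code. This is an illustrative single-step implementation; the two bias terms per gate in the equations are merged into one here, and all names are chosen for the sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step following the gate equations above; W, U, b
    are dicts of input weights, recurrent weights, and biases for the
    four gates i, f, g, o."""
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])  # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])  # forget gate
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])  # block input
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])  # output gate
    c = f * c_prev + i * g                              # memory cell update
    h = o * np.tanh(c)                                  # hidden activation
    return h, c

# All-zero parameters make every gate output exactly sigmoid(0) = 0.5.
gates = ['i', 'f', 'g', 'o']
W = {k: np.zeros((2, 3)) for k in gates}
U = {k: np.zeros((2, 2)) for k in gates}
b = {k: np.zeros(2) for k in gates}
h, c = lstm_step(np.ones(3), np.zeros(2), np.ones(2), W, U, b)
```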
A convolutional recurrent neural network (CRN) is obtained by combining a convolutional neural network with an LSTM recurrent neural network. The CRN combines the characteristics of CNN and LSTM, so it denoises speech efficiently while its neural network parameter count, and hence its model size, is greatly reduced.
And step S122, training a pre-collected voice training set through a convolution recurrent neural network, and constructing a convolution recurrent neural network model.
After combining the convolutional neural network with the LSTM recurrent neural network to obtain the convolutional recurrent neural network, a pre-collected speech training set is trained through it, adjusting the network parameters so that the output approaches the IRM (Ideal Ratio Mask), thereby constructing the convolutional recurrent neural network model.
Fig. 4 is a flow diagram illustrating single-channel real-time noise reduction according to an example embodiment. As shown in FIG. 4, the input is a sound signal and the output is a noise-reduced speech signal; dashed arrows represent steps used only during training, dash-dotted arrows represent steps of the prediction phase, and solid arrows represent steps shared by training and prediction. As a supervised learning approach, the present invention uses the Ideal Ratio Mask (IRM) as the training target. The IRM is obtained by comparing the STFT of a noisy speech signal with that of the corresponding clean speech signal. During the training phase, the network estimates a ratio mask for each noisy speech input, and the mean-squared error (MSE) between the ideal and estimated ratio masks is computed. Repeated training rounds minimize the MSE over the entire training set, with each training sample used once per round. After the training phase ends, the prediction phase begins: the trained convolutional recurrent neural network model directly denoises the input sound signal. Specifically, the model processes the input sound signal and calculates a ratio mask, the mask is applied to the input, and the result is finally resynthesized into the noise-reduced sound signal.
The output of the top deconvolution layer is passed through a sigmoid function (see Fig. 4) to obtain the predicted ratio mask, which is then compared with the IRM; this comparison yields the MSE error used to adjust the CRN weights.
Optionally, the convolutional neural network has a convolutional encoder-decoder structure, comprising a convolutional encoder and a corresponding decoder. The encoder includes a set of convolutional and pooling layers for extracting high-level features from the input; the decoder mirrors the encoder in reverse order and maps the low-resolution features at the encoder output back to a feature map of the full input size. The number of kernels remains symmetric: it gradually increases in the encoder and gradually decreases in the decoder. This symmetric encoder-decoder architecture ensures that the output has the same shape as the input. To improve information and gradient flow through the network, skip connections link the output of each encoder layer to the input of the corresponding decoder layer. Causal convolution and causal deconvolution, in which the output is independent of future inputs, are applied in the encoder and decoder respectively, making the system causal and suitable for real-time processing.
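The causality property can be illustrated with a minimal 1-D example. This is a sketch of the principle only, not the patent's layers (which are 2-D convolutions over time-frequency features): left-padding by the kernel length minus one ensures the output at time t depends only on inputs up to t.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output at time t uses only x[0..t].

    Left-padding with (len(kernel) - 1) zeros prevents any future
    sample from leaking into the current output.
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(padded[t:t + k], kernel[::-1])
                     for t in range(len(x))])
```

Changing a future input sample leaves all earlier outputs unchanged, which is exactly what real-time processing requires.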
Optionally, the recurrent neural network with long and short term memory comprises two stacked long short-term memory layers. The convolutional recurrent neural network is constructed by incorporating these two stacked layers between the encoder and decoder of the convolutional neural network, enabling it to model the temporal dynamics of the sound signal.
Optionally, to further reduce the size of the CRN model, the model is compressed: for example, each convolutional or pooling layer in the CRN includes at most 16 kernels, and each long short-term memory layer of the recurrent neural network includes 64 neurons. After compression, the short-time objective intelligibility (STOI) score of the CRN model is only slightly lower than that of the fully trained uncompressed model: the reduction is about 4-5 percentage points at an input SNR of -5 dB, and even smaller at higher input SNRs. Fig. 5 shows the predicted spectral magnitude of the uncompressed CRN model, and fig. 6 shows the predicted spectral magnitude after compression.
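A rough parameter-count comparison shows why capping layer widths shrinks the model so much. The 64-unit compressed LSTM size is from the text; the 256-unit baseline below is an assumed size for illustration, not a figure given in the patent:

```python
def lstm_params(input_size, hidden_size):
    """Parameter count of one LSTM layer: four gates, each with input
    weights, recurrent weights, and a bias vector."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + hidden_size)

# 64 neurons per layer is the compressed size stated in the text;
# 256 is an assumed uncompressed baseline for comparison only.
compressed = lstm_params(64, 64)      # 33,024 parameters
uncompressed = lstm_params(256, 256)  # 525,312 parameters, roughly 16x more
print(compressed, uncompressed)
```

Because LSTM parameters grow quadratically with the hidden size, a 4x width reduction cuts the layer's parameters by roughly 16x, which dominates the storage savings.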
In summary, compared with the unprocessed mixed speech, the compressed CRN model still greatly improves the STOI score; compared with the uncompressed CRN model, it greatly reduces the model size and the required data storage without significantly affecting noise reduction performance.
With this method, combining the convolutional neural network and the recurrent neural network with long and short term memory into a convolutional recurrent neural network gives the constructed model the small parameter count characteristic of convolutional neural networks, greatly improving the real-time performance of single-channel speech noise reduction while maintaining good noise reduction performance.
Optionally, to further improve the generalization ability of the solution, that is, to achieve noise reduction that is not limited to a specific noise environment or a specific speaker, the speech training set of this exemplary embodiment is formed by mixing a large amount of background noise collected in daily environments with various types of male and female voices at specified signal-to-noise ratios (SNRs).
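Mixing speech and noise at a specified SNR can be sketched as follows. The patent does not describe its exact mixing procedure; this is the standard power-scaling approach, and `mix_at_snr` is an illustrative name:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals
    snr_db, then add it to the speech."""
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(noise ** 2)
    target_noise_pow = speech_pow / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_pow / (noise_pow + 1e-12))
    return speech + scaled_noise, scaled_noise
```

Repeating this for many speakers, noise types, and SNR values yields the kind of varied training pairs the text describes.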
In addition, the speech training set contains a large amount of noise (about 1000 hours) and numerous speech segments, and the whole training set spans several hundred hours, ensuring that the CRN model is adequately trained.
Fig. 7 is a diagram comparing the STOI scores of the LSTM model, the CRN model, and unprocessed speech in a multi-talker conversation noise scenario according to an exemplary embodiment, and fig. 8 shows the same comparison in a cafeteria noise scenario. The CRN model used in this example has five convolutional layers in the encoder, five deconvolution layers in the decoder, and two LSTM layers between the encoder and decoder. As shown in figs. 7 and 8, the STOI score of the proposed CRN-based method is greatly improved relative to the unprocessed noisy sound signal: by about 20 percentage points at an SNR of -5 dB and by about 10 percentage points at an SNR of 5 dB. Figs. 7 and 8 also show that the method consistently outperforms the LSTM-based RNN method, with the STOI improvement more pronounced at lower SNRs.
To further illustrate the noise reduction results, fig. 9 shows the spectrogram of an unprocessed sound signal in a multi-talker noise scenario at -5 dB SNR, fig. 10 shows the spectrogram of the corresponding clean utterance, and fig. 11 shows the spectrogram after noise reduction with the CRN model. Figs. 9, 10, and 11 show that the noise-reduced speech signal is much closer to the clean speech than the noisy multi-talker signal.
Single-channel noise reduction in the present invention refers to processing the signal collected by a single microphone; compared with noise reduction methods based on beamforming microphone arrays, it has wider applicability. The invention performs speech noise reduction by supervised learning, predicting the ratio mask of a sound signal with a convolutional recurrent neural network model. The speech training set used by the model contains a large amount of noise (about 1000 hours) and numerous speech segments, with the whole set spanning several hundred hours, ensuring that the CRN model is fully trained; moreover, single-channel real-time noise reduction is achieved without depending on future time frames. The convolutional recurrent neural network is obtained by combining a convolutional neural network with a recurrent neural network with long and short term memory, and the model is constructed by training it on the pre-collected speech training set. The model greatly reduces the number of neural network parameters, reduces the data storage requirement, and greatly improves the real-time performance of single-channel speech noise reduction while achieving good noise reduction performance.
The following is an embodiment of the apparatus of the present disclosure, which may be used to implement the above embodiment of the single-channel real-time noise reduction method based on the convolutional recurrent neural network. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the single-channel real-time noise reduction method based on the convolutional recurrent neural network of the present disclosure.
Fig. 12 is a block diagram illustrating a single channel real-time noise reduction apparatus according to an exemplary embodiment, including but not limited to: an acoustic feature extraction module 110, a ratio film calculation module 120, a masking module 130, and a speech synthesis module 140.
An acoustic feature extraction module 110, configured to extract acoustic features from the received single-channel sound signal;
a ratio mask calculation module 120, configured to perform iterative operations on the acoustic features in a pre-constructed convolutional recurrent neural network model and calculate the ratio mask of the acoustic features;
a masking module 130, configured to mask the acoustic features with the ratio mask;
and a speech synthesis module 140, configured to synthesize the masked acoustic features with the phase of the single-channel sound signal to obtain a noise-reduced speech signal.
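The masking and synthesis steps performed by modules 130 and 140 can be sketched as follows. Reusing the noisy signal's phase is the standard masking-based enhancement approach the text describes; `apply_mask_and_resynthesize` is an illustrative name:

```python
import numpy as np

def apply_mask_and_resynthesize(noisy_stft, mask):
    """Apply the estimated ratio mask to the noisy magnitude and
    recombine with the noisy phase; the result would be passed to an
    inverse STFT to obtain the noise-reduced waveform."""
    magnitude = np.abs(noisy_stft)
    phase = np.angle(noisy_stft)
    return (mask * magnitude) * np.exp(1j * phase)
```

A mask of all ones returns the input unchanged; values below one attenuate the corresponding time-frequency units while leaving the phase intact.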
The implementation processes of the functions and actions of each module in the device are specifically described in the implementation processes of the corresponding steps in the single-channel real-time noise reduction method based on the convolutional recurrent neural network, and are not described herein again.
Optionally, as shown in fig. 13, the acoustic feature extraction module 110 shown in fig. 12 includes but is not limited to: a temporal frame dividing unit 111, a spectral magnitude vector extracting unit 112, and an acoustic feature forming unit 113.
A time frame dividing unit 111, configured to divide the received single-channel sound signal into time frames according to a preset time period;
a spectral magnitude vector extraction unit 112 for extracting a spectral magnitude vector from the time frame;
an acoustic feature forming unit 113, configured to perform normalization processing on the spectral magnitude vector to form an acoustic feature.
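The three units above (time-frame division, spectral magnitude extraction, normalization) can be sketched together. Frame length, hop size, and the zero-mean/unit-variance normalization are assumptions for illustration; the patent only specifies that frames are divided by a preset time period and that the magnitude vectors are normalized:

```python
import numpy as np

def extract_features(signal, frame_len=320, hop=160):
    """Divide the signal into overlapping time frames, take the FFT
    magnitude of each frame, and normalize each magnitude vector
    (zero mean, unit variance assumed here)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    feats = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        mag = np.abs(np.fft.rfft(frame))
        feats.append((mag - mag.mean()) / (mag.std() + 1e-12))
    return np.stack(feats)
```

For a 320-sample frame, `rfft` yields 161 magnitude bins per frame, so the output is a (frames x 161) feature matrix.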
Optionally, as shown in fig. 14, the ratio mask calculation module 120 in fig. 12 includes, but is not limited to: a network combining unit 121 and a network model building unit 122.
A network combining unit 121, configured to combine the convolutional neural network and a recurrent neural network with long-term and short-term memory to obtain a convolutional recurrent neural network;
a network model constructing unit 122, configured to train a pre-collected speech training set through the convolutional recurrent neural network, and construct the convolutional recurrent neural network model.
Optionally, the present invention further provides an electronic device, which performs all or part of the steps of the single-channel real-time noise reduction method based on the convolutional recurrent neural network according to any of the above exemplary embodiments. The electronic device includes:
a processor; and
a memory communicatively coupled to the processor; wherein
the memory stores readable instructions which, when executed by the processor, implement the method of any of the above exemplary embodiments.
The specific manner in which the processor in the terminal performs the operation in this embodiment has been described in detail in the embodiment of the single-channel real-time noise reduction method based on the convolutional recurrent neural network, and will not be described in detail here.
In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, for example a temporary or non-temporary computer-readable storage medium, including instructions.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A single-channel real-time noise reduction method based on a convolution recurrent neural network is characterized by comprising the following steps:
extracting acoustic features from the received single-channel sound signal;
performing an iterative operation on the acoustic features in a pre-trained convolutional recurrent neural network model, and calculating a ratio mask of the acoustic features;
masking the acoustic features with the ratio mask;
synthesizing the masked acoustic features with the phase of the single-channel sound signal to obtain a voice signal;
wherein the step of calculating the ratio mask of the acoustic features includes:
combining the convolutional neural network with a recurrent neural network with long and short term memory to obtain a convolutional recurrent neural network;
training a pre-collected voice training set through the convolution recurrent neural network to construct a convolution recurrent neural network model;
the step of combining the convolutional neural network with the recurrent neural network with long-term and short-term memory to obtain the convolutional recurrent neural network comprises the following steps:
incorporating two stacked long short-term memory layers between the encoder and the decoder of the convolutional neural network, thereby constructing the convolutional recurrent neural network.
2. The method of claim 1, wherein the step of extracting the acoustic features from the received single channel sound signal comprises:
dividing the received single-channel sound signal into time frames according to a preset time period;
extracting a spectral magnitude vector from the time frame;
and carrying out normalization processing on the frequency spectrum amplitude vector to form acoustic features.
3. The method of claim 2, wherein the step of normalizing the spectral magnitude vector to form the acoustic features comprises:
combining and normalizing the spectral magnitude vectors of the current time frame and past time frames to form the acoustic features.
4. The method of claim 2, wherein the step of normalizing the spectral magnitude vector to form the acoustic features comprises:
combining and normalizing the spectral magnitude vectors of the current time frame, past time frames, and future time frames to form the acoustic features.
5. The method of claim 1, wherein the convolutional neural network is a convolutional encoder-decoder structure, the encoder comprises a set of convolutional layers and pooling layers, the decoder structure is the same as the encoder in reverse order, and the output of the encoder is connected to the input of the decoder.
6. The method of claim 1, wherein the recurrent neural network with long-short term memory comprises two stacked layers of long-short term memory.
7. The method of claim 1, wherein each convolutional or pooling layer in the convolutional neural network comprises a maximum of 16 kernels, and each long-short term memory layer of the recurrent neural network with long-short term memory comprises 64 neurons.
8. The method of claim 1, wherein the speech training set is composed of a combination of background noise collected in a daily environment, various types of male and female voices, and speech signals mixed with a specific signal-to-noise ratio.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A computer readable storage medium storing a program, wherein the program, when executed, causes an electronic device to perform the method of any of claims 1-8.
CN201811010712.6A 2018-08-31 2018-08-31 Single-channel real-time noise reduction method based on convolution recurrent neural network Active CN109841226B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160111108A1 (en) * 2014-10-21 2016-04-21 Mitsubishi Electric Research Laboratories, Inc. Method for Enhancing Audio Signal using Phase Information
CN106328122A (en) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 Voice identification method using long-short term memory model recurrent neural network
US20170061978A1 (en) * 2014-11-07 2017-03-02 Shannon Campbell Real-time method for implementing deep neural network based speech separation
CN106875007A (en) * 2017-01-25 2017-06-20 上海交通大学 End-to-end deep neural network is remembered based on convolution shot and long term for voice fraud detection
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method




Also Published As

Publication number Publication date
WO2020042707A1 (en) 2020-03-05
CN109841226A (en) 2019-06-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40008140

Country of ref document: HK

GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 533, podium building 12, Shenzhen Bay science and technology ecological park, No.18, South Keji Road, high tech community, Yuehai street, Nanshan District, Shenzhen, Guangdong 518000

Patentee after: ELEVOC TECHNOLOGY Co.,Ltd.

Address before: 2206, phase I, International Students Pioneer Building, 29 Gaoxin South Ring Road, Yuehai street, Nanshan District, Shenzhen, Guangdong 518000

Patentee before: ELEVOC TECHNOLOGY Co.,Ltd.
