CN109841226B - Single-channel real-time noise reduction method based on convolution recurrent neural network - Google Patents

Info

Publication number: CN109841226B (granted); earlier published as application CN109841226A
Application number: CN201811010712.6A
Authority: CN (China)
Legal status: Active
Inventor: not disclosed (不公告发明人)
Assignee: Elevoc Technology Co., Ltd.
Original language: Chinese (zh)
Related PCT application: PCT/CN2019/090530 (WO2020042707A1)
Keywords: neural network, recurrent neural network, acoustic features, convolutional, noise reduction

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis characterised by the analysis technique
    • G10L25/30: Speech or voice analysis characterised by the analysis technique using neural networks


Abstract

The disclosure describes a single-channel real-time noise reduction method and apparatus based on a convolutional recurrent neural network, together with an electronic device and a storage medium, belonging to the field of computer technology. The method comprises the following steps: extracting acoustic features from a received single-channel sound signal; performing iterative operations on the acoustic features in a pre-trained convolutional recurrent neural network model to calculate a ratio mask for the acoustic features; masking the acoustic features with the ratio mask; and synthesizing the masked acoustic features with the phase of the single-channel sound signal to obtain a speech signal. The method and apparatus reduce the number of neural network parameters, the data storage requirements, and the demands on system data bandwidth, greatly improving the real-time performance of single-channel speech noise reduction while achieving good noise reduction quality.

Description

Single-channel real-time noise reduction method based on convolution recurrent neural network
Technical Field
The present disclosure relates to the field of computer application technologies, and in particular, to a single-channel real-time noise reduction method and apparatus based on a convolutional recurrent neural network, an electronic device, and a storage medium.
Background
Voice noise reduction refers to separating a target speech signal from background noise in order to eliminate or suppress that noise. Single-channel speech is a signal recorded by a single microphone, and single-channel speech noise reduction can be applied to a wider range of acoustic scenes than beamforming-based noise reduction (i.e., spatial filtering through an appropriately configured microphone array). Single-channel noise reduction is not only cost-effective but also easier to deploy in practice. Furthermore, single-channel speech separation can be used to enhance beamforming and associated microphone-array techniques.
Single-channel speech noise reduction is particularly difficult because, unlike with a microphone array, no spatial information is available as a reference. Recently, single-channel noise reduction has been treated as supervised learning, recasting the signal processing problem as a supervised learning task. Traditional signal processing methods, typified by classical speech enhancement, rest on general statistical assumptions about background noise and speech, whereas supervised learning methods are data driven and learn automatically from specific training samples. The introduction of supervised learning has thus marked a leap for single-channel speech noise reduction technology. However, conventional supervised single-channel noise reduction methods use many network parameters and complex models, which hurts both the real-time performance and the noise reduction quality.
Disclosure of Invention
In order to solve the technical problems in the related art that single-channel speech noise reduction requires many network parameters and a complex model, the present disclosure provides a single-channel real-time noise reduction method, apparatus, and terminal based on a convolutional recurrent neural network.
In a first aspect, a single-channel real-time noise reduction method based on a convolutional recurrent neural network is provided, which includes:
extracting acoustic features from the received single-channel sound signal;
performing iterative operations on the acoustic features in a pre-trained convolutional recurrent neural network model, and calculating a ratio mask for the acoustic features;
masking the acoustic features with the ratio mask;
and synthesizing the masked acoustic features with the phase of the single-channel sound signal to obtain a noise-reduced sound signal.
Optionally, the step of extracting the acoustic features from the received single-channel sound signal includes:
dividing the received single-channel sound signal into time frames according to a preset time period;
extracting a spectral magnitude vector from the time frame;
and normalizing the spectral magnitude vectors to form acoustic features.
Optionally, the step of performing normalization processing on the spectral magnitude vector to form an acoustic feature includes:
and combining the frequency spectrum amplitude vectors of the current time frame and the past time frame for normalization processing to form the acoustic features.
Optionally, the step of performing normalization processing on the spectral magnitude vector to form an acoustic feature includes:
and combining the spectral amplitude vectors of the current time frame, the past time frame and the future time frame for normalization processing to form the acoustic features.
Optionally, before the step of performing iterative operations on the acoustic features in a pre-trained convolutional recurrent neural network model and calculating a ratio mask for the acoustic features, the method further includes:
combining a convolutional neural network with a long short-term memory (LSTM) recurrent neural network to obtain a convolutional recurrent neural network;
and training a pre-collected voice training set through the convolution recurrent neural network to construct the convolution recurrent neural network model.
Optionally, the convolutional neural network is a convolutional encoder-decoder structure, the encoder includes a set of convolutional layers and pooling layers, the decoder structure is the same as the encoder in reverse order, and the output of the encoder is connected to the input of the decoder.
Optionally, the recurrent neural network with long-short term memory comprises two stacked long-short term memory layers.
Optionally, the step of combining the convolutional neural network with a recurrent neural network with long-term and short-term memory to obtain a convolutional recurrent neural network includes:
incorporating two stacked long short-term memory layers between the encoder and the decoder of the convolutional neural network to construct the convolutional recurrent neural network.
Optionally, each convolutional or pooling layer in the convolutional neural network comprises a maximum of 16 kernels, and each long-short term memory layer of the recurrent neural network with long-short term memory comprises 64 neurons.
Optionally, the speech training set is formed by mixing background noise collected in everyday environments with a variety of male and female voices at specified signal-to-noise ratios.
In a second aspect, a single-channel real-time noise reduction apparatus is provided, including:
the acoustic feature extraction module is used for extracting acoustic features from the received single-channel sound signal;
the ratio mask calculation module is used for performing iterative operations on the acoustic features in a pre-constructed convolutional recurrent neural network model and calculating the ratio mask for the acoustic features;
a masking module for masking the acoustic features with the ratio mask;
and the speech synthesis module is used for synthesizing the masked acoustic features with the phase of the single-channel sound signal to obtain a noise-reduced speech signal.
Optionally, an ideal ratio mask is used as a training target of the convolutional recurrent neural network.
In a third aspect, an electronic device is provided, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
In a fourth aspect, there is provided a computer readable storage medium storing a program which, when executed, causes an electronic device to perform the method of the first aspect.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
When performing single-channel real-time noise reduction, acoustic features are extracted from the received single-channel sound signal; iterative operations in a pre-trained convolutional recurrent neural network model calculate a ratio mask for the acoustic features, the mask is applied to the features, and the masked acoustic features are then synthesized with the phase of the single-channel sound signal to obtain the speech signal.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the scope of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a flow diagram illustrating a single-channel real-time noise reduction method based on a convolutional recurrent neural network according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating an implementation of step S110 in the single-channel real-time noise reduction method based on the convolutional recurrent neural network according to the corresponding embodiment of fig. 1.
Fig. 3 is a flowchart illustrating an implementation of step S120 in the single-channel real-time noise reduction method based on the convolutional recurrent neural network in fig. 1.
Fig. 4 is a flow diagram illustrating single-channel real-time noise reduction according to an example embodiment.
Fig. 5 is a graphical representation of the predicted spectral magnitudes when the CRN model is not compressed.
FIG. 6 is a diagram illustrating the predicted spectral magnitudes after compression of a CRN model.
FIG. 7 is a diagram comparing STOI for LSTM-model training, CRN-model training, and the unprocessed condition in a multi-talker babble noise scenario, according to an exemplary embodiment.
FIG. 8 is a graph comparing STOI for LSTM-model training, CRN-model training, and the unprocessed condition in a coffee-shop noise scenario, according to an exemplary embodiment.
Fig. 9 is a spectrogram of an unprocessed (noisy) sound signal in a multi-talker babble noise scene at -5 dB SNR (signal-to-noise ratio), according to an example embodiment.
FIG. 10 is a clean speech spectral diagram corresponding to FIG. 9, according to an example embodiment.
FIG. 11 is a graph of a spectrum output after noise reduction using a CRN model according to an example embodiment.
Fig. 12 is a block diagram illustrating a single channel real-time noise reduction apparatus according to an exemplary embodiment.
Fig. 13 is a block diagram of the acoustic feature extraction module 110 in the single-channel real-time noise reduction apparatus according to the corresponding embodiment of fig. 12.
FIG. 14 is a block diagram of the ratio mask calculation module 120 shown in the corresponding embodiment of FIG. 12.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Fig. 1 is a flow diagram illustrating a single-channel real-time noise reduction method based on a convolutional recurrent neural network according to an exemplary embodiment. The single-channel real-time noise reduction method based on the convolutional recurrent neural network can be used for electronic equipment such as smart phones and computers. As shown in fig. 1, the single-channel real-time noise reduction method based on the convolutional recurrent neural network may include step S110, step S120, step S130, and step S140.
Step S110, extracting acoustic features from the received single-channel sound signal.
The single channel sound signal is a signal to be subjected to real-time noise reduction processing.
Typically, a single channel sound signal contains speech and non-human interference noise.
When performing real-time single-channel noise reduction, the electronic device may receive the single-channel sound signal from a recording device such as a microphone, from another electronic device, or through other means not enumerated here.
The acoustic features are data features that can characterize a single channel sound signal.
When extracting acoustic features from a received single-channel sound signal, a short-time Fourier transform (STFT), a wavelet transform, or another method may be used to extract the acoustic features from the signal.
Alternatively, as shown in fig. 2, step S110 may include step S111, step S112, and step S113.
Step S111, dividing the received single channel sound signal into time frames according to a preset time period.
The preset time period is a preset time interval period, and the single-channel sound signal is divided into a plurality of time frames according to the preset time period.
In a specific exemplary embodiment, the received single-channel sound signal is divided into time frames by 20 milliseconds per frame, with an overlap of 10 milliseconds between every two adjacent time frames.
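The framing just described can be sketched as follows. The 16 kHz sample rate (making 320 samples = 20 ms and a 160-sample hop the 10 ms overlap between adjacent frames) is an assumption for illustration; the patent does not state a sample rate.

```python
import numpy as np

def frame_signal(x, frame_len=320, hop=160):
    """Split a 1-D signal into overlapping time frames: 320-sample
    frames with a 160-sample hop give 20 ms frames overlapping by
    10 ms at an assumed 16 kHz sample rate."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

x = np.arange(16000, dtype=np.float64)  # 1 second of dummy samples
frames = frame_signal(x)
```

Each frame then feeds the STFT of step S112; adjacent frames share their second/first halves, which is what the 10 ms overlap means in samples.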
Step S112, a spectral magnitude vector is extracted from the time frame.
Step S113, normalizing the spectral magnitude vectors to form acoustic features.
In an exemplary embodiment, an STFT is applied to each time frame to extract spectral magnitude vectors, each of which, after normalization, forms acoustic features.
Optionally, since speech signals carry strong temporal context, context information is integrated by concatenating several consecutive frames centered on the current time frame into a larger vector, further improving single-channel speech noise reduction performance.
For example, when the spectral magnitude vectors are normalized, the vectors of the current time frame and past time frames are combined and normalized to form the acoustic features.
Future time frames cannot be used when a high level of real-time performance is required, as in noise reduction applications such as mobile communications and hearing aids. For such real-time applications, the spectral magnitude vectors of the current time frame and past time frames are merged during normalization. Specifically, the previous 5 frames and the current time frame are spliced into a unified feature vector as the input of the present invention. Fewer than 5 past frames may also be used, further saving computation time and improving real-time performance at the cost of some noise reduction quality.
For another example, when the spectral magnitude vector is normalized, the spectral magnitude vectors of the current time frame, the past time frame, and the future time frame are combined and normalized to form the acoustic feature.
For applications that do not require real-time processing, such as Automatic Speech Recognition (ASR), future time frames may be used as input. Specifically, splicing 7 time frames (the current time frame, 1 future time frame, and 5 past time frames) into a unified feature vector as the input of the present invention improves STOI (Short-Time Objective Intelligibility) by about one percentage point compared with the configuration without a future time frame. STOI is an important indicator for evaluating speech noise reduction performance; its typical values range between 0 and 1 and can be interpreted as the percentage of intelligible speech.
Therefore, when extracting acoustic features from a single-channel sound signal, the signal is first divided into time frames of a preset period; an appropriate period provides the input for the subsequent noise reduction processing based on the features extracted from each frame, and selectively combining the current frame's spectral magnitude vector with those of past (and, where permissible, future) frames further improves noise reduction performance.
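The two splicing configurations described above (real-time: 5 past frames plus the current frame; offline: additionally 1 future frame) can be sketched as follows. The edge-padding choice for the first/last frames and the 161-bin feature size are illustrative assumptions, not details the patent specifies.

```python
import numpy as np

def splice_frames(mags, past=5, future=0):
    """Concatenate each frame's spectral magnitude vector with its
    temporal context. past=5, future=0 matches the real-time setup;
    past=5, future=1 matches the 7-frame offline (ASR) setup.
    Edge frames are padded by repeating the first/last frame."""
    padded = np.pad(mags, ((past, future), (0, 0)), mode='edge')
    n = len(mags)
    # Block i holds frame t - past + i; the last block is the current frame.
    return np.hstack([padded[i:i + n] for i in range(past + future + 1)])

mags = np.random.rand(100, 161)              # 100 frames, 161 frequency bins
rt = splice_frames(mags, past=5, future=0)   # real-time: 6 frames spliced
off = splice_frames(mags, past=5, future=1)  # offline: 7 frames spliced
```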
Step S120, performing iterative operations on the acoustic features in a pre-trained convolutional recurrent neural network model, and calculating a ratio mask for the acoustic features.
The ratio mask characterizes the relationship between the noisy speech signal and the clean speech signal, expressing a trade-off between suppressing noise and retaining speech.
Ideally, after the noisy speech signal is masked by the ratio mask, the clean speech signal can be recovered from the noisy speech.
The convolutional recurrent neural network model is trained in advance.
The acoustic features obtained in step S110 are used as the input of the convolutional recurrent neural network model; iterative operations are performed in the model, and a ratio mask is calculated for the acoustic features.
In this step, an IRM (Ideal Ratio Mask) is the target of the iterative operation. The IRM for each time-frequency cell in a spectrogram can be expressed by the following equation:

IRM(t, f) = sqrt( S_FFT(t, f)^2 / ( S_FFT(t, f)^2 + N_FFT(t, f)^2 ) )

where S_FFT(t, f) and N_FFT(t, f) denote the clean-speech magnitude spectrum and the noise magnitude spectrum at time frame t and frequency bin f, respectively.
The ideal ratio mask is predicted during supervised training; the predicted mask is then applied to the acoustic features to obtain the noise-reduced speech signal.
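A minimal sketch of the IRM computation, assuming the standard square-root form of the ideal ratio mask over the clean and noise magnitude spectra (the small epsilon guarding empty time-frequency cells is an implementation choice, not from the patent):

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    """IRM(t, f) = sqrt(S^2 / (S^2 + N^2)) per time-frequency cell,
    computed from the clean-speech and noise magnitude spectra."""
    s2 = clean_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))

m = ideal_ratio_mask(np.array([1.0, 3.0]), np.array([1.0, 4.0]))
```

With equal speech and noise energy the mask is 1/sqrt(2); as noise vanishes it approaches 1, so masking leaves speech-dominated cells nearly untouched.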
Step S130, masking the acoustic features with the ratio mask.
And step S140, synthesizing the masked acoustic features and the phase of the single-channel sound signal to obtain a noise-reduction voice signal.
After training is complete, the trained CRN (Convolutional Recurrent Network) can be used for speech noise reduction. Using a trained neural network for a particular application is referred to as inference. In the inference phase, the layers of the CRN model process the noisy signal; the T-F (time-frequency) mask is derived from the inference results and is then used to weight (mask) the noisy speech magnitudes, producing an enhanced speech signal that is clearer than the original noisy input.
In an exemplary embodiment, the masked spectral magnitude vector is sent to an inverse Fourier transform along with the phase of the single channel sound signal to derive the speech signal in the corresponding time domain.
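The masking-and-resynthesis step can be sketched as an overlap-add round trip. The rectangular (window-free) analysis, the 16 kHz-based frame sizes, and the identity-mask check are illustrative assumptions, not the patent's exact implementation:

```python
import numpy as np

FRAME, HOP = 320, 160  # 20 ms frames, 10 ms hop, assuming a 16 kHz rate

def analyze(x):
    """Rectangular-window STFT of the framed signal (no analysis
    window, kept minimal for illustration)."""
    n = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[t * HOP:t * HOP + FRAME] for t in range(n)])
    return np.fft.rfft(frames, axis=1)

def synthesize(noisy_spec, mask):
    """Apply the ratio mask to the noisy magnitudes, keep the noisy
    phase, inverse-FFT each frame, and overlap-add; interior samples
    are covered by exactly two frames and are therefore halved."""
    masked = mask * np.abs(noisy_spec) * np.exp(1j * np.angle(noisy_spec))
    out = np.zeros(HOP * (len(masked) - 1) + FRAME)
    for t, spec in enumerate(masked):
        out[t * HOP:t * HOP + FRAME] += np.fft.irfft(spec, n=FRAME)
    out[HOP:-HOP] /= 2.0
    return out

x = np.sin(2 * np.pi * 440 * np.arange(3200) / 16000.0)
spec = analyze(x)
y = synthesize(spec, np.ones(spec.shape))  # identity mask: passthrough
```

With an all-ones mask the pipeline reconstructs the input exactly, which is a convenient sanity check before plugging in a predicted mask.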
With this method, during single-channel real-time noise reduction, acoustic features are extracted from the received single-channel sound signal, iterative operations in a pre-trained convolutional recurrent neural network model compute a ratio mask for the acoustic features, and the mask is then applied to the features. The masked acoustic features are synthesized with the phase of the single-channel sound signal to obtain the speech signal. Because a pre-trained convolutional recurrent neural network model is used, the number of neural network parameters is greatly reduced, lowering data storage and system bandwidth requirements, so good noise reduction performance is achieved while the real-time performance of single-channel speech noise reduction is greatly improved.
Fig. 3 is a flowchart illustrating a specific implementation of step S120 in the single-channel real-time noise reduction method based on the convolutional recurrent neural network according to the corresponding embodiment of fig. 1. As shown in fig. 3, the single-channel real-time noise reduction method based on the convolutional recurrent neural network may include steps S121 and S122.
And step S121, combining the convolutional neural network and the recurrent neural network with long-term and short-term memory to obtain the convolutional recurrent neural network.
A CNN (Convolutional Neural Network) is an efficient recognition method that has attracted wide attention in recent years. In the 1960s, Hubel and Wiesel, studying neurons responsible for local sensitivity and orientation selection in the cat's visual cortex, discovered a distinctive network structure that effectively reduces the complexity of feedback neural networks; network architectures similar to convolutional neural networks were subsequently proposed. CNNs have become a research hotspot in many scientific fields, especially pattern classification, and are widely applied because they avoid complex image preprocessing and can take raw images directly as input. The neocognitron proposed by Fukushima in 1980 was the first network to realize the convolutional neural network idea; among later designs, the version proposed by LeCun et al. is representative.
The convolution operation in a CNN is defined on a two-dimensional structure with local receptive fields, where each low-level feature is connected only to a subset of the input, such as a topological neighborhood. These local connectivity constraints make the weight matrix very sparse, so two layers connected by a convolution are only locally connected. Multiplying such sparse matrices is cheaper and more convenient than dense matrix multiplication, and the smaller number of free parameters is also statistically beneficial. In an image with two-dimensional topology, the same input pattern can appear at different positions, and nearby values tend to depend strongly on one another, which is very important for modeling the data.
A CNN reduces the number of parameters to be learned through weight sharing, lowering model complexity and greatly reducing the number of network weights. Compared with an ordinary feedforward network trained by error back-propagation (BP), training speed and accuracy improve substantially. As a deep learning algorithm, a CNN also minimizes the overhead of data preprocessing.
A Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) (hereinafter simply "LSTM") is a recurrent neural network for temporal data, first published in 1997. Owing to its distinctive design, LSTM is well suited to processing and predicting significant events separated by long and variable intervals in a time series.
LSTM generally outperforms plain recurrent neural networks and Hidden Markov Models (HMMs), for example in unsegmented continuous handwriting recognition; in 2009, an LSTM-based artificial neural network model won the ICDAR handwriting recognition competition. LSTM is also commonly used for automatic speech recognition, setting a record 17.7% error rate on the TIMIT natural speech database in 2013. As a nonlinear model, LSTM can serve as a complex nonlinear unit for constructing larger deep neural networks.
LSTM is a specific type of RNN that can effectively capture long-term dependencies in a sound signal. Compared with a traditional RNN, LSTM alleviates the vanishing and exploding gradient problems that arise during training. An LSTM memory cell has three gates: an input gate, a forget gate, and an output gate. The input gate controls how much of the current information is added to the memory cell, the forget gate controls how much previous information is retained, and the output gate controls whether information is output. Specifically, the LSTM can be described by the following equations.
i_t = σ(W_ii x_t + b_ii + W_hi h_(t-1) + b_hi)
f_t = σ(W_if x_t + b_if + W_hf h_(t-1) + b_hf)
g_t = tanh(W_ig x_t + b_ig + W_hg h_(t-1) + b_hg)
o_t = σ(W_io x_t + b_io + W_ho h_(t-1) + b_ho)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
where i_t, f_t, and o_t are the outputs of the input gate, the forget gate, and the output gate, respectively; x_t and h_t denote the input feature and the hidden activation at time t; and g_t and c_t denote the block input and the memory cell, respectively. σ denotes the sigmoid function, σ(x) = 1/(1 + e^(-x)); tanh denotes the hyperbolic tangent, tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)); and the symbol ⊙ denotes element-wise multiplication. The input and forget gates are computed from the previous hidden activation and the current input, and the memory cell is updated in a context-dependent way according to the input and forget gates.
When training for speech denoising, the LSTM saves the relevant context for mask prediction at the current time.
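The gate equations above can be sketched directly in code. This is an illustrative single-step implementation; the two bias terms per gate in the equations are merged into one here, and all names are chosen for the sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step following the gate equations above; W, U, b
    are dicts of input weights, recurrent weights, and biases for the
    four gates i, f, g, o."""
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])  # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])  # forget gate
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])  # block input
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])  # output gate
    c = f * c_prev + i * g                              # memory cell update
    h = o * np.tanh(c)                                  # hidden activation
    return h, c

# All-zero parameters make every gate output exactly sigmoid(0) = 0.5.
gates = ['i', 'f', 'g', 'o']
W = {k: np.zeros((2, 3)) for k in gates}
U = {k: np.zeros((2, 2)) for k in gates}
b = {k: np.zeros(2) for k in gates}
h, c = lstm_step(np.ones(3), np.zeros(2), np.ones(2), W, U, b)
```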
A convolutional recurrent neural network (CRN) is obtained by combining a convolutional neural network with an LSTM recurrent neural network. The CRN combines the characteristics of CNN and LSTM, so it denoises speech efficiently while its neural network parameter count, and hence its model size, is greatly reduced.
And step S122, training a pre-collected voice training set through a convolution recurrent neural network, and constructing a convolution recurrent neural network model.
After combining the convolutional neural network with the LSTM recurrent neural network to obtain the convolutional recurrent neural network, a pre-collected speech training set is trained through it, adjusting the network parameters so that the output approaches the IRM (Ideal Ratio Mask), thereby constructing the convolutional recurrent neural network model.
Fig. 4 is a flow diagram illustrating single-channel real-time noise reduction according to an example embodiment. As shown in FIG. 4, the input is a sound signal and the output is a noise-reduced speech signal; dashed arrows represent steps used only during training, dash-dotted arrows represent steps of the prediction phase, and solid arrows represent steps shared by training and prediction. As a supervised learning approach, the present invention uses the Ideal Ratio Mask (IRM) as the training target. The IRM is obtained by comparing the STFT of a noisy speech signal with that of the corresponding clean speech signal. During the training phase, the network estimates a ratio mask for each noisy speech input, and the mean-squared error (MSE) between the ideal and estimated ratio masks is computed. Repeated training rounds minimize the MSE over the entire training set, with each training sample used once per round. After the training phase ends, the prediction phase begins: the trained convolutional recurrent neural network model directly denoises the input sound signal. Specifically, the model processes the input sound signal and calculates a ratio mask, the mask is applied to the input, and the result is finally resynthesized into the noise-reduced sound signal.
The output of the top deconvolution layer is passed through a sigmoid function (see Fig. 4) to obtain the predicted ratio mask, which is then compared with the IRM; this comparison yields the MSE error used to adjust the CRN weights.
Optionally, the convolutional neural network has a convolutional encoder-decoder structure, comprising a convolutional encoder and a corresponding decoder. The encoder includes a set of convolutional and pooling layers for extracting high-level features from the input; the decoder mirrors the encoder in reverse order and maps the low-resolution features at the encoder output back to a feature map of the full input size. The number of kernels remains symmetric: it gradually increases in the encoder and gradually decreases in the decoder. This symmetric encoder-decoder architecture ensures that the output has the same shape as the input. To improve information and gradient flow through the network, skip connections link the output of each encoder layer to the input of the corresponding decoder layer. Causal convolution and causal deconvolution, in which the output is independent of future inputs, are applied in the encoder and decoder respectively, making the system causal and suitable for real-time processing.
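The causality property can be illustrated with a minimal 1-D example. This is a sketch of the principle only, not the patent's layers (which are 2-D convolutions over time-frequency features): left-padding by the kernel length minus one ensures the output at time t depends only on inputs up to t.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output at time t uses only x[0..t].

    Left-padding with (len(kernel) - 1) zeros prevents any future
    sample from leaking into the current output.
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(padded[t:t + k], kernel[::-1])
                     for t in range(len(x))])
```

Changing a future input sample leaves all earlier outputs unchanged, which is exactly what real-time processing requires.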
Optionally, the recurrent neural network with long and short term memory comprises two stacked long short-term memory layers. The convolutional recurrent neural network is constructed by incorporating these two stacked layers between the encoder and decoder of the convolutional neural network, enabling it to model the temporal dynamics of the sound signal.
Optionally, to further reduce the size of the CRN model, the model is compressed: for example, each convolutional or pooling layer in the CRN includes at most 16 kernels, and each long short-term memory layer of the recurrent neural network includes 64 neurons. After compression, the short-time objective intelligibility (STOI) score of the CRN model is only slightly lower than that of the fully trained uncompressed model: the reduction is about 4-5 percentage points at an input SNR of -5 dB, and even smaller at higher input SNRs. Fig. 5 shows the predicted spectral magnitude of the uncompressed CRN model, and fig. 6 shows the predicted spectral magnitude after compression.
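A rough parameter-count comparison shows why capping layer widths shrinks the model so much. The 64-unit compressed LSTM size is from the text; the 256-unit baseline below is an assumed size for illustration, not a figure given in the patent:

```python
def lstm_params(input_size, hidden_size):
    """Parameter count of one LSTM layer: four gates, each with input
    weights, recurrent weights, and a bias vector."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + hidden_size)

# 64 neurons per layer is the compressed size stated in the text;
# 256 is an assumed uncompressed baseline for comparison only.
compressed = lstm_params(64, 64)      # 33,024 parameters
uncompressed = lstm_params(256, 256)  # 525,312 parameters, roughly 16x more
print(compressed, uncompressed)
```

Because LSTM parameters grow quadratically with the hidden size, a 4x width reduction cuts the layer's parameters by roughly 16x, which dominates the storage savings.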
In summary, compared with the unprocessed mixed speech, the compressed CRN model still greatly improves the STOI score; compared with the uncompressed CRN model, it greatly reduces the model size and the required data storage without significantly affecting noise reduction performance.
With this method, combining the convolutional neural network and the recurrent neural network with long and short term memory into a convolutional recurrent neural network gives the constructed model the small parameter count characteristic of convolutional neural networks, greatly improving the real-time performance of single-channel speech noise reduction while maintaining good noise reduction performance.
Optionally, to further improve the generalization ability of the solution, that is, to achieve noise reduction that is not limited to a specific noise environment or a specific speaker, the speech training set of this exemplary embodiment is formed by mixing a large amount of background noise collected in daily environments with various types of male and female voices at specified signal-to-noise ratios (SNRs).
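Mixing speech and noise at a specified SNR can be sketched as follows. The patent does not describe its exact mixing procedure; this is the standard power-scaling approach, and `mix_at_snr` is an illustrative name:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals
    snr_db, then add it to the speech."""
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(noise ** 2)
    target_noise_pow = speech_pow / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_pow / (noise_pow + 1e-12))
    return speech + scaled_noise, scaled_noise
```

Repeating this for many speakers, noise types, and SNR values yields the kind of varied training pairs the text describes.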
In addition, the speech training set contains a large amount of noise (about 1000 hours) and numerous speech segments, and the whole training set spans several hundred hours, ensuring that the CRN model is adequately trained.
Fig. 7 is a diagram comparing the STOI scores of the LSTM model, the CRN model, and unprocessed speech in a multi-talker conversation noise scenario according to an exemplary embodiment, and fig. 8 shows the same comparison in a cafeteria noise scenario. The CRN model used in this example has five convolutional layers in the encoder, five deconvolution layers in the decoder, and two LSTM layers between the encoder and decoder. As shown in figs. 7 and 8, the STOI score of the proposed CRN-based method is greatly improved relative to the unprocessed noisy sound signal: by about 20 percentage points at an SNR of -5 dB and by about 10 percentage points at an SNR of 5 dB. Figs. 7 and 8 also show that the method consistently outperforms the LSTM-based RNN method, with the STOI improvement more pronounced at lower SNRs.
To further illustrate the noise reduction results, fig. 9 shows the spectrogram of an unprocessed sound signal in a multi-talker noise scenario at -5 dB SNR, fig. 10 shows the spectrogram of the corresponding clean utterance, and fig. 11 shows the spectrogram after noise reduction with the CRN model. Figs. 9, 10, and 11 show that the noise-reduced speech signal is much closer to the clean speech than the noisy multi-talker signal.
Single-channel noise reduction in the present invention refers to processing the signal collected by a single microphone; compared with noise reduction methods based on beamforming microphone arrays, it has wider applicability. The invention performs speech noise reduction by supervised learning, predicting the ratio mask of a sound signal with a convolutional recurrent neural network model. The speech training set used by the model contains a large amount of noise (about 1000 hours) and numerous speech segments, with the whole set spanning several hundred hours, ensuring that the CRN model is fully trained; moreover, single-channel real-time noise reduction is achieved without depending on future time frames. The convolutional recurrent neural network is obtained by combining a convolutional neural network with a recurrent neural network with long and short term memory, and the model is constructed by training it on the pre-collected speech training set. The model greatly reduces the number of neural network parameters, reduces the data storage requirement, and greatly improves the real-time performance of single-channel speech noise reduction while achieving good noise reduction performance.
The following is an embodiment of the apparatus of the present disclosure, which may be used to implement the above embodiment of the single-channel real-time noise reduction method based on the convolutional recurrent neural network. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the single-channel real-time noise reduction method based on the convolutional recurrent neural network of the present disclosure.
Fig. 12 is a block diagram illustrating a single channel real-time noise reduction apparatus according to an exemplary embodiment, including but not limited to: an acoustic feature extraction module 110, a ratio film calculation module 120, a masking module 130, and a speech synthesis module 140.
An acoustic feature extraction module 110, configured to extract acoustic features from the received single-channel sound signal;
a ratio mask calculation module 120, configured to perform iterative operations on the acoustic features in a pre-constructed convolutional recurrent neural network model and calculate the ratio mask of the acoustic features;
a masking module 130, configured to mask the acoustic features with the ratio mask;
and a speech synthesis module 140, configured to synthesize the masked acoustic features with the phase of the single-channel sound signal to obtain a noise-reduced speech signal.
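The masking and synthesis steps performed by modules 130 and 140 can be sketched as follows. Reusing the noisy signal's phase is the standard masking-based enhancement approach the text describes; `apply_mask_and_resynthesize` is an illustrative name:

```python
import numpy as np

def apply_mask_and_resynthesize(noisy_stft, mask):
    """Apply the estimated ratio mask to the noisy magnitude and
    recombine with the noisy phase; the result would be passed to an
    inverse STFT to obtain the noise-reduced waveform."""
    magnitude = np.abs(noisy_stft)
    phase = np.angle(noisy_stft)
    return (mask * magnitude) * np.exp(1j * phase)
```

A mask of all ones returns the input unchanged; values below one attenuate the corresponding time-frequency units while leaving the phase intact.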
The implementation processes of the functions and actions of each module in the device are specifically described in the implementation processes of the corresponding steps in the single-channel real-time noise reduction method based on the convolutional recurrent neural network, and are not described herein again.
Optionally, as shown in fig. 13, the acoustic feature extraction module 110 shown in fig. 12 includes but is not limited to: a temporal frame dividing unit 111, a spectral magnitude vector extracting unit 112, and an acoustic feature forming unit 113.
A time frame dividing unit 111, configured to divide the received single-channel sound signal into time frames according to a preset time period;
a spectral magnitude vector extraction unit 112 for extracting a spectral magnitude vector from the time frame;
an acoustic feature forming unit 113, configured to perform normalization processing on the spectral magnitude vector to form an acoustic feature.
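The three units above (time-frame division, spectral magnitude extraction, normalization) can be sketched together. Frame length, hop size, and the zero-mean/unit-variance normalization are assumptions for illustration; the patent only specifies that frames are divided by a preset time period and that the magnitude vectors are normalized:

```python
import numpy as np

def extract_features(signal, frame_len=320, hop=160):
    """Divide the signal into overlapping time frames, take the FFT
    magnitude of each frame, and normalize each magnitude vector
    (zero mean, unit variance assumed here)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    feats = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        mag = np.abs(np.fft.rfft(frame))
        feats.append((mag - mag.mean()) / (mag.std() + 1e-12))
    return np.stack(feats)
```

For a 320-sample frame, `rfft` yields 161 magnitude bins per frame, so the output is a (frames x 161) feature matrix.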
Optionally, as shown in fig. 14, the ratio mask calculation module 120 in fig. 12 includes, but is not limited to: a network combining unit 121 and a network model building unit 122.
A network combining unit 121, configured to combine the convolutional neural network and a recurrent neural network with long-term and short-term memory to obtain a convolutional recurrent neural network;
a network model constructing unit 122, configured to train a pre-collected speech training set through the convolutional recurrent neural network, and construct the convolutional recurrent neural network model.
Optionally, the present invention further provides an electronic device, which performs all or part of the steps of the single-channel real-time noise reduction method based on the convolutional recurrent neural network according to any of the above exemplary embodiments. The electronic device includes:
a processor; and
a memory communicatively coupled to the processor; wherein
the memory stores readable instructions which, when executed by the processor, implement the method of any of the above exemplary embodiments.
The specific manner in which the processor in the terminal performs the operation in this embodiment has been described in detail in the embodiment of the single-channel real-time noise reduction method based on the convolutional recurrent neural network, and will not be described in detail here.
In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, for example a temporary or non-temporary computer-readable storage medium, including instructions.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A single-channel real-time noise reduction method based on a convolution recurrent neural network is characterized by comprising the following steps:
extracting acoustic features from the received single-channel sound signal;
performing an iterative operation on the acoustic features in a pre-trained convolutional recurrent neural network model, and calculating a ratio mask of the acoustic features;
masking the acoustic features with the ratio mask;
synthesizing the masked acoustic features with the phase of the single-channel sound signal to obtain a voice signal;
wherein the step of calculating the ratio mask of the acoustic features includes:
combining the convolutional neural network with a recurrent neural network with long and short term memory to obtain a convolutional recurrent neural network;
training a pre-collected voice training set through the convolution recurrent neural network to construct a convolution recurrent neural network model;
the step of combining the convolutional neural network with the recurrent neural network with long-term and short-term memory to obtain the convolutional recurrent neural network comprises the following steps:
incorporating two stacked long short-term memory layers between the encoder and the decoder of the convolutional neural network, thereby constructing the convolutional recurrent neural network.
2. The method of claim 1, wherein the step of extracting the acoustic features from the received single channel sound signal comprises:
dividing the received single-channel sound signal into time frames according to a preset time period;
extracting a spectral magnitude vector from the time frame;
and carrying out normalization processing on the frequency spectrum amplitude vector to form acoustic features.
3. The method of claim 2, wherein the step of normalizing the spectral magnitude vector to form the acoustic features comprises:
combining and normalizing the spectral magnitude vectors of the current time frame and past time frames to form the acoustic features.
4. The method of claim 2, wherein the step of normalizing the spectral magnitude vector to form the acoustic features comprises:
combining and normalizing the spectral magnitude vectors of the current time frame, past time frames, and future time frames to form the acoustic features.
5. The method of claim 1, wherein the convolutional neural network is a convolutional encoder-decoder structure, the encoder comprises a set of convolutional layers and pooling layers, the decoder structure is the same as the encoder in reverse order, and the output of the encoder is connected to the input of the decoder.
6. The method of claim 1, wherein the recurrent neural network with long-short term memory comprises two stacked layers of long-short term memory.
7. The method of claim 1, wherein each convolutional or pooling layer in the convolutional neural network comprises a maximum of 16 kernels, and each long-short term memory layer of the recurrent neural network with long-short term memory comprises 64 neurons.
8. The method of claim 1, wherein the speech training set is composed of a combination of background noise collected in a daily environment, various types of male and female voices, and speech signals mixed with a specific signal-to-noise ratio.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A computer readable storage medium storing a program, wherein the program, when executed, causes an electronic device to perform the method of any of claims 1-8.
CN201811010712.6A 2018-08-31 2018-08-31 Single-channel real-time noise reduction method based on convolution recurrent neural network Active CN109841226B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160111108A1 (en) * 2014-10-21 2016-04-21 Mitsubishi Electric Research Laboratories, Inc. Method for Enhancing Audio Signal using Phase Information
CN106328122A (en) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 Voice identification method using long-short term memory model recurrent neural network
US20170061978A1 (en) * 2014-11-07 2017-03-02 Shannon Campbell Real-time method for implementing deep neural network based speech separation
CN106875007A (en) * 2017-01-25 2017-06-20 上海交通大学 End-to-end deep neural network is remembered based on convolution shot and long term for voice fraud detection
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method




Also Published As

Publication number Publication date
WO2020042707A1 (en) 2020-03-05
CN109841226A (en) 2019-06-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40008140

Country of ref document: HK

GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 533, podium building 12, Shenzhen Bay science and technology ecological park, No.18, South Keji Road, high tech community, Yuehai street, Nanshan District, Shenzhen, Guangdong 518000

Patentee after: ELEVOC TECHNOLOGY Co.,Ltd.

Address before: 2206, phase I, International Students Pioneer Building, 29 Gaoxin South Ring Road, Yuehai street, Nanshan District, Shenzhen, Guangdong 518000

Patentee before: ELEVOC TECHNOLOGY Co.,Ltd.
