CN116935879A - Two-stage network noise reduction and dereverberation method based on deep learning - Google Patents


Info

Publication number: CN116935879A
Application number: CN202210355142.4A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: stage, network, noise reduction, noise, reverberation
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Inventors: 刘宏清 (Liu Hongqing), 夏俊杰 (Xia Junjie)
Current and original assignee: Chongqing University of Posts and Telecommunications
Filing date: 2022-04-06
Publication date: 2023-10-24
Application filed by Chongqing University of Posts and Telecommunications

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0224: Processing in the time domain
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L2021/02082: Noise filtering, the noise being echo or reverberation of the speech
    • Y02T90/00: Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The invention relates to a two-stage network noise reduction and dereverberation method based on deep learning, belonging to the field of speech processing. According to the different natures of the two interfering signals, background noise and room reverberation are handled in separate noise reduction and dereverberation stages. The two single-stage networks are first trained independently; their weight parameters and related configuration are retained and transplanted into the time-domain two-stage network for joint training. The invention processes noise and reverberation in the time domain, requires no extra transformation of the speech signal, and avoids the loss of useful information that signal transformation incurs. Analysis of the experimental data shows that the time-domain two-stage network outperforms both the single-stage networks and the frequency-domain networks.

Description

Two-stage network noise reduction and dereverberation method based on deep learning
Technical Field
The invention belongs to the field of speech processing, and relates to a two-stage network noise reduction and dereverberation method based on deep learning.
Background
In recent years, researchers have done much work on suppressing background noise and room reverberation. For suppressing reverberation alone, inverse filtering is one of the most common methods: an inverse filter that cancels the effect of the room impulse response is estimated, and the reverberant signal is convolved with it to obtain an estimate of the clean speech signal; the difficulty lies in estimating a reasonable inverse filter. Wu Mingyang et al. later proposed a two-stage algorithm for the single-microphone scenario that uses an inverse filter in the first stage and spectral subtraction in the second stage to process early and late reverberation, respectively. Zhao Yan et al. then used a deep neural network (DNN) to learn a frequency-domain spectral mapping from noisy reverberant speech to clean speech, the first study to handle room reverberation and background noise simultaneously with a supervised learning approach. However, background noise and room reverberation differ in nature: the reverberant signal is generated by convolving the clean speech signal with the room impulse response (RIR), whereas the noisy speech signal is the superposition of the clean speech signal and the background noise. Background noise and room reverberation therefore cannot be handled in the same way within one model, and the two interfering signals should be processed separately. In addition, the above algorithms process the speech signal in the frequency domain; before reconstructing the frequency-domain signal into a time-domain waveform, the spectrum of the clean speech signal is usually estimated with the phase of the noisy speech signal, which cannot fully exploit the phase information of the clean speech signal and causes the estimated clean speech signal to deviate from the target speech signal.
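The inverse-filtering idea described above can be sketched as follows: a minimal frequency-domain regularized inversion, assuming the RIR is already known (precisely the estimate that, as noted, is difficult in practice). All names and values here are illustrative, not from the patent.

```python
import numpy as np

def inverse_filter_dereverb(reverberant, rir, n_fft=1024, eps=1e-6):
    """Estimate clean speech by regularized frequency-domain inversion of the RIR."""
    H = np.fft.rfft(rir, n_fft)
    # Regularized inverse avoids dividing by near-zero frequency bins of H
    G = np.conj(H) / (np.abs(H) ** 2 + eps)
    X = np.fft.rfft(reverberant, n_fft)
    return np.fft.irfft(X * G, n_fft)

# Toy example: a two-tap "room" (direct path plus one echo) applied to a short signal
rng = np.random.default_rng(0)
s = rng.standard_normal(256)
rir = np.zeros(64)
rir[0] = 1.0
rir[20] = 0.5
x = np.convolve(s, rir)                       # reverberant observation
s_hat = inverse_filter_dereverb(x, rir)[:256]  # recovered signal
```

Because the toy RIR has no spectral nulls, the regularized inverse recovers the signal almost exactly; with a real room response (near-zero bins, estimation error) the recovery degrades, which motivates the learned approach of the invention.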
Disclosure of Invention
It is therefore an object of the present invention to provide a time-domain two-stage joint network model that processes background noise and room reverberation in separate stages in the time domain. The invention first trains two single-stage networks, transplants the network weight parameters obtained by the independent training into the two-stage joint network model, and uses them as initial values for the joint training. Training and testing are carried out for the frequency-domain single-stage network, the time-domain single-stage network, the frequency-domain two-stage network, and the time-domain two-stage network on the same data set, and the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) scores of the different networks are compared, verifying that the time-domain two-stage method of the invention has better performance.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a two-stage network noise reduction and dereverberation method based on deep learning, comprising the steps of:
s1: preparing a data set: setting a reverberation environment, synthesizing the reverberation environment with pure voice signals to obtain reverberation signals, and synthesizing the reverberation signals with a training noise data set and a test noise data set respectively to obtain a voice signal training set and a test set which simultaneously contain noise and reverberation;
s2: building a two-stage joint network model based on a cyclic neural network (RNN) and a time domain convolutional network (TCN), wherein the two-stage joint network model comprises a noise reduction stage and a dereverberation stage;
s3: the time domain voice signal is input into a single-stage network for independent training, the input of the noise reduction stage comprises a noise reverberation signal and a noise-free reverberation signal H (t), the noise-free reverberation signal H (t) is used as a learning label, and the output of the noise reduction stage is an estimated noise-free reverberation signalThe loss function will constantly estimate +.>Fitting to the learning label H (t); the input of the dereverberation stage comprises a noise-free reverberant signal and a clean speech signal s (t), and the clean speech signal s (t) is used as a learning label, and the output of the dereverberation stage is an estimated clean speech signal +.>The loss function will constantly estimate +.>Fitting to learning tags s (t);
s4: performing joint training on the two-stage joint network model, and simultaneously inhibiting noise and reverberation; the optimal weight parameters of independent training of the noise reduction stage and the dereverberation stage are reserved and used as initial values of training of a two-stage combined network model; the inputs of the two-stage joint network model include the noise reverberation signal and the clean speech signal s (t), s (t) being the learned label, will be estimatedIs a clean speech signal of (1)Fitting to the tag s (t);
s5: repeating the step S4, and ending training when the loss value reaches the minimum and converges;
s6: and testing the trained two-stage joint network model by using the test set.
Further, in step S1, the setting a reverberant environment is: defining 5 different reverberation times between 0.1s and 0.9s, and the step size is 0.2s; the length and width of the room are arbitrarily valued between 2 meters and 10 meters, and the microphone and sound source positions are randomly arranged inside the room.
Further, in step S1, different signal-to-noise ratios are used in synthesizing the noise reverberation signal, and all the speech data are at the same sampling rate.
Further, the model of the noise reduction stage in step S2 includes an encoder, a noise reduction module, and a decoder, where the noise reduction module includes sequence segmentation, block processing, and overlap-add; the encoder and decoder are configured to convert the speech signal between a time-domain waveform and high-dimensional features; the sequence segmentation is used to split the input feature sequence into overlapping blocks and then stack all the blocks into a three-dimensional tensor; the block processing comprises an intra-block processing module that processes the first and second dimensions of the three-dimensional tensor and an inter-block processing module that processes the first and third dimensions, and the overlap-add is used to synthesize the long speech sequence.
Further, the model of the dereverberation stage in step S2 is used to generate high-dimensional features of the input speech signal, including an encoder, a time domain convolution network, an activation function and a decoder; the decoder output of the noise reduction stage is used as the encoder input of the dereverberation stage, the mask is estimated through the time domain convolution network and the activation function, then the output of the encoder is multiplied with the estimated mask, the high-dimensional characteristics of the estimated pure speech signal are obtained, and finally the decoder is used for converting the estimated high-dimensional characteristics into the time domain speech signal.
Further, the time domain convolution network is composed of stacked one-dimensional dilation convolutions (1-D D-Conv).
Further, in step S3, the loss function formula of the noise reduction stage is as follows:
where s is the target speech signal, ŝ is the estimated speech signal, and ‖·‖₂ denotes the ℓ₂ norm of a vector (so ‖s‖₂² is the inner product of s with itself).
Further, in step S3, the loss function formula of the dereverberation stage is as follows:
further, in step S4, the joint loss function of the two-stage network is as follows:
further, an Adam optimizer is adopted to optimize the joint loss of the two-stage network, the Adam algorithm sets independent adaptive learning rate for different parameters by calculating first moment estimation and second moment estimation of the gradient, the neuron weight is biased by counter propagation, and the weight of the network neuron is continuously updated by calculating an optimal solution.
The invention has the following beneficial effects: it processes noise and reverberation in the time domain, requires no extra transformation of the speech signal, and avoids the loss of useful information that signal transformation incurs. Analysis of the experimental data shows that the time-domain two-stage network outperforms both the single-stage networks and the frequency-domain networks.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram of a two-stage joint network model;
FIG. 2 is a schematic diagram of sequence division;
FIG. 3 is a block processing flow diagram;
fig. 4 is a schematic structural diagram of TCN.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the invention with reference to specific examples. The invention may also be practiced or carried out in other embodiments, and details of the present description may be modified or varied in various respects without departing from the spirit and scope of the invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the invention by way of example, and the following embodiments and their features may be combined with one another where no conflict arises.
The drawings are for illustrative purposes only; they are schematic rather than physical and are not intended to limit the invention. For better illustration of the embodiments, certain elements of the drawings may be omitted, enlarged or reduced, and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numbers in the drawings of the embodiments correspond to the same or similar components. In the description of the invention, terms such as "upper", "lower", "left", "right", "front" and "rear" indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience and simplification of description and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation. Such terms are exemplary, should not be construed as limiting the invention, and their specific meaning can be understood by those of ordinary skill in the art according to the specific circumstances.
Referring to fig. 1, a two-stage network noise reduction and dereverberation method based on deep learning mainly comprises the following steps:
step S1: a data set used in the present invention was made. The clean speech signal used was taken from the WSJ0 dataset, the noise dataset for training was taken from ESC-50, and the noise dataset for testing was taken from noise 92. Creating the dataset requires setting different reverberation times, room size, microphone locations and sound source locations to simulate different reverberant environments. First, 5 different reverberation times are defined between 0.1s and 0.9s, and the step size is 0.2s. Secondly, the length and width of the room are arbitrarily valued between 2 meters and 10 meters, and the microphone and sound source positions are randomly arranged inside the room. The clean speech signal from WSJ0 is used to synthesize a different reverberant signal from the randomly modeled reverberant environment. And randomly extracting noise from the ESC-50 and Noisex92 noise data sets and synthesizing the noise and the reverberation signal to obtain a voice signal containing both noise and reverberation. Different signal-to-noise ratios are used in synthesizing the noise reverberation signal, respectively-9 dB, -5dB, 0dB, 5dB, and 9dB. The training set in the final data set was 40 hours, the validation set was 15 hours, the test set was 15 hours, and the sampling rate of all voice data was 16kHz.
Step S2: the invention mainly builds a model based on two networks of RNN and TCN.
1) The noise reduction stage can be divided into three parts: encoder, noise reduction module and decoder. The noise reduction module in turn includes sequence segmentation, block processing, and overlap-add. The codec functions to convert the speech signal back and forth from a time domain waveform to high-dimensional features. As shown in fig. 2, the purpose of the sequence segmentation is to segment the input feature sequence into overlapping blocks, and then stack all the blocks into a three-dimensional tensor, which facilitates the learning of the block processing module. As shown in fig. 3, the block processing includes intra-block processing and inter-block processing, and for the intra-block processing module, it processes the first and second dimension information of the three-dimensional tensor, and the inter-block processing module processes the first and third dimension information of the three-dimensional tensor.
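The sequence segmentation and overlap-add steps of the noise reduction module can be sketched as follows, in NumPy, assuming 50% overlap; the block sizes are illustrative and not taken from the patent.

```python
import numpy as np

def segment(x, block_len, hop):
    """Split a (features, time) sequence into overlapping blocks and stack them
    into a three-dimensional tensor of shape (features, block_len, n_blocks)."""
    n_feat, n_time = x.shape
    n_blocks = (n_time - block_len) // hop + 1
    return np.stack(
        [x[:, i * hop : i * hop + block_len] for i in range(n_blocks)], axis=-1
    )

def overlap_add(blocks, hop):
    """Inverse of segment(): synthesize the long sequence, averaging overlaps."""
    n_feat, block_len, n_blocks = blocks.shape
    n_time = (n_blocks - 1) * hop + block_len
    out = np.zeros((n_feat, n_time))
    win = np.zeros(n_time)
    for i in range(n_blocks):
        out[:, i * hop : i * hop + block_len] += blocks[:, :, i]
        win[i * hop : i * hop + block_len] += 1.0
    return out / win            # divide by overlap count at each time step

feat = np.random.default_rng(2).standard_normal((64, 200))  # hypothetical encoder output
tensor3d = segment(feat, block_len=40, hop=20)              # tensor for block processing
rec = overlap_add(tensor3d, hop=20)                         # long sequence synthesized back
```

The intra-block module then operates along the second axis of `tensor3d` (within each block) and the inter-block module along the third axis (across blocks), matching the two processing directions described above.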
2) The dereverberation stage uses an encoder for generating high-dimensional features of the input speech signal, further multiplies the output of the encoder with the estimated mask to obtain high-dimensional features of the estimated clean speech signal, and finally uses a decoder to convert the estimated features into a time-domain speech signal. As shown in fig. 4, the TCN of stacked 1-D D-Conv is used in estimating the mask.
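As a small illustration of why stacked 1-D dilated convolutions suit mask estimation, the receptive field of such a TCN grows exponentially with depth. The configuration numbers below are hypothetical (the patent does not specify kernel sizes or layer counts); the formula assumes the dilation doubles per layer and resets at the start of each repeated stack, a common TCN arrangement.

```python
def tcn_receptive_field(n_stacks, n_layers, kernel=3):
    """Receptive field, in encoder frames, of a TCN whose dilation doubles per
    layer (1, 2, 4, ...) and resets at the start of each repeated stack."""
    per_layer = [(kernel - 1) * 2 ** layer for layer in range(n_layers)]
    return 1 + n_stacks * sum(per_layer)

# A hypothetical 3-stack, 8-layer configuration covers over 1500 frames of context
print(tcn_receptive_field(n_stacks=3, n_layers=8))  # prints 1531
```

A single non-dilated layer with the same kernel would only see 3 frames, so dilation is what lets the dereverberation stage model the long reverberant tail without a deep plain stack.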
Step S3: the time-domain speech signal is input into each single-stage network for independent training. The purpose of the noise reduction stage is to suppress the noise and obtain a noise-free reverberant signal; its input comprises the noisy reverberant signal and the noise-free reverberant signal H(t), the latter being the learning label. The output of the noise reduction stage is the estimated noise-free reverberant signal Ĥ(t), and the loss function continually fits the estimate Ĥ(t) to the learning label H(t). The loss function formula for the noise reduction stage is as follows:
wherein:
where s is the target speech signal, ŝ is the estimated speech signal, and ‖·‖₂ denotes the ℓ₂ norm of a vector (so ‖s‖₂² is the inner product of s with itself).
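The loss formula itself is rendered as an image in the original and is not reproduced here. The scale-invariant signal-to-noise ratio (SI-SNR), a common time-domain loss that matches the description above (target signal s, estimate ŝ, ℓ₂ norms), can serve as a plausible sketch; this specific form is an assumption, not confirmed by the patent text.

```python
import numpy as np

def neg_si_snr(s_hat, s, eps=1e-8):
    """Negative scale-invariant SNR: a common time-domain training loss.
    Shown as a plausible form only; the patent's formula is not reproduced."""
    # Project the estimate onto the target to get the "clean" component
    s_target = (np.dot(s_hat, s) / (np.dot(s, s) + eps)) * s
    e_noise = s_hat - s_target                      # residual interference
    si_snr = 10 * np.log10(
        np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps)
    )
    return -si_snr                                   # negate: lower loss is better

rng = np.random.default_rng(3)
s = rng.standard_normal(1000)                        # target signal
good = s + 0.01 * rng.standard_normal(1000)          # close estimate
bad = s + 1.0 * rng.standard_normal(1000)            # poor estimate
```

A better estimate yields a lower loss, and rescaling the estimate leaves the loss essentially unchanged, which is the "scale-invariant" property that makes this loss popular for time-domain enhancement.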
The purpose of the dereverberation stage is to recover the clean speech signal from the noise-free reverberant signal. Its inputs comprise the noise-free reverberant signal and the clean speech signal s(t), with s(t) regarded as the learning label. The output of the dereverberation stage is the estimated clean speech signal ŝ(t), and the loss function continually fits the estimate ŝ(t) to the learning label s(t) so as to achieve the expected reverberation-suppression effect. The loss function of the dereverberation stage is formulated as follows:
step S4: the two-stage network is jointly trained, and noise and reverberation are restrained. The invention reserves the optimal weight parameters of independent training of the noise reduction stage and the dereverberation stage and uses the optimal weight parameters as initial values of two-stage joint network training. This not only shortens the training period of the two-phase joint network, but also makes it easier to obtain an optimal two-phase network model. The inputs to the two-stage joint network training include a noise reverberant signal and a clean speech signal s (t), the purpose of the model being to suppress both noise and reverberant to obtain an estimated clean speech signalAnd s (t) is used as a learning label to estimate the pure speech signal +.>Fitting to the tag s (t). The joint loss function of the two-stage network is as follows:
when the loss is large, the network performance is poor and the network is not optimal. In order to minimize loss, an Adam optimizer is adopted for parameter optimization, an Adam algorithm sets independent adaptive learning rate for different parameters by calculating first moment estimation and second moment estimation of gradients, and the neuron weight is biased by counter propagation, so that the weight of the network neuron is continuously updated by calculating an optimal solution.
Step S5: step S4 is repeated, and training ends when the loss value reaches its minimum and converges; the network parameters are then optimal, and the network model is taken as the system model.
Step S6: the trained model is tested with the test dataset synthesized in step S1. Different methods are compared, and the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) scores of each are obtained, verifying the superior performance of the invention; Table 1 gives the PESQ scores and Table 2 the STOI scores.
TABLE 1
TABLE 2
PESQ scores range from -0.5 to 4.5 and STOI scores from 0 to 1; the higher the score, the better the performance of the network.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (10)

1. A two-stage network noise reduction and dereverberation method based on deep learning is characterized in that: the method comprises the following steps:
s1: preparing a data set: setting a reverberation environment, synthesizing the reverberation environment with pure voice signals to obtain reverberation signals, and synthesizing the reverberation signals with a training noise data set and a test noise data set respectively to obtain a voice signal training set and a test set which simultaneously contain noise and reverberation;
s2: building a two-stage joint network model based on a cyclic neural network (RNN) and a time domain convolutional network (TCN), wherein the two-stage joint network model comprises a noise reduction stage and a dereverberation stage;
s3: the time domain voice signal is input into a single-stage network for independent training, the input of the noise reduction stage comprises a noise reverberation signal and a noise-free reverberation signal H (t), the noise-free reverberation signal H (t) is used as a learning label, and the output of the noise reduction stage is an estimated noise-free reverberation signalThe loss function will constantly estimate +.>Fitting to the learning label H (t); the input of the dereverberation stage comprises a noise-free reverberant signal and a clean speech signal s (t), and the clean speech signal s (t) is used as a learning label, and the output of the dereverberation stage is an estimated clean speech signal +.>The loss function will constantly estimate +.>Fitting to learning tags s (t);
s4: performing joint training on the two-stage joint network model, and simultaneously inhibiting noise and reverberation; the optimal weight parameters of independent training of the noise reduction stage and the dereverberation stage are reserved and used as initial values of training of a two-stage combined network model; the inputs of the two-stage joint network model include the noise reverberation signal and the clean speech signal s (t), s (t) being the learned label, the estimated clean speech signalFitting to the tag s (t);
s5: repeating the step S4, and ending training when the loss value reaches the minimum and converges;
s6: and testing the trained two-stage joint network model by using the test set.
2. The deep learning based two-stage network noise reduction and dereverberation method according to claim 1, wherein: the setting the reverberation environment in step S1 is: defining 5 different reverberation times between 0.1s and 0.9s, and the step size is 0.2s; the length and width of the room are arbitrarily valued between 2 meters and 10 meters, and the microphone and sound source positions are randomly arranged inside the room.
3. The deep learning based two-stage network noise reduction and dereverberation method according to claim 1, wherein: in step S1, different signal-to-noise ratios are used when synthesizing the noise reverberation signal, and all the speech data are at the same sampling rate.
4. The deep learning based two-stage network noise reduction and dereverberation method according to claim 1, wherein: the model of the noise reduction stage in step S2 comprises an encoder, a noise reduction module and a decoder, the noise reduction module comprising sequence segmentation, block processing and overlap-add; the encoder and decoder are configured to convert the speech signal between a time-domain waveform and high-dimensional features; the sequence segmentation is used to split the input feature sequence into overlapping blocks and then stack all the blocks into a three-dimensional tensor; the block processing comprises an intra-block processing module that processes the first and second dimensions of the three-dimensional tensor and an inter-block processing module that processes the first and third dimensions; and the overlap-add is used to synthesize the long speech sequence.
5. The deep learning based two-stage network noise reduction and dereverberation method according to claim 1, wherein: the model of the dereverberation stage in step S2 is used to generate high-dimensional features of the input speech signal, including an encoder, a time domain convolution network, an activation function and a decoder; the decoder output of the noise reduction stage is used as the encoder input of the dereverberation stage, the mask is estimated through the time domain convolution network and the activation function, then the output of the encoder is multiplied with the estimated mask, the high-dimensional characteristics of the estimated pure speech signal are obtained, and finally the decoder is used for converting the estimated high-dimensional characteristics into the time domain speech signal.
6. The deep learning based two-stage network noise reduction and dereverberation method according to claim 5, wherein: the time domain convolution network is composed of stacked one-dimensional dilation convolutions (1-D D-Conv).
7. The deep learning based two-stage network noise reduction and dereverberation method according to claim 1, wherein: in step S3, the loss function formula in the noise reduction stage is as follows:
where s is the target speech signal, ŝ is the estimated speech signal, and ‖·‖₂ denotes the ℓ₂ norm of a vector (so ‖s‖₂² is the inner product of s with itself).
8. The deep learning based two-stage network noise reduction and dereverberation method according to claim 1, wherein: in step S3, the loss function formula of the dereverberation stage is as follows:
9. The deep learning based two-stage network noise reduction and dereverberation method according to claim 1, wherein: in step S4, the joint loss function of the two-stage network is as follows:
10. The deep learning based two-stage network noise reduction and dereverberation method according to claim 1, wherein: an Adam optimizer is adopted to optimize the joint loss of the two-stage network; the Adam algorithm sets an independent adaptive learning rate for each parameter by computing first- and second-moment estimates of the gradient, the partial derivatives of the loss with respect to the neuron weights are obtained by backpropagation, and the network weights are updated iteratively toward the optimal solution.
CN202210355142.4A 2022-04-06 2022-04-06 Two-stage network noise reduction and dereverberation method based on deep learning Pending CN116935879A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210355142.4A CN116935879A (en) 2022-04-06 2022-04-06 Two-stage network noise reduction and dereverberation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210355142.4A CN116935879A (en) 2022-04-06 2022-04-06 Two-stage network noise reduction and dereverberation method based on deep learning

Publications (1)

Publication Number Publication Date
CN116935879A true CN116935879A (en) 2023-10-24

Family

ID=88391296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210355142.4A Pending CN116935879A (en) 2022-04-06 2022-04-06 Two-stage network noise reduction and dereverberation method based on deep learning

Country Status (1)

Country Link
CN (1) CN116935879A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117174105A (en) * 2023-11-03 2023-12-05 深圳市龙芯威半导体科技有限公司 Speech noise reduction and dereverberation method based on improved deep convolutional network



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination