CN116364109A

CN116364109A - Speech enhancement network signal-to-noise ratio estimator and loss optimization method

Info

Publication number: CN116364109A
Application number: CN202310200774.8A
Authority: CN
Inventors: 崔立恒; 周翊; 刘宏清
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2023-03-03
Filing date: 2023-03-03
Publication date: 2023-06-30

Abstract

The invention discloses a signal-to-noise ratio estimator of a voice enhancement network, which comprises an encoder and a decoder, wherein a complex operation CNN, a complex LSTM and a complex BN layer are arranged between the encoder and the decoder, the encoder comprises a complex Conv2D layer, a complex BN layer and a real PReLU layer, a 1-D convolution module is arranged behind the LSTM layer, the 1-D convolution module is formed by alternately and serially combining a plurality of one-dimensional convolution layers and a full-connection layer, the full-connection layer is provided with a sigmoid function, the signal-to-noise ratio estimator adopts the serial combination of two one-dimensional convolution layers and the full-connection layer with the sigmoid function, and the input is the concatenation of the real part and the imaginary part of a voice signal with noise after the complex LSTM calculation, and the output is the frame-level priori signal-to-noise ratio calculated according to a formula so as to keep good voice quality.

Description

Speech enhancement network signal-to-noise ratio estimator and loss optimization method

Technical Field

The invention belongs to the technical field related to voice enhancement technology, and particularly relates to a voice enhancement network signal-to-noise ratio estimator and a loss optimization method.

Background

In a real acoustic scene, the voice signal is polluted by noise with different degrees, and the voice quality is seriously affected under the condition of larger pollution degree. The purpose of speech enhancement is to recover a clean original speech signal from noisy speech, and to improve speech intelligibility and perceptual quality by suppressing the interference of background noise on the speech. In modern voice communication systems, voice enhancement techniques are widely used, such as automatic voice recognition, mobile communication, conference systems, etc.

The current stage voice enhancement algorithm is mainly divided into a traditional algorithm and an algorithm based on deep learning. The traditional voice enhancement algorithm has low calculation cost and hardware requirement, and achieves better effect in removing stationary noise, but has weaker processing capability under the conditions of non-stationary noise and low signal-to-noise ratio. In recent years, a deep learning-based speech enhancement algorithm achieves very good results, and the method is generally regarded as a supervised learning task, and is used for enhancing noise speech in a time-frequency domain or directly in a time domain.

However, the noise reduction method based on deep learning can inevitably generate the problem of voice distortion while reducing noise, and when the noise-carrying voice with high signal to noise ratio is enhanced, the network can bring more serious attenuation and even worse than the original noise-carrying voice.

Disclosure of Invention

The invention aims to provide a speech enhancement network signal-to-noise ratio estimator and a loss optimization method, so as to solve the problems in the background technology.

In order to achieve the above purpose, the present invention provides the following technical solutions:

compared with the prior art, the invention provides a voice enhancement network signal-to-noise ratio estimator, which comprises an encoder and a decoder, wherein a complex operation CNN, a complex LSTM and a complex BN layer are arranged between the encoder and the decoder, and the encoder comprises a complex Conv2D layer, a complex BN layer and a real PReLU layer;

a signal-to-noise ratio estimator is arranged behind the LSTM layer, the signal-to-noise ratio estimator is formed by alternately and serially combining a plurality of one-dimensional convolution layers and a full connection layer, and the full connection layer is provided with a sigmoid function;

the process of 1-D convolution modulus calculating a priori signal to noise ratio is as follows:

wherein PSNR represents an a priori signal-to-noise ratio, A _clean Represent clean speech spectrum, A _noise Representing the noise spectrum. The PSNR is standardized, and the following steps are obtained:

wherein the method comprises the steps of

Represents the standard deviation of PSNR, and σ represents the mean value of PSRN.

Afterwards, PSNR is compressed to between 0 and 1 using ERFC (complementary error function)

PSNR＝(erfc(PSNR+1))/2。

2. Preferably, the Conv2D operation is implemented by a complex-valued filter W defined as:

W＝W _r +jW _i

real matrix W _r And W is _i Representing the real and imaginary parts of a complex convolution kernel.

Complex matrix x=x defining one input _r +jX _i The complex output of the complex convolution operation is obtained as:

wherein F is _out Representing the output characteristics of a plurality of layers.

3. A method of optimizing loss of a speech enhancement network according to claim 1, comprising the steps of:

s1: respectively manufacturing a voice data set with reverberation and a voice data set without reverberation by using a dynamic voice-noise mixing method, wherein the signal-to-noise ratio of the voices is distributed between 0 decibel and 20 decibels;

s2: inputting the voice manufactured in the step S1 into a DCCRN voice enhancement model with a signal-to-noise ratio estimator;

s3: the signal-to-noise ratio estimator calculates the prior signal-to-noise ratio Lable_snr, and the loss value is obtained by subtracting the prior signal-to-noise ratio Lable_snr from the SI-SNR, and the SI-SNR calculation formula is as follows:

s represents the time domain waveform of clean speech,

representing the time domain waveform of the estimated speech,<·,·>representing the dot product between the two vectors, I.I ₂ Representing euclidean distance;

Loss＝(-1)×Loss _SI-SNR +2×Loss _SNR

wherein, loss _SNR The coefficient of (2) is set to be 2, so that the scales of the two loss functions are unified;

s4: the prior signal-to-noise ratio calculated by the signal-to-noise ratio estimator is decoded by a decoder and then output to obtain the enhanced voice.

A speech enhancement network signal-to-noise ratio estimator and a loss optimization method have the following beneficial effects:

1. the voice enhancement model effectively combines the advantages of DCUNET and CRN, and a complex network is adopted between the encoder and the decoder, so that the correlation between the amplitude and the phase can be modeled through complex operation multiplication, and the phase information of voice can be better utilized.

2. The invention designs the signal-to-noise ratio estimator after two LSTM layers, and redesigns the loss function in the voice enhancement algorithm of the signal-to-noise ratio estimator, thereby being capable of calculating the prior signal-to-noise ratio of the frame level, maintaining good voice quality and improving the voice attenuation problem caused by the neural network.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate and together with the embodiments of the invention and do not constitute a limitation to the invention, and in which:

fig. 1 is a schematic diagram of a DCCRN voice enhancement model provided with a voice enhancement network signal-to-noise estimator according to the present invention;

fig. 2 is a schematic diagram of a snr estimator according to the present invention.

In the figure: 10. an encoder; 11. a complex Conv2D layer; 12. a complex BN layer; 13. real PReLU layer; 20. a decoder; 30. a complex-operation CNN; 40. a plurality of LSTM; 50. a plurality of BN layers; 60. a 1-D convolution module; 61. a one-dimensional convolution layer; 62. and (5) a full connection layer.

Detailed Description

As used herein, the singular forms "a", "an", "the" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes all or any element and all combination of one or more of the associated listed items.

The invention provides a voice enhancement network signal-to-noise ratio estimator which is arranged in a DCCRN voice enhancement model shown in figure 1.

The DCCRN speech enhancement model includes an encoder 10 and a decoder 20, between which a complex operation CNN30, a complex LSTM40, and a complex BN layer 50 are disposed between the encoder 10 and the decoder 20. The encoder 10 comprises a complex Conv2D layer 11, a complex BN layer 12 and a real prehu layer 13. The complex Conv2D layer 11 contains four conventional Conv2D operations for controlling the flow of complex information in the encoder.

The complex value filter W is defined as:

W＝W _r +jW _i

The DCCRN voice enhancement model uses a complex network, and can model the correlation between the amplitude and the phase through complex operation multiplication, so that the phase information of the voice can be better utilized.

Referring to fig. 2, two LSTM layers 40 are disposed between the encoder 10 and the decoder 20, and a signal to noise ratio estimator 60 is disposed after the LSTM layers 40, where the signal to noise ratio estimator 60 is formed by alternately and serially combining a plurality of one-dimensional convolution layers 61 and a full connection layer 62, and the full connection layer 62 has a sigmoid function.

The spliced signal of the real part and the imaginary part of the voice signal with noise calculated by the LSTM layer 40 is input to the signal-to-noise ratio estimator 60, and the signal-to-noise ratio estimator 60 processes the signal-to-noise ratio to calculate the frame-level prior signal-to-noise ratio so as to maintain good voice quality.

The signal-to-noise ratio estimator 60 calculates the prior signal-to-noise ratio as follows:

wherein the method comprises the steps of

PSNR＝(erfc(PSNR+1))/2

A method for optimizing loss of a speech enhancement network, comprising the steps of:

s2: inputting the voice produced in S1 into a DCCRN voice enhancement model having a signal-to-noise ratio estimator 60;

s3: the SNR estimator 60 calculates the a priori SNR lable_snr, which is subtracted from the SI-SNR to obtain the loss value, where the SI-SNR calculation formula is as follows:

where s represents the time domain waveform of clean speech,

representing the time domain waveform of the estimated speech,<·,·>representing two vectorsDot product between the two, I.I.I ₂ Representing Euclidean distance

The prior signal to noise ratio calculation formula is as follows:

Loss＝(-1)×Loss _SI-SNR +2×Loss _SNR

Loss _SNR the coefficient of (2) is set to be 2, so that the loss function scales of the two parts are unified;

s4: the a priori signal to noise ratio calculated by the signal to noise ratio estimator 60 is decoded by the decoder 20 and output to yield enhanced speech.

To fully verify the effect of the loss function used by the signal-to-noise estimator 60 on the improvement of speech quality, the present invention uses the data set provided by the DNS challenge (INTERSPEECH 2020) to train and evaluate the proposed model. The corpus provided contains a total of 500 hours of clean speech segments from 2150 speakers, a 180 hour noise dataset from 150 categories, and room impulse responses from openSLR26 and openSLR28 datasets. DNS challenge games also present a publicly available test dataset comprising two categories of synthetic clips, the first being reverberant and the other being non-reverberant, with 150 segments per category, with a signal to noise ratio distribution between 0db and 20 db.

In order to diversify data, the invention adopts a method of dynamically mixing voice and noise, respectively prepares 50 hours data sets with reverberation and without reverberation, totaling 100 hours, and mixes clean voice and noise under the random signal-to-noise ratio of-5 dB and 20 dB; to produce reverberated speech, the present invention convolves the random decimation of the data set of RIR_noise for speech and simulated room reverberation impact, respectively. Finally 60000 speech segments of 6 seconds were obtained, all audio files at a 16kHz sampling rate.

In the test model of the present invention, the frame size is set to 400 points, the hop size is set to 256 points, and the FFT length is 512 points. The number of channels of the encoder is {32,64,128,128,256,256}, the convolution kernel is set to (5, 2), and the step size is set to (2, 1). In the one-dimensional convolution module, the input channel is set to 4 and the convolution kernel is set to (32, 1). The last LSTM layer is followed by a 1024 unit dense layer. For the decoder, looking back one frame at each convolutional layer, for a total of 6 frames, 6x6.25 = 37.5ms. The model optimizer employs an Adam optimizer.

The present invention uses a Pytorch deep learning framework to train the test model. The input to the model is a 100 second long sequence, totaling 100 epochs. The initial value of the learning rate is set to 0.001. If the model is not improved continuously over a validation set of 2 epochs, the learning rate will drop by one third. Training is terminated if the model is not improved over a validation set of 10 epochs in succession.

As shown in table 1, model 1 (Proposed Model 1) corresponds to adding a signal to noise ratio estimator after LSTM layer, and Model 2 (Proposed Model 2) adds a new loss function provided by the present invention based on Model 1.

Table 1 DNS challenge synthesis development set objective evaluation results

In Table 1, peSQ is an objective speech quality evaluation, STOI is speech intelligibility evaluation, SI-SDR is a scale-invariant signal-to-distortion ratio, and the higher the index score, the better the speech quality, and the better the model effect. As can be seen from table 1, the method proposed by the present invention improves over other methods in 3 indexes under the condition of reverberation and no reverberation. The PESQ score for model 2 was increased by 4.62% over DCCRN under no reverberation conditions, and the scores at STOI and SI-SDR were increased by 1.53% and 1.77dB over NSnet, respectively. Model 2 has 8.22% higher PESQ score than DCCRN under reverberant conditions, 7.82% higher STOI score than DTLN, 5.927dB higher SI-SDR score than NSnet.

Table 2 DNS challenge blind test set MOS test results

Model	no reverb	reverb	realrec	Ave.
					Noisy	3.13	2.64	2.83	2.85
NSNet	3.49	2.64	3.00	3.03
					DCCRN-E	4.00	2.94	3.37	3.42
Proposed model 1	3.95	2.98	3.42	3.45
					Proposed model 2	4.03	3.02	3.57	3.54

Table 2 shows the results of tests performed on the blind test set provided by DNS Change, and it can be seen that the model provided by the invention has an MOS score improved by 15.47% under the condition of no reverberation compared with NSnet, and an MOS score improved by 2.72% and 5.93% under the condition of reverberation and true recording compared with DCCRN, respectively

Tables 3-5 show the objective evaluation results for the second test set under no reverberation conditions. It can be seen that the test results of the model provided by the invention on 5 typical signal-to-noise ratios are improved to a certain extent. Under the non-reverberation condition, the model provided by the invention improves 2.05%, 0.47% and 0.796dB on the PESQ, STOI and SI-SDR respectively, and exceeds the best performance of DCCRN-E.

TABLE 3 test results for a reverberant less noisy speech test set PESQ

test SNR	0dB	5dB	10dB	15dB	20dB	Ave.
							Noisy	1.789	2.132	2.47	2.833	3.176	2.48
DCCRN-E	2.704	3.075	3.379	3.628	3.814	3.32
							Proposed model 2	2.805	3.137	3.398	3.735	3.867	3.388

TABLE 4 test results for non-reverberant noisy speech test set STOI

test SNR	0dB	5dB	10dB	15dB	20dB	Ave.
							Noisy	77.55	84.78	90.44	94.34	96.76	88.77
DCCRN-E	87.44	92.31	95.31	97.13	98.18	94.07
							Proposed model 2	89.07	92.33	95.53	97.43	98.24	94.52

TABLE 5 test results for non-reverberant noisy speech test set SI-SDR

test SNR	0dB	5dB	10dB	15dB	20dB	Ave.
							Noisy	0.005	5.003	10.003	14.899	20.001	9.982
DCCRN-E	11.19	14.905	18.22	21.343	24.039	17.939
							Proposed model 2	12.181	15.905	19.208	21.896	24.487	18.735

In summary, the model introduced into the signal-to-noise estimator according to the present invention focuses on improving the damage degree of the neural network to the speech perception quality. From the experimental results and the index scoring situation, it can be seen that the proposed model exceeds other models, and the effect of the signal-to-noise ratio estimator 60 using the prior signal-to-noise ratio calculation method provided by the invention is demonstrated.

Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims

1. A speech enhancement network signal to noise ratio estimator and loss optimization method the speech enhancement network signal to noise ratio estimator comprises an encoder (10) and a decoder (20), characterized in that:

a complex operation CNN (30), a complex LSTM (40) and a complex BN layer (50) are arranged between the encoder (10) and the decoder (20), and the encoder (10) comprises a complex Conv2D layer (11), a complex BN layer (12) and a real PReLU layer (13);

the LSTM layer (40) is provided with a signal-to-noise ratio estimator (60) at the back, the signal-to-noise ratio estimator (60) is formed by alternately and serially combining a plurality of one-dimensional convolution layers (61) and a full connection layer (62), and the full connection layer (62) is provided with a sigmoid function;

the process of the signal to noise ratio estimator (60) to calculate the a priori signal to noise ratio is as follows:

wherein PSNRRepresenting a priori signal to noise ratio, A _clean Represent clean speech spectrum, A _noise Representing the noise spectrum. The PSNR is standardized, and the following steps are obtained:

wherein the method comprises the steps of

PSNR＝(erfc(PSNR+1))/2。

2. The speech enhancement network signal-to-noise ratio estimator and loss optimization method according to claim 1, wherein: the Conv2D operation is implemented by a complex-valued filter W defined as:

W＝W _r +jW _i

3. A speech enhancement network signal to noise ratio estimator and loss optimization method according to any of claims 1 or 2, the speech enhancement network loss optimization method comprising the steps of:

s2: inputting the voice produced in the step S1 into a DCCRN voice enhancement model with a signal-to-noise ratio estimator (60);

s3: a signal-to-noise ratio estimator (60) calculates a priori signal-to-noise ratio Lable_snr, and subtracts the priori signal-to-noise ratio Lable_snr from SI-SNR to obtain a loss value, wherein the SI-SNR calculation formula is as follows:

s represents the time domain waveform of clean speech,

representing the time domain waveform of the estimated speech,<·，·>representing the dot product between the two vectors, I.I ₂ Representing euclidean distance;

the prior signal-to-noise ratio lable_SNR in the loss function is calculated as follows:

Loss＝(-1)×Loss _SI-SNR +2×Loss _SNR

s4: the a priori signal to noise ratio calculated by the signal to noise ratio estimator (60) is decoded by the decoder (20) and output to obtain enhanced speech.