CN116013339A - Single-channel voice enhancement method based on improved CRN - Google Patents

Single-channel voice enhancement method based on improved CRN

Info

Publication number
CN116013339A
Authority
CN
China
Prior art keywords
signal
input
voice
crn
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211598075.5A
Other languages
Chinese (zh)
Inventor
吕泽均
朱智慧
刘蕊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202211598075.5A
Publication of CN116013339A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The invention discloses a single-channel voice enhancement method based on an improved convolutional recurrent network (CRN), which adds a scale-invariant signal-to-distortion ratio (SI-SDR) loss L_SI-SDR for joint optimization to address the single loss function of the CRN method, and builds the RNN module with gated recurrent units (GRU) to address the large number of training parameters and the training difficulty of the LSTM model, so that an enhanced speech signal with high perceptual quality and high signal-to-noise ratio can be generated from noisy speech.

Description

Single-channel voice enhancement method based on improved CRN
Technical Field
The invention belongs to the technical field of speech enhancement, and particularly relates to a single-channel voice enhancement method based on an improved convolutional recurrent network (CRN).
Background
Speech enhancement techniques are widely used in telephony, hearing assistance, speech recognition and similar applications, where background noise typically degrades performance severely. The goal of single-channel speech enhancement is to remove background noise interference, reconstruct clean speech from the noisy speech signal, and improve speech quality. Conventional speech enhancement methods such as Wiener filtering, spectral subtraction and subspace-based methods typically estimate the noise spectrum by averaging the noisy signal over segments that contain no speech; however, the reliability of this estimate degrades severely under non-stationary noise or low signal-to-noise conditions. In addition, the objective evaluation of speech enhancement in conventional methods focuses mainly on perceptual speech quality (PESQ) and short-time objective intelligibility (STOI); for downstream tasks such as speech recognition, information contained in the speech signal is destroyed, which lowers the recognition rate of the speech recognition system.
In recent years, data-driven deep neural network (DNN) approaches have shown superiority in handling most non-stationary noise conditions. DNN-based speech enhancement frameworks can be divided, according to the learning target, into spectral-mask-based methods and spectral-mapping-based methods.
Spectral-mask-based methods assume sparse and disjoint signal energy and estimate a time-frequency mask (T-F mask) per time-frequency unit, such as the ideal binary mask (IBM) based on the signal-to-noise ratio and the ideal ratio mask (IRM) based on the signal-to-noise energy ratio; the enhanced speech is obtained as the Hadamard product of the time-frequency mask and the noisy spectrogram. However, the discretized signal-to-noise assumption of the IBM is oversimplified, and the IRM does not consider the phase information of the speech signal. Spectral-mapping-based methods turn the separation problem into a regression problem by learning the mapping between noisy spectral features (such as the log power spectrum (LPS), log magnitude spectrum (LSA) and Mel spectrum) and clean spectral features, and finally reconstruct the waveform to obtain the enhanced speech; but spectral mapping does not make full use of the temporal information of the speech signal. With a convolutional neural network (CNN), better performance than a fully connected DNN can be achieved while reducing training parameters through weight sharing, but convolutional networks remain limited to local signal features. A recurrent neural network (RNN) models speech as time-series data and processes long-term context over the sequence, but typically requires hand-crafted features; the RNN can learn long-term correlations through its recurrent structure, yet pays insufficient attention to the speech components in long noisy utterances, and its large network scale faces vanishing and exploding gradients.
Wiener filtering, as a classical method, has a wide range of applications. A filter is trained so that the mean square error between the clean signal and the enhanced signal produced by passing the noisy signal through the filter is minimized. However, it is not suited to non-stationary noise conditions and requires knowledge of all future speech frames, which is difficult to satisfy in practical applications.
Long short-term memory networks (LSTM) can learn dependencies across the length of a speech signal, but pay insufficient attention to the speech components in long noisy utterances. Each input frame is treated as a flat feature vector, so the time-frequency structure in the magnitude spectrogram cannot be fully exploited. Moreover, the network is large and faces vanishing and exploding gradients.
In a convolutional recurrent network (CRN), the encoder extracts high-dimensional features from the local time-frequency information of the signal and the decoder reconstructs the target speech, while the recurrent structure further models long-term temporal dependencies. The CRN needs no future speech frame information and can enhance speech in real time based on the current frame. However, the objective function of the CRN is an ideal-ratio-mask loss based on the mean square error and remains confined to the time-frequency domain, so the advantage of the model structure is not fully exploited; using LSTM as the RNN layer also makes the model difficult to train.
In summary, a single-channel voice enhancement method based on an improved CRN is proposed that adds a scale-invariant signal-to-distortion ratio (SI-SDR) loss L_SI-SDR to address the single loss function of the CRN method, builds the RNN module with gated recurrent units (GRU) to address the large number of training parameters and the training difficulty of the LSTM model, and generates enhanced speech signals with high perceptual quality and high signal-to-noise ratio from noisy speech.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a single-channel voice enhancement method based on an improved CRN, which adds a scale-invariant signal-to-distortion ratio (SI-SDR) loss L_SI-SDR for joint optimization to address the single loss function of the CRN method, builds the RNN module with gated recurrent units (GRU) to address the large number of training parameters and the training difficulty of the LSTM model, and can generate an enhanced speech signal with high perceptual quality and high signal-to-noise ratio from noisy speech.
In order to achieve the technical purpose, the invention is realized by the following technical scheme: a single-channel voice enhancement method based on an improved CRN comprises the following steps:
S1: downloading the VoiceBank-DEMAND data set;
S2: applying a window function to the speech signal frame by frame to compute the short-time Fourier transform (STFT) and obtain magnitude spectrum features;
S3: inputting the features into a convolutional encoder that performs downsampling and extracts high-dimensional features from the input spectrogram;
S4: building an RNN module with gated recurrent units (GRU) for sequence modeling, and reshaping the GRU output back to the dimensions required by the decoder input;
S5: computing the loss of the enhanced speech signal relative to the clean speech and back-propagating it to update the model parameters;
S6: reconstructing the enhanced speech signal from the enhanced magnitude spectrum and the phase spectrum of the noisy signal through the inverse short-time Fourier transform (ISTFT); training the model for 200 epochs, computing the evaluation metrics perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI) and scale-invariant signal-to-distortion ratio (SI-SDR), comparing the sums of the three scores, and saving the best model.
Preferably, in step S1, the noise in the VoiceBank-DEMAND data set comes from the DEMAND data set and the clean speech from VoiceBank, with an audio sampling rate of 48 kHz; the mixed speech used as training input uses 10 noise types (2 artificially synthesized and 8 from DEMAND) and is synthesized at 4 different signal-to-noise ratio settings (15, 10, 5 and 0 dB); the test-set input is synthesized from the remaining 5 DEMAND noise types and 2 VoiceBank speakers at signal-to-noise ratios different from the training set (17.5, 12.5, 7.5 and 2.5 dB); the training set and the test set are each divided into clean and noisy folders containing one-to-one corresponding clean and noisy speech, the training set contains 28 speakers, the test set contains 2 speakers, and each speaker has about 400 sentences.
Preferably, in step S1 the speech data set is resampled to 16 kHz using the audio processing tool ffmpeg.
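As a rough illustration of this preprocessing step, the batch resampling and the pairing of clean and noisy files could be scripted as follows; the folder layout, function names and paths are assumptions for illustration only and are not part of the claimed method.

```python
import subprocess
from pathlib import Path

def resample_to_16k(src_dir: str, dst_dir: str) -> None:
    """Batch-resample every wav under src_dir to 16 kHz with ffmpeg (illustrative paths)."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(src_dir).glob("*.wav")):
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(wav), "-ar", "16000", str(dst / wav.name)],
            check=True,
        )

def paired_files(clean_dir: str, noisy_dir: str):
    """Yield (clean, noisy) wav pairs that share a file name across the two folders."""
    for clean in sorted(Path(clean_dir).glob("*.wav")):
        noisy = Path(noisy_dir) / clean.name
        if noisy.exists():
            yield clean, noisy
```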
Preferably, the noisy speech signal y(n) in S2 is:

y(n) = s(n) + d(n)    (1)

where s(n) represents the clean speech and d(n) represents the noise signal;

the short-time Fourier transform (STFT) of each frame is expressed as:

y_l(m) = s_l(m) + d_l(m)    (2)

where l denotes the frame index and m the frequency bin index; the transformed magnitude spectrum feature has 161 dimensions.
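A minimal sketch of the feature extraction in S2, assuming Librosa and the window parameters given in the embodiment below (n_fft = 320, frame shift 160, Hann window at 16 kHz, which yields the 161 frequency bins mentioned above); the function name is illustrative.

```python
import librosa
import numpy as np

def stft_magnitude(wav_path: str, sr: int = 16000, n_fft: int = 320,
                   hop: int = 160, win: int = 320) -> np.ndarray:
    """Load a wav file at 16 kHz and return its STFT magnitude of shape (161, frames)."""
    y, _ = librosa.load(wav_path, sr=sr)
    spec = librosa.stft(y, n_fft=n_fft, hop_length=hop, win_length=win,
                        window="hann", center=True)
    return np.abs(spec)  # n_fft // 2 + 1 = 161 frequency bins per frame
```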
Preferably, the encoder in S3 consists of 5 convolutional layers with input/output channel dimensions (1, 16), (16, 32), (32, 64), (64, 128), (128, 256), a convolution kernel size of 3×2, a stride of (2, 1) and padding of (0, 1); BatchNorm2d is used for regularization and ELU as the activation function, which mitigates vanishing gradients and converges faster than ReLU.
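A PyTorch sketch of such an encoder under the stated channel widths, kernel, stride, padding, BatchNorm2d and ELU; treating the axes as (frequency, time) and returning the per-layer outputs for the later skip connections are implementation assumptions.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Five Conv2d blocks with channel widths (1,16),(16,32),(32,64),(64,128),(128,256),
    kernel 3x2, stride (2,1), padding (0,1), BatchNorm2d and ELU, as stated above.
    Axes are assumed as (batch, channel, freq, time), so the stride of 2 halves the
    161-bin frequency axis down to 4 bins at the deepest layer."""
    def __init__(self):
        super().__init__()
        chans = [1, 16, 32, 64, 128, 256]
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], kernel_size=(3, 2),
                          stride=(2, 1), padding=(0, 1)),
                nn.BatchNorm2d(chans[i + 1]),
                nn.ELU(),
            )
            for i in range(5)
        ])

    def forward(self, x):
        # x: (batch, 1, 161, time); keep every block output for the decoder skip connections
        skips = []
        for block in self.blocks:
            x = block(x)
            skips.append(x)
        return x, skips  # final x: (batch, 256, 4, time')
```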
Preferably, the gated recurrent unit GRU in S4 is defined as follows:

z_t = σ(W_z x_t + U_z h_{t-1} + b_z)    (3)

r_t = σ(W_r x_t + U_r h_{t-1} + b_r)    (4)

h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}) + b_h)    (5)

h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t    (6)

where x_t, h_t, h̃_t, z_t and r_t denote the input, hidden state, candidate hidden state, update gate and reset gate at time t, respectively. The sequence modeling uses 2 GRU layers with 1024 units; to fit the input shape required by the GRU, the frequency and depth dimensions of the encoder output are flattened into 1024 dimensions to form a sequence of feature vectors, which is then fed to the GRU layers, and the GRU output is reshaped back to 256 dimensions to meet the decoder input requirement.
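A sketch of this sequence-modeling bridge under the same assumptions (encoder output of 256 channels by 4 frequency bins flattened to 1024 features, 2 GRU layers of 1024 units); the class name and reshaping details are illustrative.

```python
import torch.nn as nn

class GRUBridge(nn.Module):
    """Flatten the (channel=256, freq=4) axes of the encoder output into 1024
    features per frame, run 2 GRU layers of 1024 units over time, and reshape
    back to (256, 4) so the decoder can consume it."""
    def __init__(self, feat: int = 1024, layers: int = 2):
        super().__init__()
        self.gru = nn.GRU(input_size=feat, hidden_size=feat,
                          num_layers=layers, batch_first=True)

    def forward(self, x):
        b, c, f, t = x.shape                               # (batch, 256, 4, time)
        seq = x.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (batch, time, 1024)
        out, _ = self.gru(seq)
        out = out.reshape(b, t, c, f).permute(0, 2, 3, 1)  # back to (batch, 256, 4, time)
        return out
```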
Preferably, the decoder module in S4 consists of 5 deconvolution layers corresponding to the encoder convolution layers, but each input is the skip connection (concatenation) of the previous layer output with the output of the corresponding encoder layer; the high-level convolutional features are fused with the RNN output context information, feature information at the corresponding scale is introduced into the deconvolution process, and multi-scale, multi-level information is provided for speech signal reconstruction. Accordingly, the input/output channel dimensions of the layers are (512, 128), (256, 64), (128, 32), (64, 16) and (32, 1), the convolution kernel size and stride are the same as those of the encoder, the padding is 0, the activation function of the first 4 deconvolution layers is ELU and that of the last layer is ReLU to avoid negative values; the decoder finally maps the low-resolution feature map output by the encoder back to the original input size.
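A sketch of such a decoder with skip connections under the stated channel widths and activations; the output_padding on the fourth block and the size-alignment helper are assumptions used only so that the frequency and time dimensions line up in this sketch.

```python
import torch
import torch.nn as nn

def _align(a: torch.Tensor, b: torch.Tensor):
    """Crop two (batch, channel, freq, time) tensors to their common freq/time size;
    a sketch-level stand-in for the exact causal-padding/trimming bookkeeping."""
    f = min(a.shape[-2], b.shape[-2])
    t = min(a.shape[-1], b.shape[-1])
    return a[..., :f, :t], b[..., :f, :t]

class Decoder(nn.Module):
    """Five ConvTranspose2d blocks mirroring the encoder. Each block receives the
    previous output concatenated with the matching encoder feature map (skip
    connection), hence the doubled input channels (512,128)...(32,1); ELU on the
    first four blocks, ReLU on the last to keep magnitudes non-negative. The
    output_padding on the fourth block is an assumption made only so the 161-bin
    frequency axis is recovered in this sketch."""
    def __init__(self):
        super().__init__()
        chans = [(512, 128), (256, 64), (128, 32), (64, 16), (32, 1)]
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, kernel_size=(3, 2), stride=(2, 1),
                                   padding=0, output_padding=(1, 0) if i == 3 else 0),
                nn.ELU() if i < 4 else nn.ReLU(),
            )
            for i, (c_in, c_out) in enumerate(chans)
        ])

    def forward(self, x, skips):
        # skips: encoder outputs, shallowest first; consume them deepest first
        for block, skip in zip(self.blocks, reversed(skips)):
            x, skip = _align(x, skip)
            x = block(torch.cat([x, skip], dim=1))
        return x
```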
Preferably, the loss function in S5 consists of an ideal ratio mask loss and a scale-invariant signal-to-distortion ratio loss, with the formula:

L = L_Mask + λ·L_SI-SDR    (7)

where the parameter λ takes the value 1.0. The Mask loss is a time-frequency-domain loss that computes the mean square error (MSE) between the enhanced and the clean speech magnitude spectra:

L_Mask = (1/N) Σ_{l,m} ( |Ŝ_l(m)| - |S_l(m)| )²    (8)

where N is the number of time-frequency units. L_SI-SDR is the SI-SDR loss function, applied directly in the signal time domain and defined as:

L_SI-SDR = -10 log10( ||α·s||_2² / ||ŝ - α·s||_2² ),  with  α = ⟨ŝ, s⟩ / ||s||_2²    (9)

where ŝ denotes the enhanced time-domain signal, s the clean signal, ⟨·,·⟩ the dot product and ||·||_2 the L2 norm.
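A sketch of this combined loss, assuming the standard SI-SDR definition above; the zero-mean step, the small epsilon and the function names are assumptions added for numerical stability and illustration.

```python
import torch
import torch.nn.functional as F

def si_sdr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SDR on time-domain waveforms of shape (batch, samples)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref                     # scaled projection onto the clean signal
    noise = est - target
    si_sdr = 10.0 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_sdr.mean()

def total_loss(est_mag: torch.Tensor, clean_mag: torch.Tensor,
               est_wav: torch.Tensor, clean_wav: torch.Tensor,
               lam: float = 1.0) -> torch.Tensor:
    """L = L_Mask + lambda * L_SI-SDR, with L_Mask as the magnitude-spectrum MSE."""
    return F.mse_loss(est_mag, clean_mag) + lam * si_sdr_loss(est_wav, clean_wav)
```

In practice the time-domain term requires reconstructing the waveform inside the training graph (for example with torch.istft and the noisy phase, an assumption here) so that the SI-SDR loss remains differentiable.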
Preferably, in step S6 the PESQ value ranges from -0.5 to 4.5, STOI from 0 to 1, and SI-SDR from 0 to infinity; PESQ is normalized to the range 0 to 1, the SI-SDR is multiplied by a coefficient β with a value of 0.1, the sums of the three metric scores are computed and compared, and the best model is saved; if the validation loss does not decrease for 10 consecutive rounds, training is stopped using early stopping; the trained model is then compared with other models.
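A sketch of the combined model-selection score described above; the rescaling of PESQ from [-0.5, 4.5] to [0, 1] and β = 0.1 follow the text, while the function name is illustrative.

```python
def selection_score(pesq: float, stoi: float, si_sdr: float, beta: float = 0.1) -> float:
    """Combined validation score used for model selection: PESQ rescaled from
    [-0.5, 4.5] to [0, 1], STOI kept in [0, 1], SI-SDR (in dB) weighted by beta = 0.1."""
    pesq_norm = (pesq + 0.5) / 5.0
    return pesq_norm + stoi + beta * si_sdr
```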
The beneficial effects of the invention are as follows:
(1) The invention inherits the advantages of the CRN model and, by improving the CRN model structure and loss function, reduces the model parameters and improves the speech enhancement effect.
(2) Through the encoder-decoder structure, the improved model makes full use of the local spatial information in the input STFT magnitude spectrum; the improved model does not use future speech frame information and can perform real-time speech enhancement based on the current frame; the GRU layers in the improved model capture the temporal dependencies of the speech signal, which is important for noise- and speaker-independent speech enhancement.
(3) Compared with the original CRN model, the improved model reduces the number of training parameters from 17.58M to 13.38M, a reduction of nearly 24%, and converges faster. On the objective evaluation metrics of the VoiceBank-DEMAND data set, the improved model improves PESQ by 2%, STOI by 0.2% and SI-SDR by 2% compared with the CRN.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a model network architecture of the present invention;
FIG. 2 is a graph of a noisy speech signal of the present invention;
FIG. 3 is a graph of a clean speech utterance of the present invention;
FIG. 4 is a conventional CRN enhanced speech spectrogram;
FIG. 5 is a graph of the speech signal enhanced by the improved CRN of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which some, but not all embodiments of the invention are shown. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Experimental studies were carried out under the Ubuntu system using the Jupyter development tool. The Python libraries involved include, but are not limited to: NumPy, Matplotlib, Librosa, SoundFile, PyTorch, etc.
The voice file is loaded with Librosa and STFT-transformed; the FFT window size n_fft is 320, the window length win_len is 320, and the frame shift hop_len is set to 160.
The encoder consists of 5 convolutional layers with input/output channel dimensions (1, 16), (16, 32), (32, 64), (64, 128), (128, 256); two-dimensional convolution is used with a kernel size of 3×2, a stride of (2, 1) and padding of (0, 1); BatchNorm2d is used for regularization and ELU as the activation function.
Sequence modeling uses 2 GRU layers with 1024 units. To fit the input shape required by the GRU, the frequency and depth dimensions of the encoder output are flattened to 1024 dimensions to produce a sequence of feature vectors, which is then input to the GRU layers. The GRU output is reshaped back to 256 dimensions to meet the decoder input requirements.
The decoder module consists of 5 deconvolution layers corresponding to the encoder convolution layers; each input is the skip connection (concatenation) of the previous layer output with the output of the corresponding encoder layer, so the high-level convolutional features are fused with the RNN output context information, feature information at the corresponding scale is introduced into the deconvolution process, and multi-scale, multi-level information is provided for speech signal reconstruction. Accordingly, the input/output dimensions of the layers are (512, 128), (256, 64), (128, 32), (64, 16), (32, 1), the convolution kernel size and stride are the same as those of the encoder, the padding is 0, the activation function of the first 4 deconvolution layers is ELU and that of the last layer is ReLU.
During model training, the mini-batch size is 32 and batches are formed at sentence level. The Adam optimizer is used with a learning rate lr of 1e-4 and smoothing constants β_1 and β_2 of 0.9 and 0.999, respectively; the loss function is L = L_Mask + λ·L_SI-SDR as defined above.
The number of training rounds is 200 epochs, and early stopping is adopted: if the validation loss does not decrease for 10 consecutive rounds, training is stopped. Table 1 gives the algorithmic pseudocode of the present invention.
TABLE 1
[Table 1 is reproduced as an image in the original publication.]
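As a companion to the pseudocode in Table 1, a minimal training-loop sketch under the settings above (mini-batch 32, Adam with lr 1e-4 and betas 0.9/0.999, up to 200 epochs, early stopping with patience 10); total_loss, istft_with_noisy_phase and validate are assumed helpers, not part of the original disclosure.

```python
import torch

def train(model, train_loader, val_loader, epochs: int = 200, patience: int = 10):
    """Sketch of the training procedure: Adam (lr 1e-4, betas 0.9/0.999),
    up to 200 epochs, early stopping after 10 epochs without validation
    improvement. total_loss, istft_with_noisy_phase and validate are
    assumed helpers."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    best_val, stale = float("inf"), 0
    for _ in range(epochs):
        model.train()
        for noisy_mag, clean_mag, noisy_stft, clean_wav in train_loader:
            opt.zero_grad()
            est_mag = model(noisy_mag)                                 # enhanced magnitude spectrum
            est_wav = istft_with_noisy_phase(est_mag, noisy_stft)      # assumed helper: ISTFT with the noisy phase
            loss = total_loss(est_mag, clean_mag, est_wav, clean_wav)  # loss sketched earlier
            loss.backward()
            opt.step()
        val_loss = validate(model, val_loader)                         # assumed helper: mean validation loss
        if val_loss < best_val:
            best_val, stale = val_loss, 0
            torch.save(model.state_dict(), "best_model.pt")            # keep the best checkpoint
        else:
            stale += 1
            if stale >= patience:                                      # early stopping
                break
```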
The comparison of the improved CRN training model of the present invention with other models in the VoiceBank-DEMAND dataset is shown in Table 2.
TABLE 2
[Table 2 is reproduced as an image in the original publication.]
Here CRN+ denotes the model of the invention after the CRN improvements. The results show that CRN+ outperforms the existing models in perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI), and can generate an enhanced speech signal with high perceptual quality and high signal-to-noise ratio from noisy speech.
The preferred embodiments of the invention disclosed above are intended only to assist in explaining the invention. The preferred embodiments are neither exhaustive nor intended to limit the invention to the precise form disclosed.

Claims (9)

1. The single-channel voice enhancement method based on the improved CRN is characterized by comprising the following steps:
S1: downloading the VoiceBank-DEMAND data set;
S2: applying a window function to the speech signal frame by frame to compute the short-time Fourier transform and obtain magnitude spectrum features;
S3: inputting the features into a convolutional encoder that performs downsampling and extracts high-dimensional features from the input spectrogram;
S4: building an RNN module with gated recurrent units GRU for sequence modeling, and reshaping the GRU output back to the dimensions required by the decoder input;
S5: computing the loss of the enhanced speech signal relative to the clean speech and back-propagating it to update the model parameters;
S6: reconstructing the enhanced speech signal from the enhanced magnitude spectrum and the phase spectrum of the noisy signal through the inverse short-time Fourier transform; training the model for 200 epochs, computing the evaluation metrics speech perceptual quality PESQ, short-time objective intelligibility STOI and scale-invariant signal-to-distortion ratio SI-SDR, comparing the sums of the three scores, and saving the best model.
2. The single-channel speech enhancement method based on improved CRN according to claim 1, wherein in the VoiceBank-DEMAND data set in S1 the noise comes from the DEMAND data set and the clean speech from VoiceBank, the audio sampling rate is 48 kHz, the mixed speech used as training input uses 10 noise types, 2 artificially synthesized and 8 from DEMAND, synthesized at different signal-to-noise ratio settings; the test-set input is synthesized from the remaining 5 DEMAND noise types and 2 VoiceBank speakers at signal-to-noise ratios different from the training set; the training set and the test set are each divided into clean and noisy folders containing one-to-one corresponding clean and noisy speech, the training set contains 28 speakers, the test set contains 2 speakers, and each speaker has about 400 sentences.
3. The single-channel speech enhancement method based on improved CRN according to claim 2, wherein in S1 the speech data set is batch-resampled to 16 kHz using the audio processing tool ffmpeg.
4. The single-channel speech enhancement method based on improved CRN according to claim 2, wherein the noisy speech signal y(n) in S2 is:

y(n) = s(n) + d(n)    (1)

where s(n) represents the clean speech and d(n) represents the noise signal;

the short-time Fourier transform is expressed as:

y_l(m) = s_l(m) + d_l(m)    (2)

where l denotes the frame index and m the frequency bin index; the transformed magnitude spectrum feature has 161 dimensions.
5. The method of claim 1, wherein the encoder in S3 consists of 5 convolutional layers with input/output channel dimensions (1, 16), (16, 32), (32, 64), (64, 128), (128, 256), using two-dimensional convolution with a kernel size of 3×2, a stride of (2, 1) and padding of (0, 1); BatchNorm2d is used for regularization and ELU as the activation function, which converges faster than ReLU.
6. The method for single-channel speech enhancement based on improved CRN according to claim 1, wherein the gated recurrent unit GRU in S4 is defined as follows:

z_t = σ(W_z x_t + U_z h_{t-1} + b_z)    (3)

r_t = σ(W_r x_t + U_r h_{t-1} + b_r)    (4)

h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}) + b_h)    (5)

h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t    (6)

where x_t, h_t, h̃_t, z_t and r_t denote the input, hidden state, candidate hidden state, update gate and reset gate at time t, respectively; the sequence modeling uses 2 GRU layers with 1024 units; to fit the input shape required by the GRU, the frequency and depth dimensions of the encoder output are flattened into 1024 dimensions to form a sequence of feature vectors, which is then fed to the GRU layers, and the GRU output is reshaped back to 256 dimensions to meet the decoder input requirement.
7. The method for single-channel speech enhancement based on improved CRN according to claim 1, wherein the decoder module in S4 consists of 5 deconvolution layers corresponding to the encoder convolution layers, but each input is the skip connection of the previous layer output with the output of the corresponding encoder layer; the high-level convolutional features are fused with the RNN output context information, feature information at the corresponding scale is introduced into the deconvolution process, and multi-scale, multi-level information is provided for speech signal reconstruction; the input/output dimensions of the layers are (512, 128), (256, 64), (128, 32), (64, 16) and (32, 1), the convolution kernel size and stride are the same as those of the encoder, the padding is 0, the activation function of the first 4 deconvolution layers is ELU and that of the last layer is ReLU to avoid negative values; the decoder finally maps the low-resolution feature map output by the encoder back to the original input size.
8. The method for single-channel speech enhancement based on improved CRN according to claim 1, wherein the loss function in S5 consists of an ideal ratio mask loss and a scale-invariant signal-to-distortion ratio loss, with the formula:

L = L_Mask + λ·L_SI-SDR    (7)

where the parameter λ takes the value 1.0; the Mask loss is a time-frequency-domain loss that computes the mean square error between the enhanced and the clean speech magnitude spectra:

L_Mask = (1/N) Σ_{l,m} ( |Ŝ_l(m)| - |S_l(m)| )²    (8)

where N is the number of time-frequency units; L_SI-SDR is the SI-SDR loss function, applied directly in the signal time domain and defined as:

L_SI-SDR = -10 log10( ||α·s||_2² / ||ŝ - α·s||_2² ),  with  α = ⟨ŝ, s⟩ / ||s||_2²    (9)

where ŝ denotes the enhanced time-domain signal, s the clean signal, ⟨·,·⟩ the dot product and ||·||_2 the L2 norm.
9. The method for single-channel speech enhancement based on improved CRN according to claim 1, wherein in S6 PESQ ranges from -0.5 to 4.5, STOI from 0 to 1 and SI-SDR from 0 to infinity; PESQ is normalized to the range 0 to 1, SI-SDR is multiplied by a coefficient β with a value of 0.1, the sums of the three metric scores are computed and compared, and the best model is saved; if the validation loss does not decrease for 10 consecutive rounds, training is stopped by early stopping; the trained model is compared with other models.
CN202211598075.5A 2022-12-12 2022-12-12 Single-channel voice enhancement method based on improved CRN Pending CN116013339A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211598075.5A CN116013339A (en) 2022-12-12 2022-12-12 Single-channel voice enhancement method based on improved CRN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211598075.5A CN116013339A (en) 2022-12-12 2022-12-12 Single-channel voice enhancement method based on improved CRN

Publications (1)

Publication Number Publication Date
CN116013339A true CN116013339A (en) 2023-04-25

Family

ID=86029354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211598075.5A Pending CN116013339A (en) 2022-12-12 2022-12-12 Single-channel voice enhancement method based on improved CRN

Country Status (1)

Country Link
CN (1) CN116013339A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117079634A (en) * 2023-10-16 2023-11-17 深圳波洛斯科技有限公司 Active noise reduction method for audio
CN117079634B (en) * 2023-10-16 2023-12-22 深圳波洛斯科技有限公司 Active noise reduction method for audio


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination