CN112735456A - Speech enhancement method based on DNN-CLSTM network - Google Patents

Speech enhancement method based on DNN-CLSTM network

Info

Publication number
CN112735456A
CN112735456A (application number CN202011323987.2A)
Authority
CN
China
Prior art keywords
network
amplitude
speech
signal
mfcc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011323987.2A
Other languages
Chinese (zh)
Other versions
CN112735456B (en)
Inventor
汪友明
张天琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications
Priority to CN202011323987.2A
Publication of CN112735456A
Application granted
Publication of CN112735456B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention relates to a speech enhancement method based on a DNN-CLSTM network, which combines a deep neural network with a convolutional and residual long short-term memory network. The method inputs the speech amplitude feature obtained by spectral subtraction and the Mel-frequency cepstral coefficient (MFCC) feature obtained via the fast Fourier transform into a DNN-CLSTM network model to achieve speech enhancement. First, the noisy speech is subjected to time-frequency masking and windowed framing, its amplitude and phase features are obtained with the fast Fourier transform, and the noise amplitude of the noisy speech is estimated; the estimated noise amplitude is then subtracted from the noisy speech amplitude to obtain the spectrally subtracted speech amplitude as the first input feature of the neural network. Second, a fast Fourier transform (FFT) is applied to the noisy speech to obtain the spectral-line energy of the speech signal, from which the MFCC features of the noisy speech are computed as the second feature. The two features are fed into the DNN-CLSTM network for training to obtain a network model, and a minimum mean square error (MMSE) loss function is used as the index for evaluating the model's effectiveness. Finally, the actual noisy speech set is input into the trained speech enhancement network model to predict the enhanced estimated amplitude and MFCC, and the final enhanced speech signal is obtained with the inverse Fourier transform. The invention achieves high speech fidelity.

Description

Speech enhancement method based on DNN-CLSTM network
Technical Field
The invention belongs to the technical field of voice enhancement, and particularly relates to a voice enhancement method based on a DNN-CLSTM network.
Background
With the development of technology, speech not only serves for information transmission between people but is also used extensively in human-computer interaction. However, in everyday communication the speech signal is often accompanied by a large amount of noise, such as factory noise, car noise, or background noise such as restaurant noise. A speech signal containing a large amount of noise causes considerable interference when the receiver tries to extract the useful information it carries. In response to this problem, speech signal enhancement techniques have attracted much attention.
Speech enhancement refers to the process of separating noise from a speech signal when the real speech is disturbed by noise. Speech enhancement technology is widely used in many fields, such as mobile communication and speech recognition. Its main purpose is to improve speech quality and speech intelligibility. At present, speech enhancement methods fall mainly into three categories: spectral subtraction, subspace algorithms, and statistical-model-based algorithms. With the development of deep learning, neural networks have also been applied to the field of speech enhancement.
The spectral subtraction method shown in fig. 1 is one of the earliest denoising techniques in speech enhancement. It is based on the following principle: the noise is assumed to be additive, i.e., y(m) = x(m) + n(m), where y(m) is the noisy signal, x(m) is the clean speech signal, and n(m) is the additive noise; by subtracting an estimate of the noise spectrum from the noisy speech signal, a clean speech signal is obtained. A prerequisite of this assumption is that the noise is stationary, so that during speech segments where the target signal is absent the noise can be estimated and updated.
Spectral subtraction is a relatively simple speech enhancement algorithm. Its principle is to subtract an estimated noise amplitude spectrum from the amplitude spectrum of the input mixed speech signal and, exploiting the insensitivity of the human ear to phase, to synthesize the final spectrally subtracted speech signal by directly applying the phase information from before spectral subtraction. Since the method involves only one Fourier transform and one inverse Fourier transform, it is computationally inexpensive and easy to implement. In practice, however, many noises are non-stationary, so a speech signal enhanced with spectral subtraction often contains a large amount of musical noise, which distorts the speech signal and degrades its intelligibility and quality.
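For orientation, the classical pipeline can be summarized in the following sketch (a minimal NumPy/SciPy illustration of magnitude spectral subtraction that reuses the noisy phase for synthesis; the frame length, frame shift, and the number of leading noise-only frames are illustrative assumptions, not values prescribed by the invention):

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=5, frame_len=320, hop=160):
    """Basic magnitude spectral subtraction (illustrative sketch)."""
    # Short-time Fourier transform with a Hamming window
    f, t, Y = stft(noisy, fs, window='hamming',
                   nperseg=frame_len, noverlap=frame_len - hop)
    mag, phase = np.abs(Y), np.angle(Y)
    # Estimate the noise magnitude from the first few (assumed noise-only) frames
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract the noise estimate and half-wave rectify (no negative magnitudes)
    enhanced_mag = np.maximum(mag - noise_mag, 0.0)
    # Re-use the noisy phase and synthesize the enhanced waveform
    _, enhanced = istft(enhanced_mag * np.exp(1j * phase), fs, window='hamming',
                        nperseg=frame_len, noverlap=frame_len - hop)
    return enhanced
```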
Disclosure of Invention
The invention aims to solve the problems of speech signal distortion, poor signal intelligibility and speech quality and the like in the speech enhancement process based on spectral subtraction. In order to achieve the above object, the present invention provides a speech enhancement method based on DNN-CLSTM network, which is characterized by comprising the following steps:
Step one: acquiring at least two noisy speech signals, wherein each noisy speech signal is formed by adding a clean speech signal and a noise signal:
y(m)=x(m)+n(m)
wherein y (m) is a noisy speech signal containing noise, x (m) is a clean speech signal, n (m) is a noise signal, and m is a discrete time sequence;
step two: framing and windowing are performed, and the amplitude and the phase of the pure voice signal and the noisy voice signal are obtained as first characteristics: carrying out windowing and framing processing on the noise-containing voice signal, and obtaining the amplitude and the phase of the noise-containing voice signal by using discrete Fourier transform; meanwhile, in the voice signal section which does not contain the target signal and only contains noise, the noise is estimated, and the amplitude of the noise signal is calculated;
Step three: subtracting the noise signal amplitude from the amplitude of the noisy speech signal to obtain the spectral-subtraction speech signal amplitude as a second feature;
Step four: obtaining the MFCC as a third feature;
Step five: establishing a DNN-CLSTM network model for training;
The spectral-subtraction speech signal amplitude and the MFCC of the noisy speech are input into the DNN-CLSTM network for training to obtain an enhanced estimated amplitude and MFCC; the minimum mean square errors between the estimated amplitude and the clean amplitude and between the estimated MFCC and the clean MFCC are computed, and the resulting errors are fed back into the neural network as adjustment signals to optimize it, yielding the trained network.
The concrete process of the step four is as follows:
(1) pretreatment: the preprocessing comprises pre-emphasis, framing and windowing functions;
pre-emphasis processing: the method is realized by a first-order high-pass filter, and the transfer function of the filter is as follows:
H(z) = 1 - a·z^(-1)
wherein a is a pre-emphasis coefficient, and is generally 0.98;
the result of the pre-emphasis processing of the speech signal x (n) is:
y(n)=x(n)-ax(n-1)
framing and windowing: the overlapping portion between two adjacent frames is the frame shift, set to 10 ms; windowing function: each frame of the speech signal is multiplied by a Hamming window. After framing and windowing, y(n) yields the i-th frame signal y_i(n), defined as:
y_i(n) = ω(n) · y((i-1)·inc + n),  0 ≤ n ≤ L-1
where ω(n) is the Hamming window, whose expression is
ω(n) = 0.54 - 0.46·cos(2πn/(L-1)),  0 ≤ n ≤ L-1
where y_i(n) represents the i-th frame of the speech signal, n the sample index, L the frame length, and inc the frame shift in samples;
(2) fast Fourier Transform (FFT)
A fast Fourier transform is performed on each frame of the speech signal y_i(n) to obtain the spectrum of each frame, expressed as follows:
Y(i,k) = FFT[y_i(n)]
wherein k represents the kth spectral line in the frequency domain;
(3) calculating spectral line energy
The energy E(i,k) of the spectral lines of each speech frame in the frequency domain is expressed as:
E(i,k) = |Y(i,k)|^2
(4) calculating the energy passed through the Mel Filter
The energy S(i,m) of each frame's spectral-line energy after passing through the Mel filterbank is defined as:
S(i,m) = Σ_{k=0}^{N-1} E(i,k)·H_m(k),  0 ≤ m < M
where N represents the number of FFT points;
the transfer function H_m(k) of each filter is
H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1);
where f(m) is the center frequency of the m-th filter, m indexes the filters, and M is the number of filters;
(5) computing MFCC
The logarithm of the Mel filterbank energies is taken and a discrete cosine transform is then computed to obtain the MFCC feature parameters, as in the following formula:
mfcc(i,j) = sqrt(2/M) · Σ_{m=1}^{M} log[S(i,m)] · cos(π·j·(2m-1)/(2M))
where j is the DCT spectral line.
The concrete process of step five is as follows:
(1) DNN network establishment
Input layer: the spectrally subtracted speech amplitude and MFCC features are input into the DNN; the input layer has 128 neuron nodes;
Fully connected layer: 32 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 128 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 512 nodes, dropout rate 0.5, ReLU activation;
(2) multi-target feature fusion:
combining the enhanced amplitude and MFCC features of the DNN network with the amplitude and MFCC features of the original noisy speech;
the combined feature is the concatenation [ŷ_k^mfcc, ŷ_k^amp, y_k^mfcc, y_k^amp],
where ŷ_k^mfcc and ŷ_k^amp denote, respectively, the MFCC features and the speech amplitude predicted by the DNN in the k-th spatial domain, and y_k^mfcc and y_k^amp denote, respectively, the MFCC features and the speech amplitude of the original noisy speech in the k-th spatial domain;
(3) C-LSTM network:
(a) CNN:
Convolutional layer: the result obtained from the DNN is convolved; the number of nodes is set to 64, the stride to 1, the convolution kernel to 5 × 1, and the activation function to SELU;
BN layer: the data are normalized;
Convolutional layer: the number of nodes is set to 64, the stride to 1, the convolution kernel to 3 × 1, and the activation function to SELU;
BN layer: the data are normalized;
Convolutional layer: the number of nodes is set to 128, the stride to 1, and the convolution kernel to 5 × 1;
(b) Residual network:
the result obtained from the DNN is convolved; the number of nodes is set to 128, the stride to 1, and the convolution kernel to 5 × 1;
the data obtained by the residual network are combined with the data obtained by the CNN, and a SELU activation function is then applied;
Max pooling layer: the stride is set to 1 and the pooling size to 2;
(c) LSTM network:
the bidirectional nodes of the long short-term memory network are all set to 128, and the activation function is the Sigmoid function;
(4) Output layer:
two feedforward neural networks are used as output layers to output the predicted speech signal amplitude and MFCC; the network parameters of the model are optimized with the Adam optimizer; all convolutional layers use edge padding.
(5) Calculating the minimum mean square error objective function:
E = (1/T) · Σ_{k=1}^{T} || ŷ_k - y_k ||²
where T = 2 is the number of acoustic feature spaces, ŷ_1 and ŷ_2 denote the predicted MFCC feature vector and the predicted amplitude feature in their respective acoustic feature spaces, and y_1 and y_2 denote the corresponding clean MFCC feature vector and clean amplitude feature.
The invention has the following advantages: the enhanced voice signal is stable, and has high fidelity and good voice quality.
The present invention will be described in detail below with reference to the accompanying drawings and examples.
Drawings
Fig. 1 is a flow chart of a conventional spectral subtraction method.
Fig. 2 is a flow chart of a DNN-CLSTM network based speech enhancement method in the training phase.
Fig. 3 is a flow chart of a DNN-CLSTM network-based speech enhancement method in a test phase.
FIG. 4 is a MFCC derivation flow diagram.
FIG. 5 is a process architecture diagram for building a DNN-CLSTM neural network.
FIG. 6 is a spectrogram of clean speech.
FIG. 7 is a spectrogram of noisy speech.
FIG. 8 is a spectrogram after processing by the DNN speech enhancement method.
FIG. 9 is a spectrogram after processing by the CNN speech enhancement method.
FIG. 10 is a spectrogram after processing by the LSTM speech enhancement method.
FIG. 11 is a spectrogram after processing by the GRU speech enhancement method.
FIG. 12 is a spectrogram after processing by the DNN-CLSTM speech enhancement method.
Detailed Description
To overcome the defect that, after a speech signal is enhanced with spectral subtraction, the result often contains a large amount of musical noise, distorting the speech signal and degrading its intelligibility and quality, this embodiment provides a speech enhancement method based on a DNN-CLSTM network (as shown in figs. 2 and 3), which includes the following steps:
Two noisy speech signals are acquired (or more than two, selected according to actual needs); a noisy speech signal consists of a clean speech signal and a noise signal:
y(m)=x(m)+n(m)
where y (m) is a speech signal containing noise, x (m) is a clean speech signal, and n (m) is a noise signal.
A training stage:
1. framing and windowing
The amplitude and phase of the clean speech signal and of the noisy speech signal are obtained; the processing of the noisy speech is taken as an example:
carrying out windowing and framing processing on the noise-containing voice signal, and obtaining the amplitude and the phase of the noise-containing voice signal by using discrete Fourier transform as first characteristics;
The noisy speech signal is windowed and framed with a Hamming window:
y_w(m) = w(m)·y(m) = w(m)[x(m) + n(m)] = x_w(m) + n_w(m)
In the frequency domain the windowing operation is expressed as:
Y_w(f) = W(f) * Y(f) = X_w(f) + N_w(f)
the subscript w of the signal is omitted for simplicity, assuming that the signal is windowed.
Y_w(f) is expressed in polar form:
Y(f) = |Y(f)| · e^(j·φ_y(f))
where |Y(f)| is the amplitude spectrum and φ_y(f) = Phase[Y(f)] is the phase.
2. Noise estimation:
The noise is estimated in the speech signal segments that contain only noise and no target signal, and the noise amplitude is computed; the invention takes the first five frames of the speech signal as the noise segment. Since the noise magnitude spectrum |N(f)| is unknown, it is replaced by an estimate of the average magnitude spectrum obtained in the absence of speech activity, and the noise phase φ_n(f) is replaced by the phase φ_y(f) of the noisy speech signal. During the speech-absent, noise-only period the average noise magnitude spectrum |N̄(f)| is computed as:
|N̄(f)| = (1/k) · Σ_{i=1}^{k} |N_i(f)|
where |N_i(f)| is the magnitude spectrum of the i-th noise frame and k is the number of frames in the pure-noise period.
3. Spectral subtraction method
Subtracting the noise signal amplitude from the amplitude of the noisy speech signal gives the spectral-subtraction speech signal amplitude as a second feature (the enhanced speech signal obtained by spectral subtraction is referred to as the spectral-subtraction speech signal, and its amplitude as the spectral-subtraction speech signal amplitude, to distinguish it from the enhanced speech signal finally obtained in this implementation);
the amplitude of the noise signal is subtracted from the amplitude of the noise-containing speech signal, and the calculation is as follows:
Figure RE-RE-GDA0002977011930000084
wherein the content of the first and second substances,
Figure RE-RE-GDA0002977011930000085
representing the amplitude of the spectrally subtracted speech signal, | Y (f) luminancebRepresenting the amplitude of the noisy speech signal,
Figure RE-RE-GDA0002977011930000086
represents the average of the noise statistics representing the noise segment. α represents a spectral subtraction noise figure. b is a power exponent, and is a magnitude spectrum subtraction when the exponent b is 1, and is a power spectrum subtraction when the exponent b is 2.
Since the estimation of the noise signal may generate errors, the amplitude spectrum of the estimated signal is caused
Figure RE-RE-GDA0002977011930000087
A negative value may be possible. Generally the value of the amplitude spectrum should not be negative in order to avoid
Figure RE-RE-GDA0002977011930000091
And (3) carrying out half-wave rectification on the differential spectrum when the differential spectrum is a negative value, wherein the calculation process is as follows:
Figure RE-RE-GDA0002977011930000092
after spectral subtraction, the speech signal after spectral subtraction needs to be subjected to power reduction, so as to obtain the speech signal amplitude after the spectral subtraction stage, i.e. the amplitude of the spectral subtraction speech signal
Figure RE-RE-GDA0002977011930000093
The calculation process is as follows:
Figure RE-RE-GDA0002977011930000094
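A minimal sketch of this magnitude-domain computation is given below; the over-subtraction factor alpha is an assumed example value, and the array shapes are illustrative assumptions rather than details taken from the invention:

```python
import numpy as np

def spectral_subtract_magnitude(noisy_mag, noise_mag, alpha=2.0, b=1):
    """Spectral-subtraction amplitude feature per the formulas above (sketch).

    noisy_mag : |Y(f)| per frame, shape (frames, bins)
    noise_mag : average noise magnitude |N_bar(f)|, shape (bins,)
    alpha     : over-subtraction (noise) factor -- an assumed example value
    b         : 1 for magnitude spectral subtraction, 2 for power spectral subtraction
    """
    diff = noisy_mag ** b - alpha * noise_mag ** b   # |Y|^b - alpha * |N_bar|^b
    diff = np.maximum(diff, 0.0)                     # half-wave rectification
    return diff ** (1.0 / b)                         # power reduction back to a magnitude
```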
4. Extracting the MFCC as a third feature;
(1) pretreatment of
The preprocessing comprises pre-emphasis, framing, and windowing. The pre-emphasis is implemented by a first-order high-pass filter whose transfer function is:
H(z) = 1 - a·z^(-1)
where a is the pre-emphasis coefficient, generally taken as 0.98.
The result of the pre-emphasis processing of the speech signal x (n) is:
y(n)=x(n)-ax(n-1)
framing refers to the process of generating speech in which the speech signal is known to be a non-stationary time-varying signal. The speech signal in a short time can be considered as a steady time-invariant signal, the short time is usually 10-30 ms, and is taken as 20ms in this text. Therefore, a speech signal is usually analyzed and processed by a short-time analysis technique, many frames are used to analyze their characteristic parameters, and in order to make a smooth transition between frames, a portion where there is an overlap between two adjacent frames, i.e., a frame shift, is set to 10 ms. The purpose of the windowing function is to reduce the spectral leakage in the frequency domain, and each frame of speech signal is windowed, usually with a hamming window, which has a smaller spectral leakage than a rectangular window. y (n) is processed by frame division and window addition to obtain yi(n), which is defined as:
Figure RE-RE-GDA0002977011930000101
where ω (n) is a Hamming window whose expression is
Figure RE-RE-GDA0002977011930000102
Wherein, yi(n) represents the i-th frame speech signal, n represents the number of samples, and L represents the frame length.
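The pre-emphasis, framing, and Hamming windowing described above can be sketched as follows (320-sample frames and a 160-sample shift correspond to 20 ms and 10 ms at 16 kHz; these concrete numbers are assumptions of the sketch, not values fixed by the invention):

```python
import numpy as np

def preemphasis_frame_window(x, a=0.98, frame_len=320, hop=160):
    """Pre-emphasis, framing and Hamming windowing (illustrative sketch)."""
    # Pre-emphasis: y(n) = x(n) - a*x(n-1)
    y = np.append(x[0], x[1:] - a * x[:-1])
    # Split into overlapping frames (frame shift = hop samples)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop: i * hop + frame_len] for i in range(n_frames)])
    # Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(L-1))
    return frames * np.hamming(frame_len)
```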
(2) Fast Fourier Transform (FFT)
Since the characteristics of a speech signal are usually difficult to observe in the time domain, it is normally transformed to the frequency domain for analysis, where different spectra represent different characteristics of the signal. A fast Fourier transform is therefore applied to each frame of the speech signal y_i(n) to obtain the spectrum of each frame, as shown in the following formula:
Y(i,k) = FFT[y_i(n)]
where k denotes the kth spectral line in the frequency domain.
(3) Calculating spectral line energy
The energy E(i,k) of the spectral lines of each speech frame in the frequency domain can be expressed as:
E(i,k) = |Y(i,k)|^2
(4) calculating the energy passed through the Mel Filter
The energy S(i,m) of each frame's spectral-line energy after passing through the Mel filterbank can be defined as:
S(i,m) = Σ_{k=0}^{N-1} E(i,k)·H_m(k),  0 ≤ m < M
where N represents the number of FFT points.
The transfer function H_m(k) of each filter is
H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1);
where f(m) is the center frequency of the m-th filter, m indexes the filters, and M is the number of filters, usually set to 24.
(5) Calculating the MFCC: referring to FIG. 4, the logarithm of the Mel filterbank energies is taken and a discrete cosine transform (DCT) is then computed to obtain the MFCC feature parameters, as shown in the following formula:
mfcc(i,j) = sqrt(2/M) · Σ_{m=1}^{M} log[S(i,m)] · cos(π·j·(2m-1)/(2M))
where j is the DCT spectral line.
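Steps (2) through (5) can be sketched as follows; the triangular filterbank construction and the choice of 13 cepstral coefficients are common defaults assumed here, not values fixed by the invention:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_frames(frames, fs=16000, n_fft=512, n_mels=24, n_mfcc=13):
    """MFCC extraction following steps (2)-(5) above (illustrative sketch)."""
    # (2) FFT of each windowed frame and (3) spectral-line energy
    spec = np.fft.rfft(frames, n_fft)
    energy = np.abs(spec) ** 2
    # (4) triangular Mel filterbank energies
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(mel(0), mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * imel(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        f0, f1, f2 = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, f0:f1] = (np.arange(f0, f1) - f0) / max(f1 - f0, 1)
        fbank[m - 1, f1:f2] = (f2 - np.arange(f1, f2)) / max(f2 - f1, 1)
    filt_energy = energy @ fbank.T
    # (5) log of the filterbank energies followed by a DCT gives the MFCCs
    return dct(np.log(filt_energy + 1e-10), type=2, axis=1, norm='ortho')[:, :n_mfcc]
```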
5. Training based on the DNN-CLSTM network model:
The spectral-subtraction speech signal amplitude and the MFCC of the noisy speech are input into the DNN-CLSTM network for training to obtain an enhanced estimated amplitude and MFCC; the minimum mean square errors between the estimated amplitude and the clean amplitude and between the estimated MFCC and the clean MFCC are computed, and the resulting errors are fed back into the neural network as adjustment signals to optimize it, yielding the trained network.
And (3) a testing stage:
1. carrying out windowing and framing processing on the noise-containing voice signal, and obtaining the amplitude and the phase of the noise-containing voice signal by using discrete Fourier transform as first characteristics;
2. The noisy speech signal is framed and windowed to obtain the speech signal amplitude after the spectral subtraction stage, i.e., the spectral-subtraction speech signal amplitude, as a second feature;
3. performing framing and windowing on the noisy speech signal to obtain MFCC as a third characteristic;
4. Time-frequency decomposition is performed on the noisy speech signal and features are extracted to obtain E_MRACC(j,m) and ΔE_MRACC(j,m) as a fourth feature input to the deep neural network.
5. The amplitude of the noisy speech, the spectral-subtraction speech signal amplitude, the MFCC, and the extracted E_MRACC(j,m) and ΔE_MRACC(j,m) features are input into the trained DNN-CLSTM network to obtain an enhanced estimated amplitude, the MFCC, and a masking threshold;
6. The amplitude of the enhanced speech signal is combined with the phase of the noisy speech signal obtained in step 1, and an inverse Fourier transform is applied to obtain the final enhanced speech signal. Specifically, the preliminary enhanced speech amplitude obtained from the trained neural network is combined with the noisy-speech phase φ_y(f) extracted in step 1 and then converted back to a time-domain signal with the inverse Fourier transform, yielding the final enhanced speech signal. The enhanced MFCC and the masking threshold obtained here do not participate in waveform recovery; they are used only to optimize the network during processing.
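A minimal overlap-add sketch of this reconstruction step, assuming the enhanced magnitude and the noisy phase are available frame by frame with the frame sizes used earlier (the concrete sizes are assumptions of the sketch):

```python
import numpy as np

def reconstruct_waveform(enhanced_mag, noisy_phase, frame_len=320, hop=160):
    """Combine the enhanced magnitude with the noisy phase from step 1 and
    apply the inverse FFT with overlap-add (illustrative sketch)."""
    spectrum = enhanced_mag * np.exp(1j * noisy_phase)   # |X_hat(f)| * e^{j*phi_y(f)}
    frames = np.fft.irfft(spectrum, frame_len, axis=1)   # back to time-domain frames
    out = np.zeros(hop * (frames.shape[0] - 1) + frame_len)
    for i, frame in enumerate(frames):                   # overlap-add the frames
        out[i * hop: i * hop + frame_len] += frame
    return out
```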
7. Network establishment
The DNN-CLSTM network comprises a deep neural network (DNN), a convolutional neural network (CNN), a residual network, and a bidirectional long short-term memory (BiLSTM) network; the specific process of building the DNN-CLSTM neural network is shown in fig. 5:
(1) DNN network establishment
Input layer: the spectral-subtraction speech signal amplitude obtained after spectral subtraction is used as input; the input layer has 128 neuron nodes;
Fully connected layer: 32 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 128 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 512 nodes, dropout rate 0.5, ReLU activation;
(2) multi-target feature fusion:
combining the enhanced amplitude and MFCC features of the DNN network with the amplitude and MFCC features of the original noisy speech;
the combined feature is the concatenation [ŷ_k^mfcc, ŷ_k^amp, y_k^mfcc, y_k^amp],
where ŷ_k^mfcc and ŷ_k^amp denote, respectively, the MFCC features and the speech amplitude predicted by the DNN in the k-th spatial domain, and y_k^mfcc and y_k^amp denote, respectively, the MFCC features and the speech amplitude of the original noisy speech in the k-th spatial domain;
(a) CNN:
Convolutional layer: the result obtained from the DNN is convolved; the number of nodes is set to 64, the stride to 1, the convolution kernel to 5 × 1, and the activation function to SELU;
BN layer: the data are normalized;
Convolutional layer: the number of nodes is set to 64, the stride to 1, the convolution kernel to 3 × 1, and the activation function to SELU;
BN layer: the data are normalized;
Convolutional layer: the number of nodes is set to 128, the stride to 1, and the convolution kernel to 5 × 1;
(b) Residual network:
the result obtained from the DNN is convolved; the number of nodes is set to 128, the stride to 1, and the convolution kernel to 5 × 1;
the data obtained by the residual network are combined with the data obtained by the CNN, and a SELU activation function is then applied;
Max pooling layer: the stride is set to 1 and the pooling size to 2;
(3) LSTM network:
the bidirectional nodes of the long short-term memory network are all set to 128, and the activation function is the Sigmoid function;
(4) Output layer:
two feedforward neural networks are used as output layers to output the enhanced speech signal amplitude and MFCC; the network parameters of the model are optimized with the Adam optimizer; all convolutional layers use edge padding.
(5) Calculating the minimum mean square error objective function:
E = (1/T) · Σ_{k=1}^{T} || ŷ_k - y_k ||²
where T = 2 is the number of acoustic feature spaces, ŷ_1 and ŷ_2 denote the predicted MFCC feature vector and the predicted amplitude feature in their respective acoustic feature spaces, and y_1 and y_2 denote the corresponding clean MFCC feature vector and clean amplitude feature.
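One possible realization of the network described above, written with the Keras functional API, is sketched below; the layer sizes follow the description, while the per-frame feature dimension, the 24-dimensional MFCC output, the sigmoid activation inside the LSTM cells, and the exact tensors fed to each branch are assumptions of this sketch rather than details fixed by the patent:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dnn_clstm(time_steps=None, feat_dim=128):
    # Input: spectral-subtraction amplitude and MFCC features, 128-dimensional per frame
    inp = layers.Input(shape=(time_steps, feat_dim))

    # (1) DNN part: fully connected 32 -> 128 -> 512, dropout 0.5, ReLU
    x = inp
    for units in (32, 128, 512):
        x = layers.Dense(units, activation='relu')(x)
        x = layers.Dropout(0.5)(x)

    # (2) multi-target feature fusion: DNN output concatenated with the noisy features
    fused = layers.Concatenate()([x, inp])

    # (3a) CNN branch: Conv(64, k=5) -> BN -> Conv(64, k=3) -> BN -> Conv(128, k=5)
    c = layers.Conv1D(64, 5, strides=1, padding='same', activation='selu')(fused)
    c = layers.BatchNormalization()(c)
    c = layers.Conv1D(64, 3, strides=1, padding='same', activation='selu')(c)
    c = layers.BatchNormalization()(c)
    c = layers.Conv1D(128, 5, strides=1, padding='same')(c)

    # (3b) residual branch: Conv(128, k=5), added to the CNN output, then SELU
    r = layers.Conv1D(128, 5, strides=1, padding='same')(fused)
    x = layers.Activation('selu')(layers.Add()([c, r]))
    x = layers.MaxPooling1D(pool_size=2, strides=1, padding='same')(x)

    # (3c) bidirectional LSTM, 128 units per direction
    x = layers.Bidirectional(layers.LSTM(128, activation='sigmoid',
                                         return_sequences=True))(x)

    # (4) two feed-forward output heads: enhanced amplitude and enhanced MFCC
    out_amp = layers.Dense(feat_dim, name='amplitude')(x)
    out_mfcc = layers.Dense(24, name='mfcc')(x)   # 24-dim MFCC output is an assumption

    model = Model(inp, [out_amp, out_mfcc])
    # (5) minimum-mean-square-error objective on both targets, Adam optimizer
    model.compile(optimizer='adam', loss='mse')
    return model
```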
[ Experimental examples ]
The speech data used in the experiments are from the TIMIT dataset, and the noise data are from the Nonspeech noise library and the Noise-15 noise library. The TIMIT data contain 6300 utterances in total; about 80% of the speech was used as the training set and the other 20% as test speech. All speech was resampled to 16 kHz. Several typical neural-network speech enhancement models were selected for comparison with the proposed method: (a) a DNN-based, (b) a CNN-based, (c) an LSTM-based, and (d) a GRU-based speech enhancement algorithm.
All models were trained at SNRs of -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB, and performance was evaluated under matched signal-to-noise ratios. To test the robustness of the speech enhancement model, performance was also evaluated under mismatched signal-to-noise conditions. PESQ and LSD are two important indexes for evaluating speech: PESQ is the perceptual evaluation of speech quality index, and a higher PESQ score indicates better speech quality; LSD is the log-spectral distance index, and a lower LSD score indicates better speech quality. Table 1 shows the test results compared with the other four algorithms (DNN, CNN, LSTM, GRU) under matched noise conditions, with the best-performing results shown in bold. Table 2 shows the corresponding results under mismatched noise conditions, with the best-performing results shown in bold.
TABLE 1 test results under matched noise conditions, best performing ones are in bold
TABLE 2 test results under mismatched noise conditions, best performing ones are in bold
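For reference, the two evaluation indexes can be computed as sketched below; the `pesq` package is a third-party ITU-T P.862 implementation, and the LSD formula used here is the common log-spectral-distance definition, assumed rather than quoted from the patent:

```python
import numpy as np
from pesq import pesq   # third-party ITU-T P.862 implementation (pip install pesq)

def evaluate(clean, enhanced, fs=16000, n_fft=512, hop=160):
    """PESQ (higher is better) and log-spectral distance (lower is better).
    Assumes clean and enhanced are equal-length 16 kHz waveforms."""
    pesq_score = pesq(fs, clean, enhanced, 'wb')          # wideband PESQ
    # Log-spectral distance between clean and enhanced log power spectra
    frames = 1 + (len(clean) - n_fft) // hop
    def log_spectra(x):
        segs = np.stack([x[i * hop:i * hop + n_fft] * np.hanning(n_fft)
                         for i in range(frames)])
        return np.log10(np.abs(np.fft.rfft(segs, axis=1)) ** 2 + 1e-10)
    lsd = np.mean(np.sqrt(np.mean(
        (log_spectra(clean) - log_spectra(enhanced)) ** 2, axis=1)))
    return pesq_score, lsd
```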

Claims (3)

1. A speech enhancement method based on DNN-CLSTM network is characterized by comprising the following steps:
Step one: acquiring a noisy speech signal; the noisy speech signal is formed by adding a clean speech signal and a noise signal:
y(m)=x(m)+n(m)
wherein y (m) is a noisy speech signal, x (m) is a clean speech signal, n (m) is a noise signal, and m is a discrete time sequence;
step two: performing framing and windowing to obtain the amplitude and phase of the clean speech signal and of the noisy speech signal;
the noisy speech signal is windowed and framed, and its amplitude and phase are obtained using the discrete Fourier transform; the first five frames of the speech segment are used as the noise estimate to calculate the amplitude of the noise signal;
step three: subtracting the noise signal amplitude from the amplitude of the noise-containing voice signal to obtain a spectrum-subtracted voice signal amplitude as a first characteristic;
step four: obtaining the MFCC of the voice signal as a second characteristic;
step five: establishing a DNN-CLSTM network model for training;
inputting the spectral-subtraction speech signal amplitude and the MFCC of the noisy speech into the DNN-CLSTM network for training to obtain a predicted amplitude and MFCC; computing the minimum mean square errors (MMSE) between the predicted amplitude and the clean amplitude and between the predicted MFCC and the clean MFCC, and feeding the resulting errors back into the neural network as adjustment signals to optimize it, thereby obtaining the trained network.
2. The speech enhancement method based on the DNN-CLSTM network of claim 1, characterized in that the concrete process of said step four is:
(1) pretreatment: the preprocessing comprises pre-emphasis, framing and windowing functions;
pre-emphasis processing: the method is realized by a first-order high-pass filter, and the transfer function of the filter is as follows:
H(z) = 1 - a·z^(-1)
where a is the pre-emphasis coefficient, generally taken as 0.98; the result of pre-emphasis applied to the speech signal x(n) is:
y(n) = x(n) - a·x(n-1)
framing and windowing: the overlapping portion between two adjacent frames is the frame shift, set to 10 ms; windowing function: each frame of the speech signal is multiplied by a Hamming window. After framing and windowing, y(n) yields the i-th frame signal y_i(n), defined as:
y_i(n) = ω(n) · y((i-1)·inc + n),  0 ≤ n ≤ L-1
where ω(n) is the Hamming window, whose expression is
ω(n) = 0.54 - 0.46·cos(2πn/(L-1)),  0 ≤ n ≤ L-1
where y_i(n) represents the i-th frame of the speech signal, n the sample index, L the frame length, and inc the frame shift in samples;
(2) fast Fourier Transform (FFT)
a fast Fourier transform is performed on each frame of the speech signal y_i(n) to obtain the spectrum of each frame, expressed as follows:
Y(i,k) = FFT[y_i(n)]
wherein k represents the kth spectral line in the frequency domain;
(3) calculating spectral line energy
the energy E(i,k) of the spectral lines of each speech frame in the frequency domain is expressed as:
E(i,k) = |Y(i,k)|^2
(4) calculating the energy passed through the Mel Filter
the energy S(i,m) of each frame's spectral-line energy after passing through the Mel filterbank is defined as:
S(i,m) = Σ_{k=0}^{N-1} E(i,k)·H_m(k),  0 ≤ m < M
where N represents the number of FFT points and M is the number of filters;
the transfer function H_m(k) of each filter is
H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1);
where f(m) is the center frequency of the m-th filter and m indexes the filters;
(5) computing MFCC
the logarithm of the Mel filterbank energies is taken and a discrete cosine transform is then computed to obtain the MFCC feature parameters, as shown in the following formula:
mfcc(i,j) = sqrt(2/M) · Σ_{m=1}^{M} log[S(i,m)] · cos(π·j·(2m-1)/(2M))
where j is the spectral line after the discrete cosine transform (DCT).
3. The speech enhancement method based on the DNN-CLSTM network as claimed in claim 1, characterized in that the concrete procedure of said step five is:
(1) DNN network establishment
Input layer: the spectrally subtracted speech amplitude and MFCC features are input into the DNN; the input layer has 128 neuron nodes;
Fully connected layer: 32 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 128 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 512 nodes, dropout rate 0.5, ReLU activation;
(2) Multi-target feature fusion:
combining the enhanced amplitude and MFCC features of the DNN network with the amplitude and MFCC features of the original noisy speech;
the combined feature is the concatenation [ŷ_k^mfcc, ŷ_k^amp, y_k^mfcc, y_k^amp],
where ŷ_k^mfcc and ŷ_k^amp denote, respectively, the MFCC features and the speech amplitude predicted by the DNN in the k-th spatial domain, and y_k^mfcc and y_k^amp denote, respectively, the MFCC features and the speech amplitude of the original noisy speech in the k-th spatial domain;
(3) C-LSTM network:
(a) CNN:
Convolutional layer: the result obtained from the DNN is convolved; the number of nodes is set to 64, the stride to 1, the convolution kernel to 5 × 1, and the activation function to SELU;
BN layer: the data are normalized;
Convolutional layer: the number of nodes is set to 64, the stride to 1, the convolution kernel to 3 × 1, and the activation function to SELU;
BN layer: the data are normalized;
Convolutional layer: the number of nodes is set to 128, the stride to 1, and the convolution kernel to 5 × 1;
(b) Residual network:
the result obtained from the DNN is convolved; the number of nodes is set to 128, the stride to 1, and the convolution kernel to 5 × 1;
the data obtained by the residual network are combined with the data obtained by the CNN, and a SELU activation function is then applied;
Max pooling layer: the stride is set to 1 and the pooling size to 2;
(c) LSTM network:
the bidirectional nodes of the long short-term memory network are all set to 128, and the activation function is the Sigmoid function;
(4) Output layer:
two feedforward neural networks are used as output layers to output the predicted speech signal amplitude and MFCC; the network parameters of the model are optimized with the Adam optimizer; all convolutional layers use edge padding.
(5) Calculating the minimum mean square error objective function:
E = (1/T) · Σ_{k=1}^{T} || ŷ_k - y_k ||²
where T = 2 is the number of acoustic feature spaces, ŷ_1 and ŷ_2 denote the predicted MFCC feature vector and the predicted amplitude feature in their respective acoustic feature spaces, and y_1 and y_2 denote the corresponding clean MFCC feature vector and clean amplitude feature.
CN202011323987.2A 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network Active CN112735456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011323987.2A CN112735456B (en) 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011323987.2A CN112735456B (en) 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network

Publications (2)

Publication Number Publication Date
CN112735456A true CN112735456A (en) 2021-04-30
CN112735456B CN112735456B (en) 2024-01-16

Family

ID=75597716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011323987.2A Active CN112735456B (en) 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network

Country Status (1)

Country Link
CN (1) CN112735456B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020024452A1 (en) * 2018-07-31 2020-02-06 平安科技(深圳)有限公司 Deep learning-based answering method and apparatus, and readable storage medium
CN109410976A (en) * 2018-11-01 2019-03-01 北京工业大学 Sound enhancement method based on binaural sound sources positioning and deep learning in binaural hearing aid
CN110060704A (en) * 2019-03-26 2019-07-26 天津大学 A kind of sound enhancement method of improved multiple target criterion study
CN110930997A (en) * 2019-12-10 2020-03-27 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姚远; 王秋菊; 周伟; 鲍程毅; 彭磊: "Research on speech enhancement combining improved spectral subtraction with neural networks" (改进谱减法结合神经网络的语音增强研究), 电子测量技术 (Electronic Measurement Technology), no. 07 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113611323A (en) * 2021-05-07 2021-11-05 北京至芯开源科技有限责任公司 Voice enhancement method and system based on dual-channel convolution attention network
CN113611323B (en) * 2021-05-07 2024-02-20 北京至芯开源科技有限责任公司 Voice enhancement method and system based on double-channel convolution attention network
CN113269305A (en) * 2021-05-20 2021-08-17 郑州铁路职业技术学院 Feedback voice strengthening method for strengthening memory
CN113269305B (en) * 2021-05-20 2024-05-03 郑州铁路职业技术学院 Feedback voice strengthening method for strengthening memory
CN113314136A (en) * 2021-05-27 2021-08-27 西安电子科技大学 Voice optimization method based on directional noise reduction and dry sound extraction technology
CN113192520A (en) * 2021-07-01 2021-07-30 腾讯科技(深圳)有限公司 Audio information processing method and device, electronic equipment and storage medium
CN114283829A (en) * 2021-12-13 2022-04-05 电子科技大学 Voice enhancement method based on dynamic gate control convolution cyclic network
CN114283829B (en) * 2021-12-13 2023-06-16 电子科技大学 Voice enhancement method based on dynamic gating convolution circulation network
CN114093379A (en) * 2021-12-15 2022-02-25 荣耀终端有限公司 Noise elimination method and device
CN114093379B (en) * 2021-12-15 2022-06-21 北京荣耀终端有限公司 Noise elimination method and device
CN117193391A (en) * 2023-11-07 2023-12-08 北京铁力山科技股份有限公司 Intelligent control desk angle adjustment system
CN117193391B (en) * 2023-11-07 2024-01-23 北京铁力山科技股份有限公司 Intelligent control desk angle adjustment system

Also Published As

Publication number Publication date
CN112735456B (en) 2024-01-16

Similar Documents

Publication Publication Date Title
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
Ghanbari et al. A new approach for speech enhancement based on the adaptive thresholding of the wavelet packets
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
Zhao et al. Late reverberation suppression using recurrent neural networks with long short-term memory
Cui et al. Speech enhancement based on simple recurrent unit network
Strake et al. Separated noise suppression and speech restoration: LSTM-based speech enhancement in two stages
Tu et al. A hybrid approach to combining conventional and deep learning techniques for single-channel speech enhancement and recognition
Tabibian et al. Speech enhancement using a wavelet thresholding method based on symmetric Kullback–Leibler divergence
CN111192598A (en) Voice enhancement method for jump connection deep neural network
CN111899750B (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN110808057A (en) Voice enhancement method for generating confrontation network based on constraint naive
Martín-Doñas et al. Dual-channel DNN-based speech enhancement for smartphones
Nie et al. Deep Noise Tracking Network: A Hybrid Signal Processing/Deep Learning Approach to Speech Enhancement.
CN115497492A (en) Real-time voice enhancement method based on full convolution neural network
Chen Noise reduction of bird calls based on a combination of spectral subtraction, Wiener filtering, and Kalman filtering
CN114283835A (en) Voice enhancement and detection method suitable for actual communication condition
Ajay et al. Comparative study of deep learning techniques used for speech enhancement
CN113066483B (en) Sparse continuous constraint-based method for generating countermeasure network voice enhancement
Kim et al. iDeepMMSE: An improved deep learning approach to MMSE speech and noise power spectrum estimation for speech enhancement.
Huang et al. Teacher-Student Training Approach Using an Adaptive Gain Mask for LSTM-Based Speech Enhancement in the Airborne Noise Environment
CN108573698B (en) Voice noise reduction method based on gender fusion information
Jelčicová et al. PeakRNN and StatsRNN: Dynamic pruning in recurrent neural networks
Heitkaemper et al. Neural Network Based Carrier Frequency Offset Estimation From Speech Transmitted Over High Frequency Channels
Zhou Research on English speech enhancement algorithm based on improved spectral subtraction and deep neural network
CN114401168B (en) Voice enhancement method applicable to short wave Morse signal under complex strong noise environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant