CN112735456A - Speech enhancement method based on DNN-CLSTM network - Google Patents

Speech enhancement method based on DNN-CLSTM network

Info

Publication number
CN112735456A
CN112735456A (application number CN202011323987.2A)
Authority
CN
China
Prior art keywords
network
amplitude
speech
signal
mfcc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011323987.2A
Other languages
Chinese (zh)
Other versions
CN112735456B (en)
Inventor
汪友明
张天琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications
Priority to CN202011323987.2A
Publication of CN112735456A
Application granted
Publication of CN112735456B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention relates to a speech enhancement method based on a DNN-CLSTM network, which combines a deep neural network with a convolutional and residual long short-term memory network. The method inputs the speech amplitude feature obtained by spectral subtraction and the Mel-frequency cepstral coefficient (MFCC) feature obtained via the fast Fourier transform into a DNN-CLSTM network model to achieve speech enhancement. First, the noisy speech is subjected to time-frequency masking and windowed framing, its amplitude and phase features are obtained with the fast Fourier transform, and the noise amplitude of the noisy speech is estimated; the estimated noise amplitude is then subtracted from the noisy speech amplitude to obtain the spectrally subtracted speech amplitude as the first input feature of the neural network. Second, a fast Fourier transform (FFT) is applied to the noisy speech to obtain the spectral-line energy of the speech signal, from which the MFCC features of the noisy speech are computed as the second feature. The two features are fed into the DNN-CLSTM network for training to obtain a network model, and a minimum mean square error (MMSE) loss function is used as the index for evaluating the model's effectiveness. Finally, the actual noisy speech set is input into the trained speech enhancement network model to predict the enhanced estimated amplitude and MFCC, and the final enhanced speech signal is obtained with the inverse Fourier transform. The invention achieves high speech fidelity.

Description

Speech enhancement method based on DNN-CLSTM network
Technical Field
The invention belongs to the technical field of voice enhancement, and particularly relates to a voice enhancement method based on a DNN-CLSTM network.
Background
With the development of technology, speech not only serves for information transmission between people but is also used extensively in human-computer interaction. However, in everyday communication the speech signal is often accompanied by a large amount of noise, such as factory noise, car noise, or background noise such as restaurant noise. A speech signal containing a large amount of noise causes considerable interference when the receiver tries to extract the useful information it carries. In response to this problem, speech signal enhancement techniques have attracted much attention.
Speech enhancement refers to the process of separating noise from a speech signal when the real speech is disturbed by noise. Speech enhancement technology is widely used in many fields, such as mobile communication and speech recognition. Its main purpose is to improve speech quality and speech intelligibility. At present, speech enhancement methods fall mainly into three categories: spectral subtraction, subspace algorithms, and statistical-model-based algorithms. With the development of deep learning, neural networks have also been applied to the field of speech enhancement.
The spectral subtraction method shown in fig. 1 is one of the earliest denoising techniques in speech enhancement. It is based on the following principle: the noise is assumed to be additive, i.e., y(m) = x(m) + n(m), where y(m) is the noisy signal, x(m) is the clean speech signal, and n(m) is the additive noise; by subtracting an estimate of the noise spectrum from the noisy speech signal, a clean speech signal is obtained. A prerequisite of this assumption is that the noise is stationary, so that during speech segments where the target signal is absent the noise can be estimated and updated.
Spectral subtraction is a relatively simple speech enhancement algorithm. Its principle is to subtract an estimated noise amplitude spectrum from the amplitude spectrum of the input mixed speech signal and, exploiting the insensitivity of the human ear to phase, to synthesize the final spectrally subtracted speech signal by directly applying the phase information from before spectral subtraction. Since the method involves only one Fourier transform and one inverse Fourier transform, it is computationally inexpensive and easy to implement. In practice, however, many noises are non-stationary, so a speech signal enhanced with spectral subtraction often contains a large amount of musical noise, which distorts the speech signal and degrades its intelligibility and quality.
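For orientation, the classical pipeline can be summarized in the following sketch (a minimal NumPy/SciPy illustration of magnitude spectral subtraction that reuses the noisy phase for synthesis; the frame length, frame shift, and the number of leading noise-only frames are illustrative assumptions, not values prescribed by the invention):

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=5, frame_len=320, hop=160):
    """Basic magnitude spectral subtraction (illustrative sketch)."""
    # Short-time Fourier transform with a Hamming window
    f, t, Y = stft(noisy, fs, window='hamming',
                   nperseg=frame_len, noverlap=frame_len - hop)
    mag, phase = np.abs(Y), np.angle(Y)
    # Estimate the noise magnitude from the first few (assumed noise-only) frames
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract the noise estimate and half-wave rectify (no negative magnitudes)
    enhanced_mag = np.maximum(mag - noise_mag, 0.0)
    # Re-use the noisy phase and synthesize the enhanced waveform
    _, enhanced = istft(enhanced_mag * np.exp(1j * phase), fs, window='hamming',
                        nperseg=frame_len, noverlap=frame_len - hop)
    return enhanced
```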
Disclosure of Invention
The invention aims to solve the problems of speech signal distortion, poor signal intelligibility and speech quality and the like in the speech enhancement process based on spectral subtraction. In order to achieve the above object, the present invention provides a speech enhancement method based on DNN-CLSTM network, which is characterized by comprising the following steps:
Step one: acquiring at least two noisy speech signals, wherein each noisy speech signal is formed by adding a clean speech signal and a noise signal:
y(m)=x(m)+n(m)
wherein y (m) is a noisy speech signal containing noise, x (m) is a clean speech signal, n (m) is a noise signal, and m is a discrete time sequence;
step two: framing and windowing are performed, and the amplitude and the phase of the pure voice signal and the noisy voice signal are obtained as first characteristics: carrying out windowing and framing processing on the noise-containing voice signal, and obtaining the amplitude and the phase of the noise-containing voice signal by using discrete Fourier transform; meanwhile, in the voice signal section which does not contain the target signal and only contains noise, the noise is estimated, and the amplitude of the noise signal is calculated;
Step three: subtracting the noise signal amplitude from the amplitude of the noisy speech signal to obtain the spectral-subtraction speech signal amplitude as a second feature;
Step four: obtaining the MFCC as a third feature;
Step five: establishing a DNN-CLSTM network model for training;
The spectral-subtraction speech signal amplitude and the MFCC of the noisy speech are input into the DNN-CLSTM network for training to obtain an enhanced estimated amplitude and MFCC; the minimum mean square errors between the estimated amplitude and the clean amplitude and between the estimated MFCC and the clean MFCC are computed, and the resulting errors are fed back into the neural network as adjustment signals to optimize it, yielding the trained network.
The concrete process of the step four is as follows:
(1) pretreatment: the preprocessing comprises pre-emphasis, framing and windowing functions;
pre-emphasis processing: the method is realized by a first-order high-pass filter, and the transfer function of the filter is as follows:
H(z) = 1 - a·z^(-1)
wherein a is a pre-emphasis coefficient, and is generally 0.98;
the result of the pre-emphasis processing of the speech signal x (n) is:
y(n)=x(n)-ax(n-1)
framing and windowing: the overlapping portion between two adjacent frames is the frame shift, set to 10 ms; windowing function: each frame of the speech signal is multiplied by a Hamming window. After framing and windowing, y(n) yields the i-th frame signal y_i(n), defined as:
y_i(n) = ω(n) · y((i-1)·inc + n),  0 ≤ n ≤ L-1
where ω(n) is the Hamming window, whose expression is
ω(n) = 0.54 - 0.46·cos(2πn/(L-1)),  0 ≤ n ≤ L-1
where y_i(n) represents the i-th frame of the speech signal, n the sample index, L the frame length, and inc the frame shift in samples;
(2) fast Fourier Transform (FFT)
A fast Fourier transform is performed on each frame of the speech signal y_i(n) to obtain the spectrum of each frame, expressed as follows:
Y(i,k) = FFT[y_i(n)]
wherein k represents the kth spectral line in the frequency domain;
(3) calculating spectral line energy
The energy E(i,k) of the spectral lines of each speech frame in the frequency domain is expressed as:
E(i,k) = |Y(i,k)|^2
(4) calculating the energy passed through the Mel Filter
The energy S(i,m) of each frame's spectral-line energy after passing through the Mel filterbank is defined as:
S(i,m) = Σ_{k=0}^{N-1} E(i,k)·H_m(k),  0 ≤ m < M
where N represents the number of FFT points;
the transfer function H_m(k) of each filter is
H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1);
where f(m) is the center frequency of the m-th filter, m indexes the filters, and M is the number of filters;
(5) computing MFCC
The logarithm of the Mel filterbank energies is taken and a discrete cosine transform is then computed to obtain the MFCC feature parameters, as in the following formula:
mfcc(i,j) = sqrt(2/M) · Σ_{m=1}^{M} log[S(i,m)] · cos(π·j·(2m-1)/(2M))
where j is the DCT spectral line.
The concrete process of step five is as follows:
(1) DNN network establishment
Input layer: the spectrally subtracted speech amplitude and MFCC features are input into the DNN; the input layer has 128 neuron nodes;
Fully connected layer: 32 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 128 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 512 nodes, dropout rate 0.5, ReLU activation;
(2) multi-target feature fusion:
combining the enhanced amplitude and MFCC features of the DNN network with the amplitude and MFCC features of the original noisy speech;
the combined feature is the concatenation [ŷ_k^mfcc, ŷ_k^amp, y_k^mfcc, y_k^amp],
where ŷ_k^mfcc and ŷ_k^amp denote, respectively, the MFCC features and the speech amplitude predicted by the DNN in the k-th spatial domain, and y_k^mfcc and y_k^amp denote, respectively, the MFCC features and the speech amplitude of the original noisy speech in the k-th spatial domain;
(3) C-LSTM network:
(a) CNN:
Convolutional layer: the result obtained from the DNN is convolved; the number of nodes is set to 64, the stride to 1, the convolution kernel to 5 × 1, and the activation function to SELU;
BN layer: the data are normalized;
Convolutional layer: the number of nodes is set to 64, the stride to 1, the convolution kernel to 3 × 1, and the activation function to SELU;
BN layer: the data are normalized;
Convolutional layer: the number of nodes is set to 128, the stride to 1, and the convolution kernel to 5 × 1;
(b) Residual network:
the result obtained from the DNN is convolved; the number of nodes is set to 128, the stride to 1, and the convolution kernel to 5 × 1;
the data obtained by the residual network are combined with the data obtained by the CNN, and a SELU activation function is then applied;
Max pooling layer: the stride is set to 1 and the pooling size to 2;
(c) LSTM network:
the bidirectional nodes of the long short-term memory network are all set to 128, and the activation function is the Sigmoid function;
(4) Output layer:
two feedforward neural networks are used as output layers to output the predicted speech signal amplitude and MFCC; the network parameters of the model are optimized with the Adam optimizer; all convolutional layers use edge padding.
(5) Calculating the minimum mean square error objective function:
E = (1/T) · Σ_{k=1}^{T} || ŷ_k - y_k ||²
where T = 2 is the number of acoustic feature spaces, ŷ_1 and ŷ_2 denote the predicted MFCC feature vector and the predicted amplitude feature in their respective acoustic feature spaces, and y_1 and y_2 denote the corresponding clean MFCC feature vector and clean amplitude feature.
The invention has the following advantages: the enhanced voice signal is stable, and has high fidelity and good voice quality.
The present invention will be described in detail below with reference to the accompanying drawings and examples.
Drawings
Fig. 1 is a flow chart of a conventional spectral subtraction method.
Fig. 2 is a flow chart of a DNN-CLSTM network based speech enhancement method in the training phase.
Fig. 3 is a flow chart of a DNN-CLSTM network-based speech enhancement method in a test phase.
FIG. 4 is a MFCC derivation flow diagram.
FIG. 5 is a process architecture diagram for building a DNN-CLSTM neural network.
FIG. 6 is a spectrogram of clean speech.
FIG. 7 is a spectrogram of noisy speech.
FIG. 8 is a spectrogram after processing by the DNN speech enhancement method.
FIG. 9 is a spectrogram after processing by the CNN speech enhancement method.
FIG. 10 is a spectrogram after processing by the LSTM speech enhancement method.
FIG. 11 is a spectrogram after processing by the GRU speech enhancement method.
FIG. 12 is a spectrogram after processing by the DNN-CLSTM speech enhancement method.
Detailed Description
To overcome the defect that, after a speech signal is enhanced with spectral subtraction, the result often contains a large amount of musical noise, distorting the speech signal and degrading its intelligibility and quality, this embodiment provides a speech enhancement method based on a DNN-CLSTM network (as shown in figs. 2 and 3), which includes the following steps:
Two noisy speech signals are acquired (or more than two, selected according to actual needs); a noisy speech signal consists of a clean speech signal and a noise signal:
y(m)=x(m)+n(m)
where y (m) is a speech signal containing noise, x (m) is a clean speech signal, and n (m) is a noise signal.
A training stage:
1. framing and windowing
The amplitude and phase of the clean speech signal and of the noisy speech signal are obtained; the processing of the noisy speech is taken as an example:
carrying out windowing and framing processing on the noise-containing voice signal, and obtaining the amplitude and the phase of the noise-containing voice signal by using discrete Fourier transform as first characteristics;
The noisy speech signal is windowed and framed with a Hamming window:
y_w(m) = w(m)·y(m) = w(m)[x(m) + n(m)] = x_w(m) + n_w(m)
In the frequency domain the windowing operation is expressed as:
Y_w(f) = W(f) * Y(f) = X_w(f) + N_w(f)
the subscript w of the signal is omitted for simplicity, assuming that the signal is windowed.
Y_w(f) is expressed in polar form:
Y(f) = |Y(f)| · e^(j·φ_y(f))
where |Y(f)| is the amplitude spectrum and φ_y(f) = Phase[Y(f)] is the phase.
2. Noise estimation:
The noise is estimated in the speech signal segments that contain only noise and no target signal, and the noise amplitude is computed; the invention takes the first five frames of the speech signal as the noise segment. Since the noise magnitude spectrum |N(f)| is unknown, it is replaced by an estimate of the average magnitude spectrum obtained in the absence of speech activity, and the noise phase φ_n(f) is replaced by the phase φ_y(f) of the noisy speech signal. During the speech-absent, noise-only period the average noise magnitude spectrum |N̄(f)| is computed as:
|N̄(f)| = (1/k) · Σ_{i=1}^{k} |N_i(f)|
where |N_i(f)| is the magnitude spectrum of the i-th noise frame and k is the number of frames in the pure-noise period.
3. Spectral subtraction method
Subtracting the noise signal amplitude from the amplitude of the noisy speech signal gives the spectral-subtraction speech signal amplitude as a second feature (the enhanced speech signal obtained by spectral subtraction is referred to as the spectral-subtraction speech signal, and its amplitude as the spectral-subtraction speech signal amplitude, to distinguish it from the enhanced speech signal finally obtained in this implementation);
the amplitude of the noise signal is subtracted from the amplitude of the noise-containing speech signal, and the calculation is as follows:
Figure RE-RE-GDA0002977011930000084
wherein the content of the first and second substances,
Figure RE-RE-GDA0002977011930000085
representing the amplitude of the spectrally subtracted speech signal, | Y (f) luminancebRepresenting the amplitude of the noisy speech signal,
Figure RE-RE-GDA0002977011930000086
represents the average of the noise statistics representing the noise segment. α represents a spectral subtraction noise figure. b is a power exponent, and is a magnitude spectrum subtraction when the exponent b is 1, and is a power spectrum subtraction when the exponent b is 2.
Since the estimation of the noise signal may generate errors, the amplitude spectrum of the estimated signal is caused
Figure RE-RE-GDA0002977011930000087
A negative value may be possible. Generally the value of the amplitude spectrum should not be negative in order to avoid
Figure RE-RE-GDA0002977011930000091
And (3) carrying out half-wave rectification on the differential spectrum when the differential spectrum is a negative value, wherein the calculation process is as follows:
Figure RE-RE-GDA0002977011930000092
after spectral subtraction, the speech signal after spectral subtraction needs to be subjected to power reduction, so as to obtain the speech signal amplitude after the spectral subtraction stage, i.e. the amplitude of the spectral subtraction speech signal
Figure RE-RE-GDA0002977011930000093
The calculation process is as follows:
Figure RE-RE-GDA0002977011930000094
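A minimal sketch of this magnitude-domain computation is given below; the over-subtraction factor alpha is an assumed example value, and the array shapes are illustrative assumptions rather than details taken from the invention:

```python
import numpy as np

def spectral_subtract_magnitude(noisy_mag, noise_mag, alpha=2.0, b=1):
    """Spectral-subtraction amplitude feature per the formulas above (sketch).

    noisy_mag : |Y(f)| per frame, shape (frames, bins)
    noise_mag : average noise magnitude |N_bar(f)|, shape (bins,)
    alpha     : over-subtraction (noise) factor -- an assumed example value
    b         : 1 for magnitude spectral subtraction, 2 for power spectral subtraction
    """
    diff = noisy_mag ** b - alpha * noise_mag ** b   # |Y|^b - alpha * |N_bar|^b
    diff = np.maximum(diff, 0.0)                     # half-wave rectification
    return diff ** (1.0 / b)                         # power reduction back to a magnitude
```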
4. Extracting the MFCC as a third feature;
(1) pretreatment of
The preprocessing comprises pre-emphasis, framing, and windowing. The pre-emphasis is implemented by a first-order high-pass filter whose transfer function is:
H(z) = 1 - a·z^(-1)
where a is the pre-emphasis coefficient, generally taken as 0.98.
The result of the pre-emphasis processing of the speech signal x (n) is:
y(n)=x(n)-ax(n-1)
framing refers to the process of generating speech in which the speech signal is known to be a non-stationary time-varying signal. The speech signal in a short time can be considered as a steady time-invariant signal, the short time is usually 10-30 ms, and is taken as 20ms in this text. Therefore, a speech signal is usually analyzed and processed by a short-time analysis technique, many frames are used to analyze their characteristic parameters, and in order to make a smooth transition between frames, a portion where there is an overlap between two adjacent frames, i.e., a frame shift, is set to 10 ms. The purpose of the windowing function is to reduce the spectral leakage in the frequency domain, and each frame of speech signal is windowed, usually with a hamming window, which has a smaller spectral leakage than a rectangular window. y (n) is processed by frame division and window addition to obtain yi(n), which is defined as:
Figure RE-RE-GDA0002977011930000101
where ω (n) is a Hamming window whose expression is
Figure RE-RE-GDA0002977011930000102
Wherein, yi(n) represents the i-th frame speech signal, n represents the number of samples, and L represents the frame length.
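The pre-emphasis, framing, and Hamming windowing described above can be sketched as follows (320-sample frames and a 160-sample shift correspond to 20 ms and 10 ms at 16 kHz; these concrete numbers are assumptions of the sketch, not values fixed by the invention):

```python
import numpy as np

def preemphasis_frame_window(x, a=0.98, frame_len=320, hop=160):
    """Pre-emphasis, framing and Hamming windowing (illustrative sketch)."""
    # Pre-emphasis: y(n) = x(n) - a*x(n-1)
    y = np.append(x[0], x[1:] - a * x[:-1])
    # Split into overlapping frames (frame shift = hop samples)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop: i * hop + frame_len] for i in range(n_frames)])
    # Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(L-1))
    return frames * np.hamming(frame_len)
```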
(2) Fast Fourier Transform (FFT)
Since the characteristics of a speech signal are usually difficult to observe in the time domain, it is normally transformed to the frequency domain for analysis, where different spectra represent different characteristics of the signal. A fast Fourier transform is therefore applied to each frame of the speech signal y_i(n) to obtain the spectrum of each frame, as shown in the following formula:
Y(i,k) = FFT[y_i(n)]
where k denotes the kth spectral line in the frequency domain.
(3) Calculating spectral line energy
The energy E(i,k) of the spectral lines of each speech frame in the frequency domain can be expressed as:
E(i,k) = |Y(i,k)|^2
(4) calculating the energy passed through the Mel Filter
The energy S(i,m) of each frame's spectral-line energy after passing through the Mel filterbank can be defined as:
S(i,m) = Σ_{k=0}^{N-1} E(i,k)·H_m(k),  0 ≤ m < M
where N represents the number of FFT points.
The transfer function H_m(k) of each filter is
H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1);
where f(m) is the center frequency of the m-th filter, m indexes the filters, and M is the number of filters, usually set to 24.
(5) Calculating the MFCC: referring to FIG. 4, the logarithm of the Mel filterbank energies is taken and a discrete cosine transform (DCT) is then computed to obtain the MFCC feature parameters, as shown in the following formula:
mfcc(i,j) = sqrt(2/M) · Σ_{m=1}^{M} log[S(i,m)] · cos(π·j·(2m-1)/(2M))
where j is the DCT spectral line.
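Steps (2) through (5) can be sketched as follows; the triangular filterbank construction and the choice of 13 cepstral coefficients are common defaults assumed here, not values fixed by the invention:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_frames(frames, fs=16000, n_fft=512, n_mels=24, n_mfcc=13):
    """MFCC extraction following steps (2)-(5) above (illustrative sketch)."""
    # (2) FFT of each windowed frame and (3) spectral-line energy
    spec = np.fft.rfft(frames, n_fft)
    energy = np.abs(spec) ** 2
    # (4) triangular Mel filterbank energies
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(mel(0), mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * imel(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        f0, f1, f2 = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, f0:f1] = (np.arange(f0, f1) - f0) / max(f1 - f0, 1)
        fbank[m - 1, f1:f2] = (f2 - np.arange(f1, f2)) / max(f2 - f1, 1)
    filt_energy = energy @ fbank.T
    # (5) log of the filterbank energies followed by a DCT gives the MFCCs
    return dct(np.log(filt_energy + 1e-10), type=2, axis=1, norm='ortho')[:, :n_mfcc]
```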
5. Training based on the DNN-CLSTM network model:
The spectral-subtraction speech signal amplitude and the MFCC of the noisy speech are input into the DNN-CLSTM network for training to obtain an enhanced estimated amplitude and MFCC; the minimum mean square errors between the estimated amplitude and the clean amplitude and between the estimated MFCC and the clean MFCC are computed, and the resulting errors are fed back into the neural network as adjustment signals to optimize it, yielding the trained network.
And (3) a testing stage:
1. carrying out windowing and framing processing on the noise-containing voice signal, and obtaining the amplitude and the phase of the noise-containing voice signal by using discrete Fourier transform as first characteristics;
2. The noisy speech signal is framed and windowed to obtain the speech signal amplitude after the spectral subtraction stage, i.e., the spectral-subtraction speech signal amplitude, as a second feature;
3. performing framing and windowing on the noisy speech signal to obtain MFCC as a third characteristic;
4. Time-frequency decomposition is performed on the noisy speech signal and features are extracted to obtain E_MRACC(j,m) and ΔE_MRACC(j,m) as a fourth feature input to the deep neural network.
5. The amplitude of the noisy speech, the spectral-subtraction speech signal amplitude, the MFCC, and the extracted E_MRACC(j,m) and ΔE_MRACC(j,m) features are input into the trained DNN-CLSTM network to obtain an enhanced estimated amplitude, the MFCC, and a masking threshold;
6. The amplitude of the enhanced speech signal is combined with the phase of the noisy speech signal obtained in step 1, and an inverse Fourier transform is applied to obtain the final enhanced speech signal. Specifically, the preliminary enhanced speech amplitude obtained from the trained neural network is combined with the noisy-speech phase φ_y(f) extracted in step 1 and then converted back to a time-domain signal with the inverse Fourier transform, yielding the final enhanced speech signal. The enhanced MFCC and the masking threshold obtained here do not participate in waveform recovery; they are used only to optimize the network during processing.
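A minimal overlap-add sketch of this reconstruction step, assuming the enhanced magnitude and the noisy phase are available frame by frame with the frame sizes used earlier (the concrete sizes are assumptions of the sketch):

```python
import numpy as np

def reconstruct_waveform(enhanced_mag, noisy_phase, frame_len=320, hop=160):
    """Combine the enhanced magnitude with the noisy phase from step 1 and
    apply the inverse FFT with overlap-add (illustrative sketch)."""
    spectrum = enhanced_mag * np.exp(1j * noisy_phase)   # |X_hat(f)| * e^{j*phi_y(f)}
    frames = np.fft.irfft(spectrum, frame_len, axis=1)   # back to time-domain frames
    out = np.zeros(hop * (frames.shape[0] - 1) + frame_len)
    for i, frame in enumerate(frames):                   # overlap-add the frames
        out[i * hop: i * hop + frame_len] += frame
    return out
```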
7. Network establishment
The DNN-CLSTM network comprises a deep neural network (DNN), a convolutional neural network (CNN), a residual network, and a bidirectional long short-term memory (BiLSTM) network; the specific process of building the DNN-CLSTM neural network is shown in fig. 5:
(1) DNN network establishment
Input layer: the spectral-subtraction speech signal amplitude obtained after spectral subtraction is used as input; the input layer has 128 neuron nodes;
Fully connected layer: 32 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 128 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 512 nodes, dropout rate 0.5, ReLU activation;
(2) multi-target feature fusion:
combining the enhanced amplitude and MFCC features of the DNN network with the amplitude and MFCC features of the original noisy speech;
the combined feature is the concatenation [ŷ_k^mfcc, ŷ_k^amp, y_k^mfcc, y_k^amp],
where ŷ_k^mfcc and ŷ_k^amp denote, respectively, the MFCC features and the speech amplitude predicted by the DNN in the k-th spatial domain, and y_k^mfcc and y_k^amp denote, respectively, the MFCC features and the speech amplitude of the original noisy speech in the k-th spatial domain;
(a) CNN:
Convolutional layer: the result obtained from the DNN is convolved; the number of nodes is set to 64, the stride to 1, the convolution kernel to 5 × 1, and the activation function to SELU;
BN layer: the data are normalized;
Convolutional layer: the number of nodes is set to 64, the stride to 1, the convolution kernel to 3 × 1, and the activation function to SELU;
BN layer: the data are normalized;
Convolutional layer: the number of nodes is set to 128, the stride to 1, and the convolution kernel to 5 × 1;
(b) Residual network:
the result obtained from the DNN is convolved; the number of nodes is set to 128, the stride to 1, and the convolution kernel to 5 × 1;
the data obtained by the residual network are combined with the data obtained by the CNN, and a SELU activation function is then applied;
Max pooling layer: the stride is set to 1 and the pooling size to 2;
(3) LSTM network:
the bidirectional nodes of the long short-term memory network are all set to 128, and the activation function is the Sigmoid function;
(4) Output layer:
two feedforward neural networks are used as output layers to output the enhanced speech signal amplitude and MFCC; the network parameters of the model are optimized with the Adam optimizer; all convolutional layers use edge padding.
(5) Calculating the minimum mean square error objective function:
E = (1/T) · Σ_{k=1}^{T} || ŷ_k - y_k ||²
where T = 2 is the number of acoustic feature spaces, ŷ_1 and ŷ_2 denote the predicted MFCC feature vector and the predicted amplitude feature in their respective acoustic feature spaces, and y_1 and y_2 denote the corresponding clean MFCC feature vector and clean amplitude feature.
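One possible realization of the network described above, written with the Keras functional API, is sketched below; the layer sizes follow the description, while the per-frame feature dimension, the 24-dimensional MFCC output, the sigmoid activation inside the LSTM cells, and the exact tensors fed to each branch are assumptions of this sketch rather than details fixed by the patent:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dnn_clstm(time_steps=None, feat_dim=128):
    # Input: spectral-subtraction amplitude and MFCC features, 128-dimensional per frame
    inp = layers.Input(shape=(time_steps, feat_dim))

    # (1) DNN part: fully connected 32 -> 128 -> 512, dropout 0.5, ReLU
    x = inp
    for units in (32, 128, 512):
        x = layers.Dense(units, activation='relu')(x)
        x = layers.Dropout(0.5)(x)

    # (2) multi-target feature fusion: DNN output concatenated with the noisy features
    fused = layers.Concatenate()([x, inp])

    # (3a) CNN branch: Conv(64, k=5) -> BN -> Conv(64, k=3) -> BN -> Conv(128, k=5)
    c = layers.Conv1D(64, 5, strides=1, padding='same', activation='selu')(fused)
    c = layers.BatchNormalization()(c)
    c = layers.Conv1D(64, 3, strides=1, padding='same', activation='selu')(c)
    c = layers.BatchNormalization()(c)
    c = layers.Conv1D(128, 5, strides=1, padding='same')(c)

    # (3b) residual branch: Conv(128, k=5), added to the CNN output, then SELU
    r = layers.Conv1D(128, 5, strides=1, padding='same')(fused)
    x = layers.Activation('selu')(layers.Add()([c, r]))
    x = layers.MaxPooling1D(pool_size=2, strides=1, padding='same')(x)

    # (3c) bidirectional LSTM, 128 units per direction
    x = layers.Bidirectional(layers.LSTM(128, activation='sigmoid',
                                         return_sequences=True))(x)

    # (4) two feed-forward output heads: enhanced amplitude and enhanced MFCC
    out_amp = layers.Dense(feat_dim, name='amplitude')(x)
    out_mfcc = layers.Dense(24, name='mfcc')(x)   # 24-dim MFCC output is an assumption

    model = Model(inp, [out_amp, out_mfcc])
    # (5) minimum-mean-square-error objective on both targets, Adam optimizer
    model.compile(optimizer='adam', loss='mse')
    return model
```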
[ Experimental examples ]
The speech data used in the experiments are from the TIMIT dataset, and the noise data are from the Nonspeech noise library and the Noise-15 noise library. The TIMIT data contain 6300 utterances in total; about 80% of the speech was used as the training set and the other 20% as test speech. All speech was resampled to 16 kHz. Several typical neural-network speech enhancement models were selected for comparison with the proposed method: (a) a DNN-based, (b) a CNN-based, (c) an LSTM-based, and (d) a GRU-based speech enhancement algorithm.
All models were trained at SNRs of -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB, and performance was evaluated under matched signal-to-noise ratios. To test the robustness of the speech enhancement model, performance was also evaluated under mismatched signal-to-noise conditions. PESQ and LSD are two important indexes for evaluating speech: PESQ is the perceptual evaluation of speech quality index, and a higher PESQ score indicates better speech quality; LSD is the log-spectral distance index, and a lower LSD score indicates better speech quality. Table 1 shows the test results compared with the other four algorithms (DNN, CNN, LSTM, GRU) under matched noise conditions, with the best-performing results shown in bold. Table 2 shows the corresponding results under mismatched noise conditions, with the best-performing results shown in bold.
TABLE 1 test results under matched noise conditions, best performing ones are in bold
TABLE 2 test results under mismatched noise conditions, best performing ones are in bold
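For reference, the two evaluation indexes can be computed as sketched below; the `pesq` package is a third-party ITU-T P.862 implementation, and the LSD formula used here is the common log-spectral-distance definition, assumed rather than quoted from the patent:

```python
import numpy as np
from pesq import pesq   # third-party ITU-T P.862 implementation (pip install pesq)

def evaluate(clean, enhanced, fs=16000, n_fft=512, hop=160):
    """PESQ (higher is better) and log-spectral distance (lower is better).
    Assumes clean and enhanced are equal-length 16 kHz waveforms."""
    pesq_score = pesq(fs, clean, enhanced, 'wb')          # wideband PESQ
    # Log-spectral distance between clean and enhanced log power spectra
    frames = 1 + (len(clean) - n_fft) // hop
    def log_spectra(x):
        segs = np.stack([x[i * hop:i * hop + n_fft] * np.hanning(n_fft)
                         for i in range(frames)])
        return np.log10(np.abs(np.fft.rfft(segs, axis=1)) ** 2 + 1e-10)
    lsd = np.mean(np.sqrt(np.mean(
        (log_spectra(clean) - log_spectra(enhanced)) ** 2, axis=1)))
    return pesq_score, lsd
```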

Claims (3)

1. A speech enhancement method based on DNN-CLSTM network is characterized by comprising the following steps:
Step one: acquiring a noisy speech signal; the noisy speech signal is formed by adding a clean speech signal and a noise signal:
y(m)=x(m)+n(m)
wherein y (m) is a noisy speech signal, x (m) is a clean speech signal, n (m) is a noise signal, and m is a discrete time sequence;
step two: performing framing and windowing to obtain the amplitude and phase of the clean speech signal and of the noisy speech signal;
the noisy speech signal is windowed and framed, and its amplitude and phase are obtained using the discrete Fourier transform; the first five frames of the speech segment are used as the noise estimate to calculate the amplitude of the noise signal;
step three: subtracting the noise signal amplitude from the amplitude of the noise-containing voice signal to obtain a spectrum-subtracted voice signal amplitude as a first characteristic;
step four: obtaining the MFCC of the voice signal as a second characteristic;
step five: establishing a DNN-CLSTM network model for training;
inputting the spectral-subtraction speech signal amplitude and the MFCC of the noisy speech into the DNN-CLSTM network for training to obtain a predicted amplitude and MFCC; computing the minimum mean square errors (MMSE) between the predicted amplitude and the clean amplitude and between the predicted MFCC and the clean MFCC, and feeding the resulting errors back into the neural network as adjustment signals to optimize it, thereby obtaining the trained network.
2. The speech enhancement method based on the DNN-CLSTM network of claim 1, characterized in that the concrete process of said step four is:
(1) pretreatment: the preprocessing comprises pre-emphasis, framing and windowing functions;
pre-emphasis processing: the method is realized by a first-order high-pass filter, and the transfer function of the filter is as follows:
H(z) = 1 - a·z^(-1)
where a is the pre-emphasis coefficient, generally taken as 0.98; the result of pre-emphasis applied to the speech signal x(n) is:
y(n) = x(n) - a·x(n-1)
framing and windowing: the overlapping portion between two adjacent frames is the frame shift, set to 10 ms; windowing function: each frame of the speech signal is multiplied by a Hamming window. After framing and windowing, y(n) yields the i-th frame signal y_i(n), defined as:
y_i(n) = ω(n) · y((i-1)·inc + n),  0 ≤ n ≤ L-1
where ω(n) is the Hamming window, whose expression is
ω(n) = 0.54 - 0.46·cos(2πn/(L-1)),  0 ≤ n ≤ L-1
where y_i(n) represents the i-th frame of the speech signal, n the sample index, L the frame length, and inc the frame shift in samples;
(2) fast Fourier Transform (FFT)
a fast Fourier transform is performed on each frame of the speech signal y_i(n) to obtain the spectrum of each frame, expressed as follows:
Y(i,k) = FFT[y_i(n)]
wherein k represents the kth spectral line in the frequency domain;
(3) calculating spectral line energy
the energy E(i,k) of the spectral lines of each speech frame in the frequency domain is expressed as:
E(i,k) = |Y(i,k)|^2
(4) calculating the energy passed through the Mel Filter
the energy S(i,m) of each frame's spectral-line energy after passing through the Mel filterbank is defined as:
S(i,m) = Σ_{k=0}^{N-1} E(i,k)·H_m(k),  0 ≤ m < M
where N represents the number of FFT points and M is the number of filters;
the transfer function H_m(k) of each filter is
H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1);
where f(m) is the center frequency of the m-th filter and m indexes the filters;
(5) computing MFCC
the logarithm of the Mel filterbank energies is taken and a discrete cosine transform is then computed to obtain the MFCC feature parameters, as shown in the following formula:
mfcc(i,j) = sqrt(2/M) · Σ_{m=1}^{M} log[S(i,m)] · cos(π·j·(2m-1)/(2M))
where j is the spectral line after the discrete cosine transform (DCT).
3. The speech enhancement method based on the DNN-CLSTM network as claimed in claim 1, characterized in that the concrete procedure of said step five is:
(1) DNN network establishment
Input layer: the spectrally subtracted speech amplitude and MFCC features are input into the DNN; the input layer has 128 neuron nodes;
Fully connected layer: 32 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 128 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 512 nodes, dropout rate 0.5, ReLU activation;
(2) Multi-target feature fusion:
combining the enhanced amplitude and MFCC features of the DNN network with the amplitude and MFCC features of the original noisy speech;
the combined feature is the concatenation [ŷ_k^mfcc, ŷ_k^amp, y_k^mfcc, y_k^amp],
where ŷ_k^mfcc and ŷ_k^amp denote, respectively, the MFCC features and the speech amplitude predicted by the DNN in the k-th spatial domain, and y_k^mfcc and y_k^amp denote, respectively, the MFCC features and the speech amplitude of the original noisy speech in the k-th spatial domain;
(3) C-LSTM network:
(a) CNN:
Convolutional layer: the result obtained from the DNN is convolved; the number of nodes is set to 64, the stride to 1, the convolution kernel to 5 × 1, and the activation function to SELU;
BN layer: the data are normalized;
Convolutional layer: the number of nodes is set to 64, the stride to 1, the convolution kernel to 3 × 1, and the activation function to SELU;
BN layer: the data are normalized;
Convolutional layer: the number of nodes is set to 128, the stride to 1, and the convolution kernel to 5 × 1;
(b) Residual network:
the result obtained from the DNN is convolved; the number of nodes is set to 128, the stride to 1, and the convolution kernel to 5 × 1;
the data obtained by the residual network are combined with the data obtained by the CNN, and a SELU activation function is then applied;
Max pooling layer: the stride is set to 1 and the pooling size to 2;
(c) LSTM network:
the bidirectional nodes of the long short-term memory network are all set to 128, and the activation function is the Sigmoid function;
(4) Output layer:
two feedforward neural networks are used as output layers to output the predicted speech signal amplitude and MFCC; the network parameters of the model are optimized with the Adam optimizer; all convolutional layers use edge padding.
(5) Calculating the minimum mean square error objective function:
E = (1/T) · Σ_{k=1}^{T} || ŷ_k - y_k ||²
where T = 2 is the number of acoustic feature spaces, ŷ_1 and ŷ_2 denote the predicted MFCC feature vector and the predicted amplitude feature in their respective acoustic feature spaces, and y_1 and y_2 denote the corresponding clean MFCC feature vector and clean amplitude feature.
CN202011323987.2A 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network Active CN112735456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011323987.2A CN112735456B (en) 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011323987.2A CN112735456B (en) 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network

Publications (2)

Publication Number Publication Date
CN112735456A true CN112735456A (en) 2021-04-30
CN112735456B CN112735456B (en) 2024-01-16

Family

ID=75597716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011323987.2A Active CN112735456B (en) 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network

Country Status (1)

Country Link
CN (1) CN112735456B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020024452A1 (en) * 2018-07-31 2020-02-06 平安科技(深圳)有限公司 Deep learning-based answering method and apparatus, and readable storage medium
CN109410976A (en) * 2018-11-01 2019-03-01 北京工业大学 Sound enhancement method based on binaural sound sources positioning and deep learning in binaural hearing aid
CN110060704A (en) * 2019-03-26 2019-07-26 天津大学 A kind of sound enhancement method of improved multiple target criterion study
CN110930997A (en) * 2019-12-10 2020-03-27 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姚远; 王秋菊; 周伟; 鲍程毅; 彭磊: "Research on speech enhancement combining improved spectral subtraction with neural networks" (改进谱减法结合神经网络的语音增强研究), 电子测量技术 (Electronic Measurement Technology), no. 07 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113611323A (en) * 2021-05-07 2021-11-05 北京至芯开源科技有限责任公司 Voice enhancement method and system based on dual-channel convolution attention network
CN113611323B (en) * 2021-05-07 2024-02-20 北京至芯开源科技有限责任公司 Voice enhancement method and system based on double-channel convolution attention network
CN113269305A (en) * 2021-05-20 2021-08-17 郑州铁路职业技术学院 Feedback voice strengthening method for strengthening memory
CN113269305B (en) * 2021-05-20 2024-05-03 郑州铁路职业技术学院 Feedback voice strengthening method for strengthening memory
CN113314136A (en) * 2021-05-27 2021-08-27 西安电子科技大学 Voice optimization method based on directional noise reduction and dry sound extraction technology
CN113192520A (en) * 2021-07-01 2021-07-30 腾讯科技(深圳)有限公司 Audio information processing method and device, electronic equipment and storage medium
CN114283829A (en) * 2021-12-13 2022-04-05 电子科技大学 Voice enhancement method based on dynamic gate control convolution cyclic network
CN114283829B (en) * 2021-12-13 2023-06-16 电子科技大学 Voice enhancement method based on dynamic gating convolution circulation network
CN114093379A (en) * 2021-12-15 2022-02-25 荣耀终端有限公司 Noise elimination method and device
CN114093379B (en) * 2021-12-15 2022-06-21 北京荣耀终端有限公司 Noise elimination method and device
CN117193391A (en) * 2023-11-07 2023-12-08 北京铁力山科技股份有限公司 Intelligent control desk angle adjustment system
CN117193391B (en) * 2023-11-07 2024-01-23 北京铁力山科技股份有限公司 Intelligent control desk angle adjustment system

Also Published As

Publication number Publication date
CN112735456B (en) 2024-01-16

Similar Documents

Publication Publication Date Title
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
Ghanbari et al. A new approach for speech enhancement based on the adaptive thresholding of the wavelet packets
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
Zhao et al. Late reverberation suppression using recurrent neural networks with long short-term memory
Cui et al. Speech enhancement based on simple recurrent unit network
Strake et al. Separated noise suppression and speech restoration: LSTM-based speech enhancement in two stages
Tu et al. A hybrid approach to combining conventional and deep learning techniques for single-channel speech enhancement and recognition
Tabibian et al. Speech enhancement using a wavelet thresholding method based on symmetric Kullback–Leibler divergence
CN111192598A (en) Voice enhancement method for jump connection deep neural network
CN111899750B (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN110808057A (en) Voice enhancement method for generating confrontation network based on constraint naive
Martín-Doñas et al. Dual-channel DNN-based speech enhancement for smartphones
Nie et al. Deep Noise Tracking Network: A Hybrid Signal Processing/Deep Learning Approach to Speech Enhancement.
CN115497492A (en) Real-time voice enhancement method based on full convolution neural network
Chen Noise reduction of bird calls based on a combination of spectral subtraction, Wiener filtering, and Kalman filtering
CN114283835A (en) Voice enhancement and detection method suitable for actual communication condition
Ajay et al. Comparative study of deep learning techniques used for speech enhancement
CN113066483B (en) Sparse continuous constraint-based method for generating countermeasure network voice enhancement
Kim et al. iDeepMMSE: An improved deep learning approach to MMSE speech and noise power spectrum estimation for speech enhancement.
Huang et al. Teacher-Student Training Approach Using an Adaptive Gain Mask for LSTM-Based Speech Enhancement in the Airborne Noise Environment
CN108573698B (en) Voice noise reduction method based on gender fusion information
Jelčicová et al. PeakRNN and StatsRNN: Dynamic pruning in recurrent neural networks
Heitkaemper et al. Neural Network Based Carrier Frequency Offset Estimation From Speech Transmitted Over High Frequency Channels
Zhou Research on English speech enhancement algorithm based on improved spectral subtraction and deep neural network
CN114401168B (en) Voice enhancement method applicable to short wave Morse signal under complex strong noise environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant