CN112735456A - Speech enhancement method based on DNN-CLSTM network - Google Patents
- Publication number: CN112735456A (application CN202011323987.2A)
- Authority: CN (China)
- Prior art keywords: network, amplitude, speech, signal, MFCC
- Legal status: Granted
Classifications
- G10L21/0208 — Speech enhancement, e.g. noise reduction or echo cancellation; noise filtering
- G10L21/0216 — Noise filtering characterised by the method used for estimating noise
- G10L25/24 — Speech or voice analysis characterised by the extracted parameters being the cepstrum
- G10L25/30 — Speech or voice analysis characterised by the analysis technique using neural networks
Abstract
The invention relates to a speech enhancement method based on a deep neural network combined with a convolutional long short-term memory network (DNN-CLSTM). The method inputs the speech amplitude feature obtained by spectral subtraction and the Mel-frequency cepstral coefficient (MFCC) feature obtained via the fast Fourier transform into a DNN-CLSTM network model to achieve speech enhancement. First, the noisy speech is subjected to time-frequency masking and windowed framing; the fast Fourier transform is used to obtain the amplitude and phase of the noisy speech, and the noise amplitude is estimated. The estimated noise amplitude is then subtracted from the noisy-speech amplitude to obtain the spectrally subtracted speech amplitude, which serves as the first input feature of the neural network. Second, a fast Fourier transform (FFT) is applied to the noisy speech to obtain the spectral line energy of the speech signal, from which the MFCC features of the noisy speech are derived as the second feature. The two features are input into the DNN-CLSTM network for training to obtain the network model, and a minimum mean square error (MMSE) loss function is adopted to evaluate the effectiveness of the model. Finally, the actual noisy speech set is input into the trained speech enhancement network model, the enhanced estimated amplitude and MFCC are predicted, and the final enhanced speech signal is obtained using the inverse Fourier transform. The enhanced speech produced by the invention has high fidelity.
Description
Technical Field
The invention belongs to the technical field of speech enhancement, and particularly relates to a speech enhancement method based on a DNN-CLSTM network.
Background
With the development of technology, speech not only carries information between people; speech signals are also used extensively in human-computer interaction. However, during communication the speech signal is often accompanied by a large amount of noise, such as factory noise, car noise, or background noise such as restaurant noise. A speech signal containing a large amount of noise causes substantial interference when the receiver extracts the useful information it carries. In response to this problem, speech signal enhancement techniques have received much attention.
Speech enhancement refers to the process of separating noise from a speech signal when real speech is corrupted by noise. Speech enhancement technology is widely used in many fields, such as mobile communication and speech recognition. Its main purpose is to improve speech quality and speech intelligibility. At present, speech enhancement methods fall mainly into three types: spectral subtraction, subspace algorithms, and statistical-model-based algorithms. With the development of deep learning, neural networks have also been applied to the field of speech enhancement.
The spectral subtraction method shown in fig. 1 is one of the earliest denoising techniques in speech enhancement. Spectral subtraction denoising is based on the following principle: the noise is assumed to be additive, i.e., y(m) = x(m) + n(m), where y(m) is the noisy signal, x(m) is the clean speech signal, and n(m) is additive noise; a clean speech signal is obtained by subtracting an estimate of the noise spectrum from the noisy speech signal. A prerequisite for this assumption is that the noise is stationary, so that the noise can be estimated and updated during speech segments in which the target signal is absent.
Spectral subtraction is a relatively simple speech enhancement algorithm. Its principle is to subtract an estimated noise amplitude-spectrum value from the amplitude-spectrum value of the input mixed speech signal and, exploiting the insensitivity of the human ear to phase, to synthesize the final spectrally subtracted speech signal by applying the phase information from before the subtraction directly to the subtracted spectrum. Since the method involves only one Fourier transform and one inverse Fourier transform, it is computationally inexpensive and easy to implement. In practice, however, many noises are non-stationary, so after a speech signal is enhanced by spectral subtraction the result often contains a large amount of musical noise; this distorts the speech signal, and the intelligibility and quality of the signal suffer.
Disclosure of Invention
The invention aims to solve the problems of speech signal distortion and poor intelligibility and speech quality that arise in spectral-subtraction-based speech enhancement. To this end, the invention provides a speech enhancement method based on a DNN-CLSTM network, characterized by comprising the following steps:
Step one: acquiring at least two noisy speech signals, each formed by adding a clean speech signal and a noise signal:
y(m)=x(m)+n(m)
wherein y(m) is the noisy speech signal, x(m) is the clean speech signal, n(m) is the noise signal, and m is the discrete time index;
Step two: performing framing and windowing, and obtaining the amplitude and phase of the clean speech signal and the noisy speech signal as the first feature: the noisy speech signal is windowed and framed, and its amplitude and phase are obtained using the discrete Fourier transform; meanwhile, in the speech segments that contain only noise and no target signal, the noise is estimated and the amplitude of the noise signal is calculated;
Step three: subtracting the noise signal amplitude from the amplitude of the noisy speech signal to obtain the spectrally subtracted speech signal amplitude as the second feature;
Step four: obtaining the MFCC as the third feature;
Step five: establishing a DNN-CLSTM network model for training;
inputting the two features, namely the spectrally subtracted speech signal amplitude and the MFCC of the noisy speech, into the DNN-CLSTM network for training to obtain an enhanced estimated amplitude and MFCC; computing the minimum mean square error between the estimated amplitude and the clean amplitude and between the estimated MFCC and the clean MFCC, and feeding the resulting error back into the neural network as an adjustment signal to optimize the network, thereby obtaining the trained network.
The concrete process of step four is as follows:
(1) Preprocessing: the preprocessing comprises pre-emphasis, framing, and windowing;
pre-emphasis processing: the method is realized by a first-order high-pass filter, and the transfer function of the filter is as follows:
H(z) = 1 - a·z^(-1)
wherein a is the pre-emphasis coefficient, generally taken as 0.98;
the result of the pre-emphasis processing of the speech signal x (n) is:
y(n)=x(n)-ax(n-1)
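As an illustration (not part of the patent), the pre-emphasis filter can be sketched in a few lines of NumPy; the function name and the assumption of a one-dimensional float signal are ours:

```python
import numpy as np

def pre_emphasis(x, a=0.98):
    """First-order high-pass pre-emphasis: y(n) = x(n) - a*x(n-1)."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]                      # no previous sample at n = 0
    y[1:] = x[1:] - a * x[:-1]
    return y
```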
Framing and windowing: the overlapping part between two adjacent frames is the frame shift, set to 10 ms. Windowing: each frame of the speech signal is windowed with a Hamming window. Framing and windowing y(n) yields y_i(n), defined as (with inc denoting the frame shift in samples):

y_i(n) = ω(n)·y((i-1)·inc + n), 0 ≤ n ≤ L-1

where ω(n) is the Hamming window, whose expression is

ω(n) = 0.54 - 0.46·cos(2πn/(L-1)), 0 ≤ n ≤ L-1

wherein y_i(n) denotes the i-th frame of the speech signal, n the sample index, and L the frame length;
(2) fast Fourier Transform (FFT)
For each frame of the speech signal y_i(n), a fast Fourier transform is performed to obtain the spectrum of each frame, expressed as follows:
Y(i,k) = FFT[y_i(n)]
wherein k denotes the k-th spectral line in the frequency domain;
(3) calculating spectral line energy
The energy E(i,k) of each frame's spectral line in the frequency domain is expressed as:
E(i,k) = |Y(i,k)|²
(4) calculating the energy passed through the Mel Filter
The energy S(i,m) of each frame's spectral-line energy after passing through the Mel filterbank is defined as:

S(i,m) = Σ_{k=0}^{N-1} E(i,k)·H_m(k), 0 ≤ m < M

wherein N denotes the number of FFT points and M the number of filters;
transfer function H of each filterm(k) Is composed of
Wherein, f (M) is the center frequency of the mth filter, M is the mth filter, and M is the number of the filters;
(5) computing MFCC
The logarithm of the Mel-filter energy is taken and the discrete cosine transform is then computed to obtain the MFCC feature parameters, as in the following formula:

mfcc(i,j) = √(2/M) · Σ_{m=1}^{M} log[S(i,m)] · cos(πj(2m-1)/(2M))

where j is the DCT spectral line.
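For readers, steps (2)-(5) can be sketched in Python as follows. This is an illustrative reconstruction under common conventions, not the patented implementation: the windowed frame matrix frames (one frame per row), the sampling rate fs, the FFT size N = 512, and M = 24 Mel filters are all assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(fs, N=512, M=24):
    """Triangular Mel filters H_m(k) over the N-point FFT bins (standard form assumed)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # M+2 equally spaced points on the Mel scale give the center frequencies f(m)
    pts = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), M + 2))
    bins = np.floor((N + 1) * pts / fs).astype(int)
    H = np.zeros((M, N // 2 + 1))
    for m in range(1, M + 1):
        # rising slope from f(m-1) to f(m), falling slope from f(m) to f(m+1)
        H[m - 1, bins[m - 1]:bins[m]] = (np.arange(bins[m - 1], bins[m]) - bins[m - 1]) / max(bins[m] - bins[m - 1], 1)
        H[m - 1, bins[m]:bins[m + 1]] = (bins[m + 1] - np.arange(bins[m], bins[m + 1])) / max(bins[m + 1] - bins[m], 1)
    return H

def mfcc_from_frames(frames, fs, N=512, M=24, n_ceps=13):
    Y = np.fft.rfft(frames, n=N, axis=1)          # (2) FFT per frame: Y(i,k)
    E = np.abs(Y) ** 2                            # (3) spectral line energy E(i,k)
    S = E @ mel_filterbank(fs, N, M).T            # (4) Mel filter energies S(i,m)
    return dct(np.log(S + 1e-12), type=2, axis=1, norm='ortho')[:, :n_ceps]  # (5) MFCC
```

The triangular-filter construction follows the standard HTK-style form of the H_m(k) formula given above.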
The concrete process of step five is as follows:
(1) DNN network establishment
Input layer: the spectrally subtracted speech amplitude and MFCC features are input into the DNN network; the input layer has 128 neuron nodes;
Fully connected layer: 32 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 128 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 512 nodes, dropout rate 0.5, ReLU activation;
(2) multi-target feature fusion:
combining the enhanced amplitude and MFCC features output by the DNN network with the amplitude and MFCC features of the original noisy speech; the fusion pairs, in the k-th feature space, the DNN-predicted MFCC features and speech amplitude with the MFCC features and speech amplitude of the original noisy speech;
(3) C-LSTM network:
(a) CNN:
Convolutional layer: convolving the result of the DNN network, with 64 nodes, stride 1, a 5 × 1 convolution kernel, and SELU activation;
BN layer: batch-normalizes the data;
Convolutional layer: 64 nodes, stride 1, 3 × 1 convolution kernel, SELU activation;
BN layer: batch-normalizes the data;
Convolutional layer: 128 nodes, stride 1, 5 × 1 convolution kernel;
(b) Residual network:
convolving the result of the DNN network, with 128 nodes, stride 1, and a 5 × 1 convolution kernel;
the data obtained by the residual branch is combined with the data obtained by the CNN branch, followed by an SELU activation function;
Max pooling layer: stride 1, pooling size 2;
(c) LSTM network:
the bidirectional long short-term memory network uses 128 nodes in each direction, with a Sigmoid activation function;
(4) an output layer:
using two feedforward neural networks as output layers to output the predicted speech signal amplitude and MFCC; the network model optimizes its parameters with the Adam optimizer; all convolutional layers use edge padding.
(5) Calculating the minimum mean square error objective function:

E = (1/T) Σ_{k=1}^{T} ||F̂_k - F_k||²

where T = 2, F̂_k denotes the predicted feature in the k-th acoustic feature space (the predicted MFCC feature vector for k = 1 and the predicted amplitude feature for k = 2), and F_k the corresponding clean MFCC feature vector and clean amplitude feature.
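A minimal sketch of this multi-target MMSE objective over the T = 2 feature spaces, with illustrative names of our choosing:

```python
import numpy as np

def mmse_loss(pred_mag, clean_mag, pred_mfcc, clean_mfcc):
    """Average of the mean squared errors over the two acoustic feature spaces."""
    mse_mag = np.mean((pred_mag - clean_mag) ** 2)     # amplitude feature space
    mse_mfcc = np.mean((pred_mfcc - clean_mfcc) ** 2)  # MFCC feature space
    return (mse_mag + mse_mfcc) / 2.0
```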
The invention has the following advantages: the enhanced speech signal is stable and has high fidelity and good speech quality.
The present invention will be described in detail below with reference to the accompanying drawings and examples.
Drawings
Fig. 1 is a flow chart of a conventional spectral subtraction method.
Fig. 2 is a flow chart of a DNN-CLSTM network based speech enhancement method in the training phase.
Fig. 3 is a flow chart of a DNN-CLSTM network-based speech enhancement method in a test phase.
FIG. 4 is a MFCC derivation flow diagram.
FIG. 5 is a process architecture diagram for building a DNN-CLSTM neural network.
FIG. 6 is a spectrogram of clean speech.
FIG. 7 is a spectrogram of noisy speech.
FIG. 8 is a spectrogram after DNN speech enhancement.
FIG. 9 is a spectrogram after processing by the CNN speech enhancement method.
FIG. 10 is a spectrogram after processing by the LSTM speech enhancement method.
FIG. 11 is a spectrogram after processing by the GRU speech enhancement method.
FIG. 12 is a spectrogram after processing by the DNN-CLSTM speech enhancement method.
Detailed Description
To overcome the defect that, after enhancement by spectral subtraction, the speech signal often contains a large amount of musical noise, which distorts the signal and degrades its intelligibility and quality, this embodiment provides a speech enhancement method based on a DNN-CLSTM network (as shown in figs. 2 and 3), which includes the following steps:
acquiring two noisy speech signals (or more than two, selected according to actual needs); a noisy speech signal consists of a clean speech signal and a noise signal:
y(m)=x(m)+n(m)
where y(m) is the noisy speech signal, x(m) is the clean speech signal, and n(m) is the noise signal.
A training stage:
1. framing and windowing
Obtaining the amplitude and phase of the clean speech signal and the noisy speech signal, taking the processing of the noisy speech as the example:
the noisy speech signal is windowed and framed, and its amplitude and phase are obtained using the discrete Fourier transform as the first feature;
windowing and framing a noisy speech signal using a hamming window:
y_w(m) = w(m)·y(m) = w(m)·[x(m) + n(m)] = x_w(m) + n_w(m)
the windowing operation is represented in the frequency domain as:
Y_w(f) = W(f) * Y(f) = X_w(f) + N_w(f)
the subscript w of the signal is omitted for simplicity, assuming that the signal is windowed.
Y_w(f) is expressed in polar form:

Y(f) = |Y(f)|·e^{jφ_y(f)}

where |Y(f)| is the amplitude spectrum and φ_y(f) = Phase[Y(f)] is the phase.
2. Noise estimation:
Noise is estimated in the speech segments that contain only noise and no target signal, and the amplitude of the noise signal is calculated; the invention selects the first five frames of the speech signal as the noise segment. Since the magnitude spectrum |N(f)| of the noise is unknown, it is replaced by an estimate of the average magnitude spectrum during the absence of speech activity, and the noise phase φ_n(f) is replaced by the phase φ_y(f) of the noisy speech signal. When no speech is present and only noise exists, the average noise magnitude spectrum is calculated as:

|N̄(f)| = (1/k) Σ_{i=1}^{k} |N_i(f)|

wherein |N_i(f)| is the magnitude spectrum of the i-th noise frame and k is the number of frames in the pure-noise segment.
3. Spectral subtraction method
Subtracting the noise signal amplitude from the amplitude of the noisy speech signal yields the spectrally subtracted speech signal amplitude as the second feature (the enhanced signal obtained by spectral subtraction is called the spectrally subtracted speech signal, and its amplitude the spectrally subtracted speech signal amplitude, to distinguish it from the enhanced speech signal finally obtained in this implementation);
the amplitude of the noise signal is subtracted from the amplitude of the noise-containing speech signal, and the calculation is as follows:
wherein the content of the first and second substances,representing the amplitude of the spectrally subtracted speech signal, | Y (f) luminancebRepresenting the amplitude of the noisy speech signal,represents the average of the noise statistics representing the noise segment. α represents a spectral subtraction noise figure. b is a power exponent, and is a magnitude spectrum subtraction when the exponent b is 1, and is a power spectrum subtraction when the exponent b is 2.
Since the estimation of the noise signal may generate errors, the amplitude spectrum of the estimated signal is causedA negative value may be possible. Generally the value of the amplitude spectrum should not be negative in order to avoidAnd (3) carrying out half-wave rectification on the differential spectrum when the differential spectrum is a negative value, wherein the calculation process is as follows:
after spectral subtraction, the speech signal after spectral subtraction needs to be subjected to power reduction, so as to obtain the speech signal amplitude after the spectral subtraction stage, i.e. the amplitude of the spectral subtraction speech signalThe calculation process is as follows:
4. Extracting the MFCC as the third feature;
(1) Preprocessing
The preprocessing comprises pre-emphasis, framing, and windowing. The pre-emphasis is realized by a first-order high-pass filter whose transfer function is:
H(z) = 1 - a·z^(-1)
wherein a is the pre-emphasis coefficient, generally taken as 0.98.
The result of the pre-emphasis processing of the speech signal x (n) is:
y(n)=x(n)-ax(n-1)
framing refers to the process of generating speech in which the speech signal is known to be a non-stationary time-varying signal. The speech signal in a short time can be considered as a steady time-invariant signal, the short time is usually 10-30 ms, and is taken as 20ms in this text. Therefore, a speech signal is usually analyzed and processed by a short-time analysis technique, many frames are used to analyze their characteristic parameters, and in order to make a smooth transition between frames, a portion where there is an overlap between two adjacent frames, i.e., a frame shift, is set to 10 ms. The purpose of the windowing function is to reduce the spectral leakage in the frequency domain, and each frame of speech signal is windowed, usually with a hamming window, which has a smaller spectral leakage than a rectangular window. y (n) is processed by frame division and window addition to obtain yi(n), which is defined as:
where ω (n) is a Hamming window whose expression is
Wherein, yi(n) represents the i-th frame speech signal, n represents the number of samples, and L represents the frame length.
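A short NumPy sketch of this framing-and-windowing step (assuming a sampling rate fs, the 20 ms frame length and 10 ms shift stated above, and a signal at least one frame long):

```python
import numpy as np

def frame_and_window(y, fs, frame_ms=20, shift_ms=10):
    """Split y into overlapping frames and apply a Hamming window to each frame."""
    L = int(fs * frame_ms / 1000)      # frame length (20 ms)
    inc = int(fs * shift_ms / 1000)    # frame shift (10 ms)
    n_frames = 1 + max(0, (len(y) - L) // inc)
    idx = np.arange(L)[None, :] + inc * np.arange(n_frames)[:, None]
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(L) / (L - 1))  # Hamming window
    return y[idx] * w                  # row i is the windowed frame y_i(n)
```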
(2) Fast Fourier Transform (FFT)
Since the characteristics of a speech signal are generally hard to discern in the time domain, the signal is usually transformed into the frequency domain for analysis, with different spectra representing different characteristics of the speech signal. Thus, for each frame of the speech signal y_i(n), a fast Fourier transform is performed to obtain the spectrum of each frame, as shown in the following formula:
Y(i,k) = FFT[y_i(n)]
where k denotes the k-th spectral line in the frequency domain.
(3) Calculating spectral line energy
The energy E(i,k) of each frame's spectral line in the frequency domain can be expressed as:
E(i,k) = |Y(i,k)|²
(4) calculating the energy passed through the Mel Filter
The energy S(i,m) of each frame's spectral-line energy after passing through the Mel filterbank can be defined as:

S(i,m) = Σ_{k=0}^{N-1} E(i,k)·H_m(k), 0 ≤ m < M

where N represents the number of FFT points.
The transfer function H_m(k) of each (triangular) filter is

H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1);

where f(m) is the center frequency of the m-th filter, m indexes the filters, and M is the number of filters, usually set to 24.
(5) Calculating the MFCC: referring to FIG. 4, the logarithm of the Mel-filter energy is taken and the discrete cosine transform (DCT) is then computed to obtain the MFCC feature parameters, as shown in the following formula:

mfcc(i,j) = √(2/M) · Σ_{m=1}^{M} log[S(i,m)] · cos(πj(2m-1)/(2M))

where j is the DCT spectral line.
5. Training based on the DNN-CLSTM network model;
inputting the two features, namely the spectrally subtracted speech signal amplitude and the MFCC of the noisy speech, into the DNN-CLSTM network for training to obtain an enhanced estimated amplitude and MFCC; computing the minimum mean square error between the estimated amplitude and MFCC and the clean amplitude and MFCC, and feeding the resulting error back into the neural network as an adjustment signal to optimize the network, thereby obtaining the trained network.
And (3) a testing stage:
1. The noisy speech signal is windowed and framed, and its amplitude and phase are obtained using the discrete Fourier transform as the first feature;
2. the noisy speech signal is framed and windowed to obtain the speech signal amplitude of the spectral-subtraction stage, i.e., the spectrally subtracted speech signal amplitude |X̂(f)|, as the second feature;
3. the noisy speech signal is framed and windowed to obtain the MFCC as the third feature;
4. time-frequency decomposition is performed on the noisy speech signal and features are extracted to obtain E_MRACC(j,m) and ΔE_MRACC(j,m) as the fourth feature input to the deep neural network;
5. the noisy-speech amplitude, the spectrally subtracted speech signal amplitude, the MFCC, and the extracted E_MRACC(j,m) and ΔE_MRACC(j,m) features are input into the trained DNN-CLSTM network to obtain the enhanced estimated amplitude, MFCC, and masking threshold;
6. The amplitude of the enhanced speech signal is combined with the phase of the noisy speech signal obtained in step 1, and an inverse Fourier transform is performed to obtain the final enhanced speech signal. The preliminary enhanced speech signal amplitude obtained from the trained neural network is restored to a time-domain signal as follows: the preliminary enhanced amplitude is combined with the noisy-speech phase φ_y(f) extracted in step 1 and then converted to a time-domain signal using the inverse Fourier transform, giving the final enhanced speech signal. The enhanced MFCC and masking threshold obtained here do not participate in waveform recovery; they serve to optimize the network during processing.
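The amplitude-plus-noisy-phase reconstruction of this step can be sketched as follows; the overlap-add synthesis and the argument names are our assumptions, since the patent only specifies the inverse Fourier transform:

```python
import numpy as np

def reconstruct(enh_mag, noisy_phase, L, inc):
    """Combine enhanced magnitude with noisy phase and overlap-add back to time domain."""
    spec = enh_mag * np.exp(1j * noisy_phase)      # per-frame complex spectrum
    frames = np.fft.irfft(spec, n=L, axis=1)       # inverse FFT of each frame
    out = np.zeros(inc * (frames.shape[0] - 1) + L)
    for i, f in enumerate(frames):                 # overlap-add synthesis
        out[i * inc : i * inc + L] += f
    return out
```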
7. Network establishment
The DNN-CLSTM network comprises a deep neural network (DNN), a convolutional neural network (CNN), a residual network, and a bidirectional long short-term memory network (BiLSTM); the specific process of establishing the DNN-CLSTM neural network is shown in fig. 5:
(1) DNN network establishment
Input layer: the spectrally subtracted speech signal amplitude obtained after spectral subtraction is used as input; the input layer has 128 neuron nodes;
Fully connected layer: 32 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 128 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 512 nodes, dropout rate 0.5, ReLU activation;
(2) multi-target feature fusion:
combining the enhanced amplitude and MFCC features output by the DNN network with the amplitude and MFCC features of the original noisy speech; the fusion pairs, in the k-th feature space, the DNN-predicted MFCC features and speech amplitude with the MFCC features and speech amplitude of the original noisy speech;
(a) CNN:
Convolutional layer: convolving the result of the DNN network, with 64 nodes, stride 1, a 5 × 1 convolution kernel, and SELU activation;
BN layer: batch-normalizes the data;
Convolutional layer: 64 nodes, stride 1, 3 × 1 convolution kernel, SELU activation;
BN layer: batch-normalizes the data;
Convolutional layer: 128 nodes, stride 1, 5 × 1 convolution kernel;
(b) Residual network:
convolving the result of the DNN network, with 128 nodes, stride 1, and a 5 × 1 convolution kernel;
the data obtained by the residual branch is combined with the data obtained by the CNN branch, followed by an SELU activation function;
Max pooling layer: stride 1, pooling size 2;
(3) LSTM network:
the bidirectional long short-term memory network uses 128 nodes in each direction, with a Sigmoid activation function;
(4) an output layer:
using two feedforward neural networks as the output layer to output the enhanced speech signal amplitude and MFCC; the network model optimizes its parameters with the Adam optimizer; all convolutional layers use edge padding.
(5) Calculating the minimum mean square error objective function:

E = (1/T) Σ_{k=1}^{T} ||F̂_k - F_k||²

where T = 2, F̂_k denotes the predicted feature in the k-th acoustic feature space (the predicted MFCC feature vector for k = 1 and the predicted amplitude feature for k = 2), and F_k the corresponding clean MFCC feature vector and clean amplitude feature.
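As a reading aid for fig. 5, the following tf.keras sketch assembles the layer stack described above. It is a non-authoritative reconstruction: the input shapes, the 24-dimensional MFCC output head, the use of 'same' padding for the edge-filling mode, and all function names are assumptions on our part, and the masking-threshold output and fusion details of the patent are simplified.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dnn_clstm(feat_dim=128):
    """Sketch of the DNN-CLSTM stack: DNN -> fusion -> CNN + residual -> BiLSTM -> 2 heads."""
    inp = layers.Input(shape=(None, feat_dim))  # per-frame input features (assumed shape)

    # (1) DNN: three fully connected layers with dropout 0.5 and ReLU
    x = inp
    for units in (32, 128, 512):
        x = layers.Dense(units, activation='relu')(x)
        x = layers.Dropout(0.5)(x)

    # (2) Multi-target feature fusion: concatenate DNN output with the original features
    x = layers.Concatenate()([x, inp])

    # (3a) CNN branch: Conv(64, k=5) -> BN -> Conv(64, k=3) -> BN -> Conv(128, k=5)
    c = layers.Conv1D(64, 5, strides=1, padding='same', activation='selu')(x)
    c = layers.BatchNormalization()(c)
    c = layers.Conv1D(64, 3, strides=1, padding='same', activation='selu')(c)
    c = layers.BatchNormalization()(c)
    c = layers.Conv1D(128, 5, strides=1, padding='same')(c)

    # (3b) Residual branch: Conv(128, k=5), added to the CNN output, then SELU and pooling
    r = layers.Conv1D(128, 5, strides=1, padding='same')(x)
    m = layers.Activation('selu')(layers.Add()([c, r]))
    m = layers.MaxPooling1D(pool_size=2, strides=1, padding='same')(m)

    # (3c) Bidirectional LSTM with 128 units per direction and sigmoid activation
    m = layers.Bidirectional(layers.LSTM(128, activation='sigmoid', return_sequences=True))(m)

    # (4) Two feedforward output heads: predicted amplitude and predicted MFCC
    out_mag = layers.Dense(feat_dim, name='amplitude')(m)
    out_mfcc = layers.Dense(24, name='mfcc')(m)

    model = Model(inp, [out_mag, out_mfcc])
    model.compile(optimizer='adam', loss='mse')  # (5) MMSE objective via Adam
    return model
```

Here padding='same' realizes the edge filling required of all convolutional layers, and model.compile wires the Adam optimizer to the MSE (MMSE) objective of step (5).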
[ Experimental examples ]
The speech data used in the experiments are from the TIMIT dataset; the noise data are from the Nonspeech noise library and the Noise-15 noise library. The TIMIT data used in this experiment contain 6300 utterances in total, of which about 80% were used as the training set and the remaining 20% as test speech. All speech was resampled to 16 kHz. Several typical neural-network speech enhancement models were selected for comparison with the proposed method: (a) DNN, a speech enhancement algorithm based on a deep neural network; (b) CNN, based on a convolutional neural network; (c) LSTM, based on a long short-term memory network; and (d) GRU, based on a gated recurrent unit network.
All models were trained at SNRs of -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB, and performance was evaluated at matched signal-to-noise ratios. To test the robustness of the speech enhancement model, performance was also evaluated under mismatched signal-to-noise conditions. PESQ and LSD are two important indexes for evaluating speech: PESQ is the perceptual evaluation of speech quality index; the higher the PESQ score, the better the speech quality. LSD is the log-spectral distance index; the lower the LSD score, the better the speech quality. Table 1 shows the test results compared with the other four algorithms (DNN, CNN, LSTM, GRU) under matched noise conditions, with the best-performing results shown in bold. Table 2 shows the corresponding results under mismatched noise conditions (a sketch of the LSD computation follows the tables).
TABLE 1 test results under matched noise conditions, best performing ones are in bold
TABLE 2 test results under mismatched noise conditions, best performing ones are in bold
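As referenced above, a common definition of the log-spectral distance (LSD) can be sketched as follows; the patent does not give its exact formula, so this form is an assumption:

```python
import numpy as np

def lsd(clean_mag, enh_mag, eps=1e-12):
    """Log-spectral distance between two (frames x bins) magnitude-spectrum matrices."""
    log_diff = 10 * np.log10((clean_mag ** 2 + eps) / (enh_mag ** 2 + eps))
    return np.mean(np.sqrt(np.mean(log_diff ** 2, axis=1)))  # average over frames
```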
Claims (3)
1. A speech enhancement method based on DNN-CLSTM network is characterized by comprising the following steps:
Step one: acquiring a noisy speech signal; the noisy speech signal is formed by adding a clean speech signal and a noise signal:
y(m)=x(m)+n(m)
wherein y(m) is the noisy speech signal, x(m) is the clean speech signal, n(m) is the noise signal, and m is the discrete time index;
Step two: performing framing and windowing to obtain the amplitude and phase of the clean speech signal and the noisy speech signal;
the noisy speech signal is windowed and framed, and its amplitude and phase are obtained using the discrete Fourier transform; the first five frames of the speech segment are used for noise estimation to calculate the amplitude of the noise signal;
Step three: subtracting the noise signal amplitude from the amplitude of the noisy speech signal to obtain the spectrally subtracted speech signal amplitude as the first feature;
Step four: obtaining the MFCC of the speech signal as the second feature;
Step five: establishing a DNN-CLSTM network model for training;
inputting the two features, namely the spectrally subtracted speech signal amplitude and the MFCC of the noisy speech, into the DNN-CLSTM network for training to obtain a predicted amplitude and MFCC; computing the minimum mean square error (MMSE) between the predicted amplitude and MFCC and the clean amplitude and MFCC, and feeding the resulting error back into the neural network as an adjustment signal to optimize the network, thereby obtaining the trained network.
2. The speech enhancement method based on the DNN-CLSTM network of claim 1, wherein the concrete process of step four is:
(1) preprocessing: the preprocessing comprises pre-emphasis, framing, and windowing;
pre-emphasis processing: the method is realized by a first-order high-pass filter, and the transfer function of the filter is as follows:
H(z) = 1 - a·z^(-1)
wherein a is the pre-emphasis coefficient, generally taken as 0.98; the result of the pre-emphasis processing of the speech signal x(n) is:
y(n)=x(n)-ax(n-1)
framing and windowing: the overlapping part between two adjacent frames is the frame shift, set to 10 ms; windowing: each frame of the speech signal is windowed with a Hamming window; framing and windowing y(n) yields y_i(n), defined as (with inc denoting the frame shift in samples):

y_i(n) = ω(n)·y((i-1)·inc + n), 0 ≤ n ≤ L-1

where ω(n) is the Hamming window, whose expression is

ω(n) = 0.54 - 0.46·cos(2πn/(L-1)), 0 ≤ n ≤ L-1

wherein y_i(n) denotes the i-th frame of the speech signal, n the sample index, and L the frame length;
(2) fast Fourier Transform (FFT)
For each frame of the speech signal y_i(n), a fast Fourier transform is performed to obtain the spectrum of each frame, expressed as follows:
Y(i,k) = FFT[y_i(n)]
wherein k denotes the k-th spectral line in the frequency domain;
(3) calculating spectral line energy
The energy E(i,k) of each frame's spectral line in the frequency domain is expressed as:
E(i,k) = |Y(i,k)|²
(4) calculating the energy passed through the Mel Filter
The energy S(i,m) of each frame's spectral-line energy after passing through the Mel filterbank is defined as:

S(i,m) = Σ_{k=0}^{N-1} E(i,k)·H_m(k), 0 ≤ m < M

wherein N represents the number of FFT points and M the number of filters;
transfer function H of each filterm(k) Is composed of
Wherein f (m) is the center frequency of the mth filter, and m is the mth filter;
(5) computing MFCC
The logarithm of the Mel-filter energy is taken and the discrete cosine transform is then computed to obtain the MFCC feature parameters, as shown in the following formula:

mfcc(i,j) = √(2/M) · Σ_{m=1}^{M} log[S(i,m)] · cos(πj(2m-1)/(2M))

where j is the spectral line after the discrete cosine transform (DCT).
3. The speech enhancement method based on the DNN-CLSTM network of claim 1, wherein the concrete process of step five is:
(1) DNN network establishment
Input layer: the spectrally subtracted speech amplitude and MFCC features are input into the DNN network; the input layer has 128 neuron nodes;
Fully connected layer: 32 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 128 nodes, dropout rate 0.5, ReLU activation;
Fully connected layer: 512 nodes, dropout rate 0.5, ReLU activation;
(2) multi-target feature fusion:
combining the enhanced amplitude and MFCC features output by the DNN network with the amplitude and MFCC features of the original noisy speech; the fusion pairs, in the k-th feature space, the DNN-predicted MFCC features and speech amplitude with the MFCC features and speech amplitude of the original noisy speech;
(3) C-LSTM network:
(a) CNN:
Convolutional layer: convolving the result of the DNN network, with 64 nodes, stride 1, a 5 × 1 convolution kernel, and SELU activation;
BN layer: batch-normalizes the data;
Convolutional layer: 64 nodes, stride 1, 3 × 1 convolution kernel, SELU activation;
BN layer: batch-normalizes the data;
Convolutional layer: 128 nodes, stride 1, 5 × 1 convolution kernel;
(b) Residual network:
convolving the result of the DNN network, with 128 nodes, stride 1, and a 5 × 1 convolution kernel;
the data obtained by the residual branch is combined with the data obtained by the CNN branch, followed by an SELU activation function;
Max pooling layer: stride 1, pooling size 2;
(c) LSTM network:
the bidirectional long short-term memory network uses 128 nodes in each direction, with a Sigmoid activation function;
(4) an output layer:
using two feedforward neural networks as output layers to output the predicted speech signal amplitude and MFCC; the network model optimizes its parameters with the Adam optimizer; all convolutional layers use edge padding;
(5) calculating the minimum mean square error objective function:

E = (1/T) Σ_{k=1}^{T} ||F̂_k - F_k||²

where T = 2, F̂_k denotes the predicted feature in the k-th acoustic feature space, and F_k the corresponding clean feature.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011323987.2A CN112735456B (en) | 2020-11-23 | 2020-11-23 | Speech enhancement method based on DNN-CLSTM network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011323987.2A CN112735456B (en) | 2020-11-23 | 2020-11-23 | Speech enhancement method based on DNN-CLSTM network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112735456A true CN112735456A (en) | 2021-04-30 |
CN112735456B CN112735456B (en) | 2024-01-16 |
Family
ID=75597716
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011323987.2A Active CN112735456B (en) | 2020-11-23 | 2020-11-23 | Speech enhancement method based on DNN-CLSTM network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112735456B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020024452A1 (en) * | 2018-07-31 | 2020-02-06 | 平安科技(深圳)有限公司 | Deep learning-based answering method and apparatus, and readable storage medium |
CN109410976A (en) * | 2018-11-01 | 2019-03-01 | 北京工业大学 | Sound enhancement method based on binaural sound sources positioning and deep learning in binaural hearing aid |
CN110060704A (en) * | 2019-03-26 | 2019-07-26 | 天津大学 | A kind of sound enhancement method of improved multiple target criterion study |
CN110930997A (en) * | 2019-12-10 | 2020-03-27 | 四川长虹电器股份有限公司 | Method for labeling audio by using deep learning model |
Non-Patent Citations (1)
Title |
---|
Yao Yuan; Wang Qiuju; Zhou Wei; Bao Chengyi; Peng Lei: "Research on speech enhancement combining improved spectral subtraction with a neural network" (改进谱减法结合神经网络的语音增强研究), Electronic Measurement Technology (电子测量技术), no. 07 *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113611323A (en) * | 2021-05-07 | 2021-11-05 | 北京至芯开源科技有限责任公司 | Voice enhancement method and system based on dual-channel convolution attention network |
CN113611323B (en) * | 2021-05-07 | 2024-02-20 | 北京至芯开源科技有限责任公司 | Voice enhancement method and system based on double-channel convolution attention network |
CN113269305A (en) * | 2021-05-20 | 2021-08-17 | 郑州铁路职业技术学院 | Feedback voice strengthening method for strengthening memory |
CN113269305B (en) * | 2021-05-20 | 2024-05-03 | 郑州铁路职业技术学院 | Feedback voice strengthening method for strengthening memory |
CN113314136A (en) * | 2021-05-27 | 2021-08-27 | 西安电子科技大学 | Voice optimization method based on directional noise reduction and dry sound extraction technology |
CN113192520A (en) * | 2021-07-01 | 2021-07-30 | 腾讯科技(深圳)有限公司 | Audio information processing method and device, electronic equipment and storage medium |
CN114283829A (en) * | 2021-12-13 | 2022-04-05 | 电子科技大学 | Voice enhancement method based on dynamic gate control convolution cyclic network |
CN114283829B (en) * | 2021-12-13 | 2023-06-16 | 电子科技大学 | Voice enhancement method based on dynamic gating convolution circulation network |
CN114093379A (en) * | 2021-12-15 | 2022-02-25 | 荣耀终端有限公司 | Noise elimination method and device |
CN114093379B (en) * | 2021-12-15 | 2022-06-21 | 北京荣耀终端有限公司 | Noise elimination method and device |
CN117193391A (en) * | 2023-11-07 | 2023-12-08 | 北京铁力山科技股份有限公司 | Intelligent control desk angle adjustment system |
CN117193391B (en) * | 2023-11-07 | 2024-01-23 | 北京铁力山科技股份有限公司 | Intelligent control desk angle adjustment system |
Also Published As
Publication number | Publication date |
---|---|
CN112735456B (en) | 2024-01-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |