CN111192598A - Voice enhancement method for jump connection deep neural network - Google Patents

Voice enhancement method for jump connection deep neural network

Info

Publication number
CN111192598A
Authority
CN
China
Prior art keywords
voice
speech
dnn
skip
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010012435.3A
Other languages
Chinese (zh)
Inventor
兰朝凤
刘春东
苏崎木
郭思诚
陈小艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202010012435.3A
Publication of CN111192598A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speech enhancement method based on a skip-connection deep neural network, which solves the problems of speech loss and low intelligibility that the traditional deep neural network (DNN) speech enhancement method suffers in low signal-to-noise-ratio scenes, and belongs to the field of speech enhancement. The invention comprises the following steps: S1, extracting time-frequency domain features from the time-domain speech signal; S2, determining a training target, and feeding the training target and the extracted time-frequency domain features into a Skip-DNN model for training to obtain a Skip-DNN speech enhancement model, where skip connections are adopted among the input layer, hidden layers and output layer of the Skip-DNN model; S3, extracting the features of the noisy speech, inputting them into the Skip-DNN speech enhancement model, and estimating the target speech; and S4, synthesizing the target speech with the noisy speech to obtain an enhanced clean speech signal.

Description

Voice enhancement method for jump connection deep neural network
Technical Field
The invention relates to deep neural networks, in particular to a multi-resolution cochleagram speech enhancement method based on an improved skip-connection deep neural network, and belongs to the field of speech enhancement.
Background
Speech enhancement is a front-end technology for speech recognition and has important applications in communications and human-computer interaction. In communications, two speakers are often in different auditory scenes; if noise surrounds one or both of them, communication becomes difficult. This is especially true in complex scenes in the military field, where communication tasks become harder and the special requirements of such scenes demand higher transmitted speech quality; speech enhancement technology can then improve the quality and intelligibility of the noisy speech. In human-computer interaction, many smart devices have in recent years begun to use speech instead of the keyboard as the input channel, but complex noise is unavoidable in real-life scenes and lowers the speech recognition rate; the usual remedy is to add a speech enhancement algorithm at the front end of speech recognition to improve speech quality. In both the communications and the human-computer interaction fields, speech enhancement algorithms can therefore improve the intelligibility of noisy speech and enhance speech quality, making this a research topic with broad application prospects.
Speech enhancement algorithms are mainly divided into unsupervised and supervised approaches. Spectral subtraction and Wiener filtering are common unsupervised methods, but spectral subtraction introduces additional "musical noise" while enhancing speech quality. Ephraim and Malah assumed the noise to be stationary Gaussian noise and enhanced the speech signal effectively with a minimum mean-square error estimation algorithm, while also reducing the interference of musical noise. Most noise in real life, however, is non-stationary, and Martin proposed a speech enhancement algorithm based on minimum statistics for non-stationary noise. Studies by many researchers show that unsupervised speech enhancement works well at high signal-to-noise ratios with stationary noise, but its effect is mediocre at low signal-to-noise ratios with non-stationary noise.
With the development of the technology, researchers have turned to supervised speech enhancement algorithms, including shallow and deep neural networks, to process speech corrupted by non-stationary noise. The non-negative matrix factorization approach in the shallow-network family trains clean speech and noise separately under the assumption that they are independent, and achieves a certain enhancement effect. Because a shallow network has little training data and few layers, it fits the test data poorly and can only extract simple features, so its enhancement effect is limited. With the continuous development of deep learning, deep neural networks have gradually been applied to speech enhancement. Wang et al. used a deep neural network (DNN) to train a time-frequency mask between clean speech and noise, greatly improving the intelligibility of the speech signal. Xu et al. used a DNN model to establish a nonlinear relationship between the noisy-speech power spectrum and the clean-speech power spectrum, used Dropout to prevent overfitting, and used a mini-batch stochastic gradient descent algorithm to speed up training. Williamson et al. estimated the real and imaginary components of the complex ideal ratio mask with a DNN speech enhancement model, so as to estimate both the amplitude and the phase of the speech and correct the phase shift caused by noise. Chen et al. proposed the multi-resolution cochleagram speech feature for low signal-to-noise-ratio environments, capturing both global and local characteristics of the speech signal and improving enhancement in low-SNR scenes. To improve the generalization ability of the DNN speech enhancement model, Chen et al. added perturbations to the noise during training so that the noise became more diverse, improving the enhancement effect. Tu et al. used separate DNN output layers to estimate the target and the interference, and the accuracy of the estimated target speech markedly improved speech recognition. Tu et al. also proposed a Skip-DNN speech enhancement model that uses MEL-frequency magnitudes as the network input and output features, which better alleviates the vanishing-gradient problem, carries more speech information during training, and improves enhancement performance. Tseng et al. extracted speech features with sparse non-negative matrix factorization and estimated the ideal binary mask (IBM) with a DNN speech enhancement model, improving speech intelligibility to some extent at low signal-to-noise ratios.
According to the above analysis, most DNN-based speech enhancement algorithms adopt only a fully connected structure. In low signal-to-noise-ratio environments, the fully connected structure easily ignores part of the clean speech features during training, and speech loss occurs.
Disclosure of Invention
In order to solve the problems of speech loss and low intelligibility of the traditional deep neural network DNN speech enhancement method in a low signal-to-noise ratio scene, the invention provides a speech enhancement method of a jump connection deep neural network.
The invention discloses a voice enhancement method of a jump connection deep neural network, which comprises the following steps:
s1, extracting time-frequency domain characteristics according to the time-domain voice signals;
s2, determining a training target, and sending the training target and the extracted time-frequency domain characteristics into a Skip-DNN model for training to obtain a Skip-DNN voice enhancement model;
the Skip-DNN model comprises an input layer, a hidden layer and an output layer, wherein the input layer, the hidden layer and the output layer are in jump connection; the first module comprises two hidden layers, and the number of nodes of the second hidden layer is the same as that of the nodes of the input layer;
the second module and the third module have the same structure as the first module, the fourth module only has one hidden layer, and the number of nodes of the hidden layer is the same as that of the nodes of the input layer;
s3, extracting the characteristics of the voice with noise, inputting the characteristics into a Skip-DNN voice enhancement model, and estimating a target voice;
and S4, synthesizing the target voice and the voice with the noise to obtain an enhanced pure voice signal.
Preferably, each of the first to third modules further includes a Dropout layer; the Dropout layer is disposed between the two hidden layers of the module, and during forward propagation of the Skip-DNN model it sets a proportion of the hidden-layer node values to 0.
Preferably, the features extracted in S1 and S3 are cochleagram features.
The beneficial effect of the invention is that two non-adjacent layers of the deep neural network are connected together by skip connections to form a skip-connection DNN (Skip-DNN) model. While alleviating the vanishing-gradient problem, the model also reduces the loss of speech information; a Dropout layer is added to the Skip-DNN to prevent overfitting, and the multi-resolution cochleagram (MRCG) is used as the input feature of the Skip-DNN. 150 clean utterances from the TIMIT corpus were selected, and 4 signal-to-noise ratios and 4 noise types were analyzed. The experimental results show that at a signal-to-noise ratio of -5 dB, with MRCG as the feature input of the Skip-DNN model, the average perceptual evaluation of speech quality (PESQ) is 1.16145 and the average short-time objective intelligibility (STOI) is 0.70843, which are respectively 9% and 27% higher than when MEL-frequency is used as the feature input of the Skip-DNN. The study shows that with MRCG as the input feature of the Skip-DNN model, the trained speech enhancement effect is better than with MEL-frequency as the input feature, the problem of speech loss in low signal-to-noise-ratio environments can be alleviated, and a robust speech enhancement effect can be obtained.
Drawings
FIG. 1 is a schematic diagram of the Skip-DNN model;
FIG. 2 is a schematic diagram of the principles of the present invention;
FIG. 3 is a MEL-frequency logarithmic feature graph for pure speech;
FIG. 4 is a diagram of the MRCG features of clean speech, where (a) is the CG1 feature, (b) is the CG2 feature, (c) is the CG3 feature, and (d) is the CG4 feature;
FIG. 5 shows the evaluation results of the DNN + MRCG and Skip-DNN + MRCG speech enhancement algorithms, wherein (a) shows the average PESQ value of the enhanced speech and (b) shows the average STOI value of the enhanced speech;
FIG. 6 is a graph of speech enhancement effectiveness evaluation using MRCG and MEL-frequency for feature inputs, where (a) is the average PESQ value for the enhanced speech, where (b) is the average STOI value for the enhanced speech, and where (c) is the average SegSNR value for the enhanced speech;
FIG. 7 shows the effect of the Dropout layer on the Skip-DNN speech enhancement model, where (a) is the average PESQ value of the enhanced speech and (b) is the average STOI value of the enhanced speech;
FIG. 8 shows the effect of the Dropout layer on the training process of the Skip-DNN speech enhancement model in a -5 dB environment, where (a) shows the MSE values of the training and validation sets with the Dropout layer and (b) shows the MSE values of the training and validation sets without the Dropout layer;
fig. 9 shows the evaluation of Skip-DNN speech enhancement effect under four signal-to-noise ratios, wherein (a) shows the average PESQ value of the enhanced speech and (b) shows the average STOI value of the enhanced speech.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
This embodiment provides a speech enhancement method based on a skip-connection deep neural network, which involves three processing stages: feature extraction, model building and training-target selection. In feature extraction, the time-domain speech signal is generally converted to the time-frequency domain; the choice of training target has an important influence on the training result. The extracted features and the training target are fed into the Skip-DNN model, which learns the nonlinear relationship between the noisy-speech features and the training target to obtain the Skip-DNN speech enhancement model. In the enhancement stage, the noisy-speech features are first extracted and input into the Skip-DNN speech enhancement model to estimate the target; the enhanced speech is then obtained by speech synthesis, and finally the quality of the enhanced speech is evaluated with speech evaluation measures. As shown in fig. 2, the method of this embodiment comprises:
S1, extracting time-frequency domain features from the time-domain speech signal; in this embodiment the time-frequency domain features may be obtained by methods such as the log-power spectrum, Mel-frequency cepstral coefficients (MFCC), MEL-frequency and MRCG;
s2, determining a training target, and sending the training target and the extracted time-frequency domain characteristics into a Skip-DNN model for training to obtain a Skip-DNN voice enhancement model; the present embodiment may use a clean speech power spectrum, an Ideal Binary Mask (IBM), or an Ideal Ratio Mask (IRM), etc. as a training target;
the Skip-DNN model of this embodiment comprises an input layer, four modules and an output layer, connected by skip connections; the first module comprises two hidden layers, and the number of nodes of the second hidden layer is the same as the number of nodes of the input layer;
the second module and the third module have the same structure as the first module, the fourth module only has one hidden layer, and the number of nodes of the hidden layer is the same as that of the nodes of the input layer;
s3, extracting the characteristics of the voice with noise, inputting the characteristics into a Skip-DNN voice enhancement model, and estimating a target voice;
and S4, synthesizing the target voice and the voice with the noise to obtain an enhanced pure voice signal.
Owing to its strong nonlinear processing capability, the DNN is used in speech enhancement to learn the nonlinear relationship between the input features and the training target, and its hierarchical structure can automatically extract useful information from the speech features. In a low signal-to-noise-ratio environment, however, the large noise energy masks part of the useful speech information, so the DNN cannot learn this nonlinear relationship well and the enhancement effect is not obvious. A conventional DNN generally consists of an input layer, several hidden layers and an output layer connected in a fully connected manner. The Skip-DNN model differs from the conventional DNN: the model of this embodiment adds skip connections between layers of the DNN structure to reduce information loss, forming the Skip-DNN speech enhancement model.
In a preferred embodiment, the structure of the Skip-DNN model is shown in fig. 1. The first module of this embodiment consists of two hidden layers and a Dropout layer; the number of nodes of the second hidden layer is the same as that of the input layer, which solves the dimension mismatch between the hidden layer and the input layer. During forward propagation of the Skip-DNN, the Dropout layer sets a proportion of the hidden-layer node values to 0, so that the model does not depend on certain local features during training and overfitting is prevented. The second and third modules have the same network structure as the first module; the fourth module is slightly different from the first three: it has only one hidden layer, whose number of nodes is still the same as that of the input layer.
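For concreteness, the module structure described above can be sketched in code. The following is a minimal illustrative sketch, not the patented implementation: it assumes PyTorch, uses the hidden width of 1024 nodes and dropout rate of 0.2 given in the experimental settings below, assumes a sigmoid output because the training target (the IRM described later) lies in [0, 1], and all class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class SkipModule(nn.Module):
    """One Skip-DNN module: two hidden layers with a Dropout layer between
    them; the second hidden layer has as many nodes as the module input so
    that the skip connection ReLU(a2 + y) is dimensionally consistent."""
    def __init__(self, in_dim, hidden_dim=1024, dropout=0.2):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.drop = nn.Dropout(dropout)
        self.fc2 = nn.Linear(hidden_dim, in_dim)   # back to the input width

    def forward(self, y):
        a1 = torch.relu(self.fc1(y))     # first hidden layer
        a1 = self.drop(a1)               # zero a proportion of node values
        a2 = torch.relu(self.fc2(a1))    # second hidden layer
        return torch.relu(a2 + y)        # skip connection with the input

class SkipDNN(nn.Module):
    """Input layer -> three two-hidden-layer skip modules -> one
    single-hidden-layer module -> output layer, as described above."""
    def __init__(self, in_dim, out_dim, hidden_dim=1024, dropout=0.2):
        super().__init__()
        self.blocks = nn.ModuleList(
            [SkipModule(in_dim, hidden_dim, dropout) for _ in range(3)]
        )
        self.fc4 = nn.Linear(in_dim, in_dim)   # fourth module: one hidden layer
        self.out = nn.Linear(in_dim, out_dim)

    def forward(self, y):
        h = y
        for block in self.blocks:
            h = block(h)
        h = torch.relu(torch.relu(self.fc4(h)) + h)   # fourth module + skip
        return torch.sigmoid(self.out(h))  # assumed output activation for a [0, 1] mask
```

The key design point is that each module's last hidden layer returns to the input width so that the skip addition is well defined, which is exactly why the embodiment requires the second hidden layer to have the same number of nodes as the input layer.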
The forward propagation of the Skip-DNN model under the activation function is defined mathematically as follows (taking the first module of the Skip-DNN model as an example):

a^(1) = max(W^(1) y + b^(1), 0)

where n denotes the number of nodes of the first hidden layer, m denotes the number of nodes of the input layer, a^(1) represents the estimate of the input data obtained by the first hidden layer, max(x, 0) is the expression of the nonlinear activation function ReLU, y = [y1, ..., ym]^T represents the input m-dimensional noisy speech signal, W^(1) (an n x m matrix) is the weight of the first hidden layer, and b^(1) = [b1, ..., bn]^T is the bias of the first hidden layer.

a^(2) = max(W^(2) a^(1) + b^(2), 0)

where a^(2) represents the estimate of the input data obtained through the second hidden layer, W^(2) (an m x n matrix) is the weight of the second hidden layer, and b^(2) = [b1, ..., bm]^T is the bias of the second hidden layer.

a_out = max(a^(2) + y, 0)

that is, the estimated value output by the first module is obtained by adding the estimate of the second hidden layer to the input m-dimensional noisy speech signal and then activating with the ReLU function.
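As a numerical illustration of the three equations above, the following NumPy sketch computes a^(1), a^(2) and the module output for a toy input; the sizes and the random weights are for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 16                      # toy sizes: input nodes and first-hidden-layer nodes

y  = rng.standard_normal(m)       # [y1, ..., ym], the noisy-speech feature vector
W1 = rng.standard_normal((n, m)) * 0.1
b1 = np.zeros(n)
W2 = rng.standard_normal((m, n)) * 0.1
b2 = np.zeros(m)

relu = lambda x: np.maximum(x, 0)

a1 = relu(W1 @ y + b1)            # first hidden layer, a^(1) in R^n
a2 = relu(W2 @ a1 + b2)           # second hidden layer, a^(2) in R^m
module_out = relu(a2 + y)         # skip connection: add the input, then ReLU
```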
In the Skip-DNN model, the mean square error (MSE) is used to update the weights and biases of the hidden layers during back propagation: the prediction of the model is subtracted from the true value of the training target, the difference is squared, and the result is averaged, so the MSE can be expressed as:

MSE = (1/N) Σ_{t,f} [y(t, f) − y′(t, f)]²

where N denotes the number of input time frames, and y(t, f) and y′(t, f) denote, respectively, the training target and the neural-network prediction on the time-frequency unit.
To illustrate the influence of the skip connection on the data transmission process, the fourth module of the Skip-DNN model is taken as an example:
Assume the input speech-signal feature is Y, the feature learned by the Skip-DNN model is denoted I(Y), and the nonlinear transform of Y is denoted F(Y); the feature obtained through skip-connection training can then be written as I(Y) = F(Y) + Y. In ResNet, redundant layers appear because the network is very deep, but in Skip-DNN the number of layers is not large, i.e. F(Y) is not equal to 0. Since F(Y) ≠ 0, the hidden layers learn new features in addition to the input features during back propagation of the Skip-DNN, which avoids the phenomenon of speech loss and alleviates the vanishing-gradient problem.
In a preferred embodiment, the noise is additive, and the mixed speech signal can be expressed as:
Y(t)=S(t)+N(t)
where Y (t) represents a noisy speech signal, S (t) represents a clean speech signal, and N (t) represents a noise signal.
According to the time-frequency domain variation relationship, the mixed speech of the time-frequency domain can be expressed as:
Y(t,f)=S(t,f)+N(t,f)
the corresponding relations of the time-frequency domain obtained by gamma transform domain transformation of Y (t), S (t) and N (t) are Y (t, f), S (t, f) and N (t, f), which respectively represent the time-frequency domain signals of noisy speech, pure speech and noise. f is the frequency index and t is the frame index.
Speech is a short-time stationary signal: a speech signal is stationary over roughly 10-30 ms, so in speech signal processing the signal is usually framed to reduce the influence of its overall non-stationarity and time variation. The MEL-frequency feature is based on human auditory perception experiments: the human ear behaves like a bank of filters that focus on signals in specific frequency bands, so in processing, the speech signal can be framed, transformed to the Fourier domain, and passed through a MEL filter bank to obtain a MEL-frequency feature that matches the auditory characteristics of the human ear. MRCG differs from MEL-frequency: it is based on cochleagrams computed with a Gammatone filter bank, and by framing and windowing the speech signal with different window lengths and window shifts it obtains cochleagrams at four resolutions. One high-resolution cochleagram captures local features (cochleagram 1, CG1), and three low-resolution cochleagrams capture global features at different scales (CG2, CG3 and CG4); CG1, CG2, CG3 and CG4 are spliced along the time direction, and the fused dynamic speech feature is then obtained. The features extracted by MRCG from noisy speech are robust, so MRCG is better suited as the feature of a speech enhancement algorithm in low signal-to-noise-ratio environments.
Compared with the MEL filter bank, the Gammatone filter bank better matches the auditory characteristics of the human ear, so the MRCG computed in the Gammatone transform domain can extract speech features more accurately. Therefore, to improve the speech enhancement effect, especially at low signal-to-noise ratios, this embodiment uses the MRCG feature parameters as the input of the Skip-DNN speech enhancement model, compares and analyzes the enhancement effects of different feature parameters and different speech enhancement models on real speech data in different environments, and evaluates the quality and intelligibility of the enhanced speech.
In a preferred embodiment, the time-frequency masking target IRM is used as the training target; IRM ∈ [0, 1] and represents the proportion of clean-speech energy in the mixed-speech energy. According to the time-frequency transform relationship, the IRM can be expressed as:

IRM(t, f) = ( |S(t, f)|² / ( |S(t, f)|² + |N(t, f)|² ) )^β

where |S(t, f)|² represents the energy of the clean speech in the time-frequency unit, |N(t, f)|² represents the energy of the noise, and β is a scale factor; in this embodiment β = 1/2 is taken empirically, which gives the best training effect.
Multiplying the obtained IRM by the noisy speech gives the estimated target-speech magnitude:

|Ŝ(t, f)| = IRM(t, f) ⊗ |Y(t, f)|

where ⊗ denotes element-wise multiplication and |Ŝ(t, f)| represents the target-speech magnitude estimated in the time-frequency domain.
Then the estimated clean-speech magnitude and the phase of the noisy speech are combined to reconstruct the estimated clean speech signal, which can be expressed as:

Ŝ(t, f) = |Ŝ(t, f)| · e^(j∠Y(t, f))

where ∠Y(t, f) represents the phase of the noisy speech, |Ŝ(t, f)| represents the estimated target-speech magnitude, and Ŝ(t, f), after inverse transformation back to the time domain, gives the reconstructed clean speech signal.
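A minimal sketch of the IRM training target and the reconstruction step described above, assuming SciPy's STFT/ISTFT as the time-frequency transform (the embodiment itself works in a Gammatone transform domain) and β = 1/2; at enhancement time the oracle IRM computed here would be replaced by the mask predicted by the Skip-DNN:

```python
import numpy as np
from scipy.signal import stft, istft

def irm_and_reconstruct(clean, noise, fs=8000, beta=0.5):
    """clean and noise are equal-length time-domain signals sampled at fs."""
    noisy = clean + noise                                  # additive mixing, Y(t) = S(t) + N(t)
    _, _, S = stft(clean, fs, nperseg=160, noverlap=80)    # 20 ms frames, 10 ms shift
    _, _, N = stft(noise, fs, nperseg=160, noverlap=80)
    _, _, Y = stft(noisy, fs, nperseg=160, noverlap=80)

    # IRM(t, f) = (|S|^2 / (|S|^2 + |N|^2))^beta
    irm = (np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12)) ** beta

    est_mag = irm * np.abs(Y)                              # estimated target-speech magnitude
    est_spec = est_mag * np.exp(1j * np.angle(Y))          # attach the noisy-speech phase
    _, est_speech = istft(est_spec, fs, nperseg=160, noverlap=80)
    return irm, est_speech
```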
The method of this embodiment further includes speech evaluation: STOI, PESQ and the segmental signal-to-noise ratio (SegSNR) are selected to evaluate speech quality. STOI is an objective evaluation method for speech intelligibility; it uses short-time units and removes silent units, and its value lies between 0 and 1, with larger values indicating higher intelligibility. PESQ and SegSNR are objective evaluation methods for speech quality; the PESQ score ranges from 1.0 to 4.5, with higher scores indicating better quality. SegSNR is an improvement of the SNR: since the speech signal is non-stationary, it is divided into short quasi-stationary segments, the SNR of each short unit is computed, and the average over the whole utterance is taken. The SegSNR is calculated as:

SegSNR = (10 / M) Σ_{m=0}^{M−1} log10 ( Σ_{n=Nm}^{Nm+N−1} x(n)² / Σ_{n=Nm}^{Nm+N−1} [x(n) − x̂(n)]² )

where x̂(n) denotes the enhanced speech, x(n) denotes the clean speech, M denotes the number of frames, N denotes the frame length, and m = 0, 1, ..., M−1. This embodiment takes the presence of silent units into account, so the SegSNR only considers frames within [-10, 35] dB. The three evaluation methods above are used to evaluate the quality of Skip-DNN speech enhancement, and the results are compared with those of the conventional model to give the analysis.
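A sketch of the SegSNR computation described above, assuming NumPy, a fixed frame length and per-frame clamping to [-10, 35] dB:

```python
import numpy as np

def seg_snr(clean, enhanced, frame_len=160):
    """Frame-wise SNR averaged over frames, each frame clamped to [-10, 35] dB."""
    n_frames = min(len(clean), len(enhanced)) // frame_len
    snrs = []
    for m in range(n_frames):
        sl = slice(m * frame_len, (m + 1) * frame_len)
        signal_energy = np.sum(clean[sl] ** 2)
        error_energy = np.sum((clean[sl] - enhanced[sl]) ** 2) + 1e-12
        snr = 10.0 * np.log10(signal_energy / error_energy + 1e-12)
        snrs.append(np.clip(snr, -10.0, 35.0))              # silent-unit clamping
    return float(np.mean(snrs))
```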
Experimental data processing and analysis:
research on feature extraction effect of MEL-frequency and MRCG
In order to analyze the influence of MEL-frequency and MRCG on the extraction effect of feature extraction parameters, the following experimental study was performed.
The utterance si839.wav from the TIMIT corpus was selected. The speech signal was framed with a frame length of 20 ms and a frame shift of 10 ms, and a Hanning window was applied to obtain short-time speech signals; an FFT was then performed on the short-time signals, and the frequency-domain signals after the FFT were passed through a 64-channel Mel filter bank to obtain the Mel-frequency feature of the clean speech, whose logarithmic feature map is shown in fig. 3. A sketch of this extraction is given below.
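The following is an illustrative sketch of the log MEL-frequency extraction just described, assuming librosa is available and that si839.wav sits in the working directory (the path is illustrative):

```python
import numpy as np
import librosa

# 20 ms frames with a 10 ms shift at 8 kHz, Hanning window, 64 Mel channels
y, sr = librosa.load("si839.wav", sr=8000)
spec = np.abs(librosa.stft(y, n_fft=256, hop_length=80, win_length=160,
                           window="hann")) ** 2          # short-time power spectrum
mel_fb = librosa.filters.mel(sr=sr, n_fft=256, n_mels=64)  # 64-channel Mel filter bank
mel_feat = np.log(mel_fb @ spec + 1e-10)                  # log MEL-frequency feature
```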
To obtain the MRCG features of si839.wav, the speech signal is first decomposed in time and frequency: a 64-channel 4th-order Gammatone filter bank splits the speech into 64 sub-band signals. With a frame length of 20 ms and a frame shift of 10 ms, the high-resolution cochleagram CG1 is obtained with a feature size of 465 x 64; CG1 is also compressed logarithmically so that the feature better matches the auditory characteristics of the human ear. The low-resolution cochleagram CG2 is obtained in essentially the same way as CG1, except that the frame length is changed to 200 ms, and CG2 also has a feature size of 465 x 64. CG1 is smoothed with an 11 x 11 averaging filter to obtain the low-resolution cochleagram CG3, with zero padding where the window exceeds the given cochleagram; repeating the CG3 procedure with the window length changed to 23 x 23 gives the low-resolution cochleagram CG4. The feature sizes of CG3 and CG4 are likewise 465 x 64. CG1, CG2, CG3 and CG4 are fused at the feature level along the time direction and then combined with their first- and second-order differences to obtain the MRCG, whose feature map is shown in fig. 4, where (a) is CG1, (b) is CG2, (c) is CG3 and (d) is CG4. A code sketch of this pipeline follows.
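The MRCG pipeline just described can be sketched as follows. This is an illustrative approximation assuming NumPy/SciPy: the 64-channel 4th-order Gammatone filter bank is built directly from its impulse response with ERB-spaced centre frequencies, CG2 is trimmed to align with CG1, and the exact normalization and delta computation may differ from the embodiment.

```python
import numpy as np
from scipy.signal import fftconvolve
from scipy.ndimage import uniform_filter


def gammatone_cochleagram(x, fs=8000, n_chan=64, frame_len=160, frame_shift=80):
    """Log-energy cochleagram: 64-channel 4th-order Gammatone analysis followed
    by frame-wise energy (frame_len samples per frame, 10 ms shift)."""
    erb = lambda f: 24.7 * (4.37 * f / 1000.0 + 1.0)         # ERB bandwidth
    # ERB-rate-spaced centre frequencies between 50 Hz and Nyquist - 100 Hz
    e_lo = 21.4 * np.log10(4.37 * 50.0 / 1000.0 + 1.0)
    e_hi = 21.4 * np.log10(4.37 * (fs / 2.0 - 100.0) / 1000.0 + 1.0)
    cfs = (10.0 ** (np.linspace(e_lo, e_hi, n_chan) / 21.4) - 1.0) * 1000.0 / 4.37
    t = np.arange(int(0.128 * fs)) / fs                       # 128 ms impulse response
    n_frames = (len(x) - frame_len) // frame_shift + 1
    cg = np.zeros((n_frames, n_chan))
    for c, fc in enumerate(cfs):
        g = t ** 3 * np.exp(-2 * np.pi * 1.019 * erb(fc) * t) * np.cos(2 * np.pi * fc * t)
        sub = fftconvolve(x, g, mode="full")[:len(x)]         # sub-band signal
        for m in range(n_frames):
            seg = sub[m * frame_shift: m * frame_shift + frame_len]
            cg[m, c] = np.log(np.sum(seg ** 2) + 1e-10)       # log frame energy
    return cg


def mrcg(x, fs=8000):
    cg1 = gammatone_cochleagram(x, fs, frame_len=160)         # 20 ms frames, local detail
    cg2 = gammatone_cochleagram(x, fs, frame_len=1600)        # 200 ms frames, global context
    n = min(len(cg1), len(cg2))                               # align frame counts
    cg1, cg2 = cg1[:n], cg2[:n]
    cg3 = uniform_filter(cg1, size=11, mode="constant")       # 11 x 11 averaging, zero-padded
    cg4 = uniform_filter(cg1, size=23, mode="constant")       # 23 x 23 averaging, zero-padded
    feat = np.concatenate([cg1, cg2, cg3, cg4], axis=1)       # fuse the four cochleagrams per frame
    d1 = np.diff(feat, axis=0, prepend=feat[:1])              # first-order delta (simplified)
    d2 = np.diff(d1, axis=0, prepend=d1[:1])                  # second-order delta (simplified)
    return np.concatenate([feat, d1, d2], axis=1)
```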
Comparing fig. 3 and fig. 4, CG1 in the MRCG contains a large number of the features present in MEL-frequency and adds the global features of CG2, CG3 and CG4 on top of them, so the speech features extracted by MRCG are richer and more robust than those extracted by MEL-frequency.
(II) selection and setting of experiment parameters
The experimental speech data come from the TIMIT corpus and the noise from the NoiseX-92 noise library. In the experiments of this embodiment, 150 utterances in wav format (300 s in total) were selected and down-sampled to 8 kHz as the clean speech signals. The NoiseX-92 library contains 15 noise files with a sampling rate of 19.98 kHz, each lasting 235 seconds; 4 noise types were selected for the experiments, namely white, babble, pink and factory, of which white is stationary and the rest are non-stationary. The noise was likewise down-sampled to 8 kHz and added to the clean speech to generate noisy speech signals. The selected NoiseX-92 noise was truncated to the duration of the TIMIT speech, and the speech and noise signals were then mixed to construct 600 noisy utterances (1200 s in total), of which 420 were used as training data and the remaining 180 as test data.
During the experiments, the signal-to-noise ratios for mixing the clean speech and the noise were set to -5 dB, 0 dB, 5 dB and 10 dB. Neural-network parameter settings: MSE is selected as the loss function and the number of iterations is 30; a stochastic gradient descent algorithm is used to train the network, the number of hidden-layer nodes is set to 1024, and the dropout rate is 0.2. Since the ReLU helps the deep neural network converge, the ReLU is selected as the activation function; the input-layer speech features are set to 3 frames and the output-layer speech features to 1 frame. In the following experiments, the speech and noise signals are selected according to the above criteria. A sketch of the training setup under these parameters follows.
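Under the parameter settings just listed, the training loop could be sketched as follows; this is a hedged illustration that assumes PyTorch, reuses the SkipDNN class from the earlier sketch, and uses random placeholder tensors (features, targets) and an assumed mini-batch size and learning rate in place of real data loading:

```python
import torch
import torch.nn as nn

mrcg_dim = 768        # assumed MRCG dimension per frame (4 x 64 channels plus deltas)
irm_dim = 64          # assumed IRM target: one value per Gammatone channel
features = torch.rand(1000, 3 * mrcg_dim)   # 3-frame input context (placeholder data)
targets = torch.rand(1000, irm_dim)         # 1-frame training target (placeholder data)

model = SkipDNN(in_dim=3 * mrcg_dim, out_dim=irm_dim, hidden_dim=1024, dropout=0.2)
criterion = nn.MSELoss()                                    # MSE loss, as in the embodiment
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)    # stochastic gradient descent

for epoch in range(30):                                     # 30 iterations
    for i in range(0, len(features), 128):                  # mini-batches (size assumed)
        batch_x = features[i:i + 128]
        batch_y = targets[i:i + 128]
        optimizer.zero_grad()
        loss = criterion(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()
```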
In the experiments of this embodiment, white, babble, pink and factory noise are respectively used as background noise and added to the clean speech signal. Parts (1), (2) and (3) analyze the speech enhancement effects of different models, different feature parameters and model optimization, where the evaluation results are obtained by summing the evaluation scores over the different noise environments and averaging; parts (4) and (5) analyze the enhancement effects of noisy speech under different signal-to-noise ratios, different noise environments and different models, where the evaluation results are the scores for each noise environment.
(III) results and analysis of the experiments
(1) Performance comparison study of DNN and Skip-DNN models
To compare the speech enhancement effect of the Skip-DNN model of this embodiment with that of the conventional DNN model, MRCG is adopted as the network input feature; the experimental environment and parameter settings are the same for the different models, and the average PESQ and STOI evaluation results of speech enhancement at the different signal-to-noise ratios are shown in fig. 5.
As can be seen from fig. 5(a), at SNRs of -5 dB, 0 dB, 5 dB and 10 dB the average PESQ values obtained with the DNN speech enhancement model are 1.15397, 1.29828, 1.56400 and 2.00583, respectively, while the Skip-DNN speech enhancement model gives PESQ values of 1.16145, 1.31933, 1.60492 and 2.09808; at -5 dB the PESQ of Skip-DNN + MRCG is about 0.6% higher than that of DNN + MRCG. As can be seen from fig. 5(b), at SNRs of -5 dB, 0 dB, 5 dB and 10 dB the average STOI values with the DNN speech enhancement model are 0.69780, 0.78993, 0.86395 and 0.91973, respectively, and with the Skip-DNN speech enhancement model 0.70843, 0.79408, 0.87060 and 0.92910; at -5 dB the STOI of Skip-DNN + MRCG is about 2% higher than that of DNN + MRCG.
Therefore, under the four signal-to-noise-ratio environments, the objective evaluation scores of the speech enhanced by Skip-DNN + MRCG are higher than those of DNN + MRCG; that is, when MRCG is used as the input feature, the Skip-DNN model proposed in this embodiment improves speech intelligibility more than the DNN model does.
(2) Influence of feature extraction method on Skip-DNN model performance
To compare and analyze the influence of MEL-frequency and MRCG as feature extraction modes on the enhancement effect of the network model, and since part (1) showed the Skip-DNN algorithm to be better than DNN, the Skip-DNN model is adopted in this experiment to study the enhancement effect under the different feature extraction modes. The experimental environment and parameter settings are the same, and the average PESQ, STOI and SegSNR evaluation results of speech enhancement under the different signal-to-noise-ratio environments are shown in fig. 6.
As can be seen from fig. 6(a), at SNRs of -5 dB, 0 dB, 5 dB and 10 dB, the Skip-DNN speech enhancement model with MEL-frequency as input gives average PESQ values of 1.06288, 1.13223, 1.20358 and 1.55903, respectively, while with MRCG as input the PESQ values are 1.16145, 1.31933, 1.60492 and 2.09808; at -5 dB the PESQ with MRCG input is 9% higher than with MEL-frequency. As can be seen from fig. 6(b), at SNRs of -5 dB, 0 dB, 5 dB and 10 dB, with MEL-frequency as input the average STOI values are 0.55470, 0.66173, 0.76218 and 0.84300, respectively, while with MRCG as input the STOI values are 0.70843, 0.79408, 0.87630 and 0.92910; at -5 dB the STOI with MRCG is 27% higher than with MEL-frequency. As can be seen from fig. 6(c), at SNRs of -5 dB, 0 dB, 5 dB and 10 dB, with MEL-frequency as input the average SegSNR values are -4.08, -0.85, 5.24 and 8.93, respectively, and with MRCG as input the SegSNR values are 1.14, 2.23, 6.55 and 9.49.
Therefore, under the four signal-to-noise-ratio environments, the objective evaluation scores of the speech enhanced with MRCG as the input feature are higher than those obtained with MEL-frequency, so MRCG benefits the performance of the Skip-DNN model more than MEL-frequency does.
(3) Effect of Dropout layer on Skip-DNN network model Speech enhancement Effect
From (1) and (2), it can be seen that MRCG has the best effect on Skip-DNN speech enhancement as a characteristic input. In order to analyze the influence of the Dropout layer introduction on the Skip-DNN model performance, the experimental environment and the parameter setting are the same, and the average PESQ and STOI evaluation experimental results of the speech enhancement under different signal-to-noise ratio environments are shown in FIG. 7.
As can be seen from fig. 7(a), at SNRs of -5 dB, 0 dB, 5 dB and 10 dB, the Skip-DNN speech enhancement model without the Dropout layer gives average PESQ values of 1.16145, 1.31933, 1.60492 and 2.09808, respectively, while the Skip-DNN model with the Dropout layer gives PESQ values of 1.18175, 1.3434, 1.63258 and 2.15255; at -5 dB the PESQ with the Dropout layer is 2% higher than without it. As can be seen from fig. 7(b), at SNRs of -5 dB, 0 dB, 5 dB and 10 dB, the Skip-DNN model without the Dropout layer gives average STOI values of 0.70843, 0.79408, 0.87060 and 0.92910, respectively, and the Skip-DNN model with the Dropout layer gives STOI values of 0.71820, 0.80595, 0.87630 and 0.93073; at -5 dB the STOI with the Dropout layer is 1% higher than without it.
Therefore, under four signal-to-noise ratio environments, the objective evaluation scores of the Skip-DNN model speech with the Dropout layer are higher than those without the Dropout layer, and the speech enhancement effect of adding the Dropout layer into the Skip-DNN speech enhancement model is better than that of adding no Dropout layer.
In order to further determine the influence of adding a Dropout layer on the MSE value in the network model training process, a curve of the relationship between MSE and iteration times is drawn, as shown in fig. 8. Fig. 8(a) shows the MSE value of the network model to which the Dropout layer network model is added, and fig. 8(b) shows the MSE value of the network model to which the Dropout layer is not added.
As can be seen from fig. 8(a), the MSE of the network model with the Dropout layer decreases faster and is smooth than that of the network model without the Dropout layer, and the MSE of the verification set is always smaller than that of the training set. As can be seen from fig. 8(b), the MSE of the training set is still in a descending trend after 27 steps in the Skip-DNN model without the Dropout layer, while the MSE of the verification set is no longer descending and almost coincides with the training set, so that it can be concluded that the Skip-DNN model with the Dropout layer can prevent the over-fitting phenomenon from occurring.
(4) Evaluation of voice enhancement effect of Skip-DNN network model in low signal-to-noise ratio and different noise environments
To analyze the enhancement effect of the Skip-DNN speech enhancement model in different noise environments, white, factory, pink and babble noise were selected at the different signal-to-noise ratios; the average PESQ and STOI evaluations of speech enhancement in the different noise environments are shown in fig. 9.
FIG. 9 evaluation of Skip-DNN speech enhancement effect at four signal-to-noise ratios
As can be seen from fig. 9(a), when the noise is white, the Skip-DNN speech enhancement model with the Dropout layer gives average PESQ values of 1.2190, 1.4064, 1.7323 and 2.2226 at -5 dB, 0 dB, 5 dB and 10 dB, respectively; when the noise is factory, the PESQ values are 1.1503, 1.2896, 1.5276 and 2.0579; when the noise is pink, the PESQ values are 1.1869, 1.3400, 1.6596 and 2.1870; when the noise is babble, the PESQ values are 1.1708, 1.3376, 1.6108 and 2.1427. As can be seen from fig. 9(b), when the noise is white, the Skip-DNN model with the Dropout layer gives average STOI values of 0.7657, 0.8366, 0.8923 and 0.9369 at -5 dB, 0 dB, 5 dB and 10 dB, respectively; when the noise is factory, the STOI values are 0.6957, 0.7868, 0.8647 and 0.9252; when the noise is pink, the STOI values are 0.7274, 0.8154, 0.8841 and 0.9360; when the noise is babble, the STOI values are 0.6840, 0.7850, 0.8641 and 0.9248.
It follows that with the Skip-DNN speech enhancement model, the enhancement of noisy speech is best in the white-noise environment, followed in order by pink noise, babble noise and factory noise; as the signal-to-noise ratio increases, the differences between the enhancement effects gradually decrease, but the effect in the white-noise environment always remains the best.
(5) Evaluation of voice enhancement effect of different network models in different noise environments
To analyze the influence of different noises on the enhancement effect of different models, the 4 network models were trained at a signal-to-noise ratio of -5 dB; the noisy speech in the test stage is speech in the white, factory, pink and babble noise environments, and its enhancement results are shown in Tables 1 and 2.
TABLE 1-5dB mean PESQ values for speech enhancement for different models for different noise environments
(Table 1 is provided as an image in the original publication and is not reproduced here.)
TABLE 2-5dB mean STOI values for speech enhancement for different models for different noise environments
(Table 2 is provided as an image in the original publication and is not reproduced here.)
As can be seen from Tables 1 and 2, at a signal-to-noise ratio of -5 dB all network models achieve their best noisy-speech enhancement in the white-noise environment; except for the DNN speech enhancement model with MRCG features, where the average PESQ and STOI values obtained for babble noise are smaller than for factory noise, the enhancement effects obtained by the other network models follow the order pink noise, babble noise and factory noise.
The feature input of this embodiment uses MRCG together with a Skip-DNN speech enhancement model optimized by adding a Dropout layer, which carries more speech information, prevents the gradient from vanishing, and contains both global and local speech features. The experimental study shows that with MRCG as the feature input, the Skip-DNN speech enhancement model has a better enhancement effect than the DNN model, and the objective evaluation shows an intelligibility improvement of 2%; with Skip-DNN as the enhancement model, using MRCG as the feature input improves speech quality by 9% over Mel-frequency and improves intelligibility considerably; and the Skip-DNN speech enhancement model with the Dropout layer has a better enhancement effect, with smoother MSE values for the training and validation sets, so the occurrence of overfitting is better prevented.
In summary, compared with other models, the Skip-DNN speech enhancement model with MRCG input features proposed in this embodiment can obtain a speech enhancement effect with higher speech quality and speech intelligibility, and the enhancement of noisy speech signals is particularly evident in low signal-to-noise-ratio environments.
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that features described in different dependent claims and herein may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with individual embodiments may be used in other described embodiments.

Claims (5)

1. A speech enhancement method of a jump connection deep neural network, the method comprising:
s1, extracting time-frequency domain characteristics according to the time-domain voice signals;
s2, determining a training target, and sending the training target and the extracted time-frequency domain characteristics into a Skip-DNN model for training to obtain a Skip-DNN voice enhancement model;
the Skip-DNN model comprises an input layer, hidden layers organized into four modules, and an output layer, connected by skip connections; the first module comprises two hidden layers, and the number of nodes of the second hidden layer is the same as the number of nodes of the input layer;
the second module and the third module have the same structure as the first module, the fourth module only has one hidden layer, and the number of nodes of the hidden layer is the same as that of the nodes of the input layer;
s3, extracting the characteristics of the voice with noise, inputting the characteristics into a Skip-DNN voice enhancement model, and estimating a target voice;
and S4, synthesizing the target voice and the voice with the noise to obtain an enhanced pure voice signal.
2. The speech enhancement method of claim 1, wherein each of the first to third modules further comprises a Dropout layer, the Dropout layer being disposed between the two hidden layers of the first module, and the Dropout layer sets a proportion of the hidden-layer node values to 0 during forward propagation of the Skip-DNN.
3. The speech enhancement method of claim 1 or 2, wherein the features extracted in S1 and S3 are cochlear maps.
4. The speech enhancement method of claim 2, wherein the first hidden layer of the first module obtains an estimate a^(1):

a^(1) = max(W^(1) y + b^(1), 0)

where n denotes the number of nodes of the first hidden layer, m denotes the number of input nodes, max(x, 0) denotes the nonlinear activation function ReLU, y = [y1, ..., ym]^T represents the input m-dimensional noisy speech signal, W^(1) is the weight of the first hidden layer, and b^(1) = [b1, ..., bn]^T is the bias of the first hidden layer;

the second hidden layer of the first module obtains an estimate a^(2):

a^(2) = max(W^(2) a^(1) + b^(2), 0)

where W^(2) is the weight of the second hidden layer and b^(2) = [b1, ..., bm]^T is the bias of the second hidden layer;

the first module outputs the estimated value:

a_out = max(a^(2) + y, 0).
5. The speech enhancement method of claim 2, wherein in S2 the training target is the time-frequency mask IRM, IRM ∈ [0, 1], which represents the proportion of clean-speech energy in the mixed-speech energy, and the IRM is:

IRM(t, f) = ( |S(t, f)|² / ( |S(t, f)|² + |N(t, f)|² ) )^β

where |S(t, f)|² represents the energy of the clean speech in the time-frequency unit, |N(t, f)|² represents the energy of the noise, β is a scale factor, t represents the time-frame index, and f represents the frequency index.
CN202010012435.3A 2020-01-07 2020-01-07 Voice enhancement method for jump connection deep neural network Pending CN111192598A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010012435.3A CN111192598A (en) 2020-01-07 2020-01-07 Voice enhancement method for jump connection deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010012435.3A CN111192598A (en) 2020-01-07 2020-01-07 Voice enhancement method for jump connection deep neural network

Publications (1)

Publication Number Publication Date
CN111192598A true CN111192598A (en) 2020-05-22

Family

ID=70710661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010012435.3A Pending CN111192598A (en) 2020-01-07 2020-01-07 Voice enhancement method for jump connection deep neural network

Country Status (1)

Country Link
CN (1) CN111192598A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111653285A (en) * 2020-06-01 2020-09-11 北京猿力未来科技有限公司 Packet loss compensation method and device
CN111653287A (en) * 2020-06-04 2020-09-11 重庆邮电大学 Single-channel speech enhancement algorithm based on DNN and in-band cross-correlation coefficient
CN111785288A (en) * 2020-06-30 2020-10-16 北京嘀嘀无限科技发展有限公司 Voice enhancement method, device, equipment and storage medium
CN111899757A (en) * 2020-09-29 2020-11-06 南京蕴智科技有限公司 Single-channel voice separation method and system for target speaker extraction
CN111899750A (en) * 2020-07-29 2020-11-06 哈尔滨理工大学 Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN112669870A (en) * 2020-12-24 2021-04-16 北京声智科技有限公司 Training method and device of speech enhancement model and electronic equipment
CN113096679A (en) * 2021-04-02 2021-07-09 北京字节跳动网络技术有限公司 Audio data processing method and device
CN113921023A (en) * 2021-12-14 2022-01-11 北京百瑞互联技术有限公司 Bluetooth audio squeal suppression method, device, medium and Bluetooth equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN108960077A (en) * 2018-06-12 2018-12-07 南京航空航天大学 A kind of intelligent failure diagnosis method based on Recognition with Recurrent Neural Network
CN109065072A (en) * 2018-09-30 2018-12-21 中国科学院声学研究所 A kind of speech quality objective assessment method based on deep neural network
CN110060704A (en) * 2019-03-26 2019-07-26 天津大学 A kind of sound enhancement method of improved multiple target criterion study

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN108960077A (en) * 2018-06-12 2018-12-07 南京航空航天大学 A kind of intelligent failure diagnosis method based on Recognition with Recurrent Neural Network
CN109065072A (en) * 2018-09-30 2018-12-21 中国科学院声学研究所 A kind of speech quality objective assessment method based on deep neural network
CN110060704A (en) * 2019-03-26 2019-07-26 天津大学 A kind of sound enhancement method of improved multiple target criterion study

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
JITONG CHEN: ""A features study for classification-based speech separation at very low signal-to-noise ratio"", 《ICASSP》 *
KAIMING HE: ""Deep Residual Learning for Image Recognition"", 《2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
LI R: ""Multi-resolution auditory cepstral coefficient and adaptive mask for speech enhancement with deep neural network"", 《EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING》 *
MASHIANA H S: ""Speech Enhancement Using Residual Convolutional Neural Network"", 《ICSSIT》 *
MING T: ""Speech enhancement based on Deep Neural Networks with skip connections"", 《ICASSP》 *
WANG D: ""An Ideal Wiener Filter Correction-based cIRM Speech Enhancement Method Using Deep Neural Networks with Skip Connections"", 《ICSP》 *
WANG Y: ""On Training Targets for Supervised Speech Separation"", 《IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111653285A (en) * 2020-06-01 2020-09-11 北京猿力未来科技有限公司 Packet loss compensation method and device
CN111653285B (en) * 2020-06-01 2023-06-30 北京猿力未来科技有限公司 Packet loss compensation method and device
CN111653287A (en) * 2020-06-04 2020-09-11 重庆邮电大学 Single-channel speech enhancement algorithm based on DNN and in-band cross-correlation coefficient
CN111785288B (en) * 2020-06-30 2022-03-15 北京嘀嘀无限科技发展有限公司 Voice enhancement method, device, equipment and storage medium
CN111785288A (en) * 2020-06-30 2020-10-16 北京嘀嘀无限科技发展有限公司 Voice enhancement method, device, equipment and storage medium
CN111899750A (en) * 2020-07-29 2020-11-06 哈尔滨理工大学 Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN111899750B (en) * 2020-07-29 2022-06-14 哈尔滨理工大学 Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN111899757A (en) * 2020-09-29 2020-11-06 南京蕴智科技有限公司 Single-channel voice separation method and system for target speaker extraction
CN111899757B (en) * 2020-09-29 2021-01-12 南京蕴智科技有限公司 Single-channel voice separation method and system for target speaker extraction
CN112669870A (en) * 2020-12-24 2021-04-16 北京声智科技有限公司 Training method and device of speech enhancement model and electronic equipment
CN112669870B (en) * 2020-12-24 2024-05-03 北京声智科技有限公司 Training method and device for voice enhancement model and electronic equipment
CN113096679A (en) * 2021-04-02 2021-07-09 北京字节跳动网络技术有限公司 Audio data processing method and device
CN113921023A (en) * 2021-12-14 2022-01-11 北京百瑞互联技术有限公司 Bluetooth audio squeal suppression method, device, medium and Bluetooth equipment
CN113921023B (en) * 2021-12-14 2022-04-08 北京百瑞互联技术有限公司 Bluetooth audio squeal suppression method, device, medium and Bluetooth equipment

Similar Documents

Publication Publication Date Title
CN111192598A (en) Voice enhancement method for jump connection deep neural network
WO2021042870A1 (en) Speech processing method and apparatus, electronic device, and computer-readable storage medium
US10504539B2 (en) Voice activity detection systems and methods
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN108922513B (en) Voice distinguishing method and device, computer equipment and storage medium
CN110428849B (en) Voice enhancement method based on generation countermeasure network
CN111292762A (en) Single-channel voice separation method based on deep learning
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
CN110136709A (en) Audio recognition method and video conferencing system based on speech recognition
CN109036470B (en) Voice distinguishing method, device, computer equipment and storage medium
CN111899750B (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN106024010A (en) Speech signal dynamic characteristic extraction method based on formant curves
CN113744749B (en) Speech enhancement method and system based on psychoacoustic domain weighting loss function
CN112382301B (en) Noise-containing voice gender identification method and system based on lightweight neural network
CN110867192A (en) Speech enhancement method based on gated cyclic coding and decoding network
CN112885375A (en) Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
Zhu et al. FLGCNN: A novel fully convolutional neural network for end-to-end monaural speech enhancement with utterance-based objective functions
Li et al. A multi-objective learning speech enhancement algorithm based on IRM post-processing with joint estimation of SCNN and TCNN
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
CN117059068A (en) Speech processing method, device, storage medium and computer equipment
CN111916060B (en) Deep learning voice endpoint detection method and system based on spectral subtraction
CN111739562A (en) Voice activity detection method based on data selectivity and Gaussian mixture model
Li et al. A mapping model of spectral tilt in normal-to-Lombard speech conversion for intelligibility enhancement
CN111091847A (en) Deep clustering voice separation method based on improvement
Sivapatham et al. Gammatone Filter Bank-Deep Neural Network-based Monaural speech enhancement for unseen conditions

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination