CN114242099A - Speech enhancement algorithm based on improved phase spectrum compensation and full convolution neural network - Google Patents
- Publication number
- CN114242099A (application number CN202111534489.7A)
- Authority
- CN
- China
- Prior art keywords
- spectrum
- voice
- speech
- phase
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L21/0208 — Noise filtering
- G10L21/0216 — Noise filtering characterised by the method used for estimating noise
- G10L21/0232 — Processing in the frequency domain
- G10L25/30 — Speech or voice analysis techniques characterised by the use of neural networks
Abstract
The invention provides a voice enhancement algorithm based on improved phase spectrum compensation and a full convolution neural network. Training-set voice data are preprocessed to obtain characteristic data of the noisy speech and the clean speech. Combining these characteristic data, the frame signal-to-noise ratio is introduced to optimize the compensation-factor formula of the phase compensation algorithm, and the phase compensation factor is calculated with the improved formula. A full convolution neural network model is built, the phase compensation factor and the logarithmic power spectrum of the clean speech are used together as the training target of the network, and the network model is trained. Test speech is input into the trained model to obtain estimates of the logarithmic power spectrum and the phase compensation function, from which the amplitude spectrum and the phase spectrum of the speech signal are respectively reconstructed to obtain the final enhanced speech. The invention improves the noise-elimination capability of the algorithm while better preserving speech intelligibility, thereby improving the overall effect of speech enhancement.
Description
Technical Field
The invention relates to a voice enhancement method, in particular to a voice enhancement algorithm based on improved phase spectrum compensation and a full convolution neural network, and belongs to the technical field of voice signal processing.
Background
It is known that voice is an important means of communicating information between people, but speech communication is always disturbed by various noises. Noisy speech not only increases listener fatigue and degrades speech communication quality, but also degrades the performance of speech processing systems based on feature parameter extraction. Therefore, to reduce the effect of background noise on speech quality, speech enhancement is required to suppress the background noise.
Phase spectrum compensation is an enhancement algorithm that enhances a speech signal using the phase spectrum information of the speech. Its basic idea is as follows: calculate the short-time amplitude spectrum of the noisy speech signal and the estimated short-time amplitude spectrum of the noise signal, compute a compensation term from the phase spectrum compensation function, and superpose this term on the spectrum of the noisy speech to obtain the compensated speech spectrum. When restoring the enhanced speech signal, the compensated spectrum is used to calculate the compensated phase, which is then combined with the amplitude spectrum of the noisy speech before the inverse discrete Fourier transform. The general form of the phase compensation function is
Λ(l,k) = λ · Ψ(k) · |D̂(l,k)|
where |D̂(l,k)| is the estimate of the noise amplitude spectrum, Ψ(k) is an antisymmetric function, and λ is the compensation factor; the conventional phase compensation factor λ is an empirical value, typically taken to be 3.74.
Phase compensation has the advantages of a small computational load, easy implementation, and a good enhancement effect. However, when current speech enhancement algorithms process non-stationary noise, the noise energy is uncertain, and a fixed compensation factor cannot be dynamically adjusted to a suitable value as the noise changes. Fixed parameters cannot fully exploit the phase compensation algorithm, nor can they be combined with a DNN to correct the speech phase spectrum. Therefore, the key points are to optimize the compensation function with fixed parameters, introduce supervised parameter learning, balance enhanced-speech distortion against the denoising effect, and improve the phase compensation algorithm so as to exploit its advantages fully.
Compared with traditional methods, speech enhancement based on deep neural networks markedly improves performance under non-stationary noise conditions and has been a research hotspot in the speech enhancement field in recent years. Existing work has mainly addressed the design of training features and training targets and improvements to the network architecture. According to how the training features and targets are designed, deep-neural-network speech enhancement methods can be divided into two types: time domain and frequency domain. Frequency-domain speech enhancement generally adopts the magnitude spectrum or the logarithmic power spectrum of the noisy speech as the training feature and, besides the magnitude spectrum and the logarithmic power spectrum, can use a magnitude-spectrum mask as the training target. Time-domain speech enhancement generally employs the time-domain waveforms of the noisy speech and the clean speech as the training feature and training target, respectively. However, the existing solutions still have many problems. Recent research shows that enhancing only the phase spectrum of the speech, while keeping the amplitude spectrum of the noisy speech unchanged, can effectively improve speech quality. Although time-domain speech enhancement directly uses the waveforms of noisy and clean speech as training features and targets, its performance depends heavily on the design of the loss function, and a complex loss function greatly increases the training difficulty. Conversely, adopting a simple time-domain minimum mean square error function as the loss function consumes a large amount of time in parameter tuning, easily causes speech distortion, affects the intelligibility of the speech signal, damages the speech signal, and may even reduce the signal-to-noise ratio.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a speech enhancement algorithm based on improved phase spectrum compensation and a full convolution neural network, which overcomes the defects of the prior art, adopts the logarithmic power spectrum of the noisy speech as the training characteristic, utilizes the improved phase spectrum compensation algorithm to carry out phase spectrum estimation on the noisy signal, obtains a phase spectrum compensation factor as one of the training targets of the network, and uses the logarithmic power spectrum of the pure speech signal as a common training target by matching with the unique loss function design. Considering that the training features have correlation in both time and frequency, the present invention uses convolutional neural networks to obtain better training effect, so as to enhance the speech signal.
The invention provides a voice enhancement algorithm based on improved phase spectrum compensation and a full convolution neural network, which comprises the following steps:
step 1, preprocessing the voice data of a training set to obtain characteristic data of noise-containing voice and pure voice;
step 2, combining the calculated characteristic data of the voice with noise and the pure voice, introducing a compensation factor calculation formula of a frame signal-to-noise ratio optimization phase compensation algorithm, and calculating a phase compensation factor by using the improved formula;
step 3, building a full convolution neural network model, using the solved phase compensation factor and the log power spectrum of the pure voice as a training target of a Full Convolution Neural Network (FCNN), and training the full convolution neural network model;
step 4, inputting the test voice into the trained model to obtain an estimated value of a logarithmic power spectrum and a phase compensation function;
and 5, respectively reconstructing the amplitude spectrum and the phase spectrum of the voice signal by using the estimated value of the logarithmic power spectrum and the phase compensation function obtained in the previous step to obtain the final enhanced voice.
The further optimized technical scheme of the invention is as follows:
in the step 1, the voice with noise and the pure voice are preprocessed, and the characteristic parameters of the voice with noise and the pure voice are extracted, specifically, the operations are as follows:
step 1-1, let y(n) denote the noisy speech signal, with y(n) = d(n) + x(n), where d(n) is the noise signal and x(n) is the clean signal; x(n) and d(n) are assumed statistically independent with zero mean. Transform the noisy speech into the frequency domain by windowing M samples of y(n) with window function w(n) and performing an M-point FFT (fast Fourier transform), obtaining the spectrum Y(l,k) of the noisy speech, where l is the frame index and k is the frequency index, k = 0, 1, 2, …, M−1. The spectrum X(l,k) of the clean speech x(n) and the spectrum D(l,k) of the noise signal d(n) are obtained in the same way;
step 1-2, representing the spectrum Y(l,k) of the noisy speech in polar coordinates, it can be divided into a magnitude spectrum and a phase spectrum, namely
Y(l,k) = |Y(l,k)| · e^{j∠Y(l,k)}
where |Y(l,k)| is the short-time amplitude spectrum of the noisy speech y(n) and ∠Y(l,k) is its phase spectrum; similarly, the short-time amplitude spectrum |X(l,k)| of the clean speech x(n) and the short-time amplitude spectrum |D(l,k)| of the noise signal d(n) can be obtained;
step 1-3, calculating the log power spectrum S(n) of the noisy speech and the log power spectrum T(n) of the clean speech using the following formulas,
S(n) = [log_e(|Y(n,1)|²), log_e(|Y(n,2)|²), …, log_e(|Y(n,k)|²), …, log_e(|Y(n,M−1)|²)]
T(n) = [log_e(|X(n,1)|²), log_e(|X(n,2)|²), …, log_e(|X(n,k)|²), …, log_e(|X(n,M−1)|²)]
where |Y(n,k)|² is the power of the k-th frequency band of the n-th frame of the noisy speech obtained by the short-time Fourier transform, and |X(n,k)|² is the corresponding power for the clean speech, for k = 1, 2, …, M−1.
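As an illustration of steps 1-1 through 1-3, the log-power-spectrum feature can be sketched as follows; the frame length M = 256, the hop size, the Hamming window, and the synthetic test signal are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def log_power_spectrum(signal, M=256, hop=128):
    """Frame the signal, apply a window w(n), take an M-point FFT,
    and return the log power spectrum of each frame (step 1-3)."""
    w = np.hamming(M)                      # window function w(n); an assumption
    n_frames = 1 + (len(signal) - M) // hop
    frames = np.stack([signal[i*hop:i*hop+M] * w for i in range(n_frames)])
    Y = np.fft.fft(frames, n=M, axis=1)    # spectrum Y(l, k), k = 0..M-1
    # S(n) = [log_e |Y(n,k)|^2 for each band k]; a small floor avoids log(0)
    return np.log(np.abs(Y) ** 2 + 1e-12)

# noisy speech y(n) = x(n) + d(n): a clean tone plus white noise (synthetic)
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(4000) / 16000)
y = x + 0.1 * rng.standard_normal(4000)
S = log_power_spectrum(y)
print(S.shape)   # one log-power vector of length M per frame
```

The same routine applied to x(n) yields T(n), the clean-speech target of step 1-3.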
In step 2, the phase compensation function is optimized, specifically, the operations are as follows:
step 2-1, the conventional phase compensation function is expressed as
Λ(l,k) = λ · Ψ(k) · |D̂(l,k)|
where Λ(l,k) is the phase compensation function and |D̂(l,k)| is the noise estimate, for which the amplitude spectrum |Y(l,k)| of the noisy speech can generally be substituted; λ is an empirical value, and Ψ(k) is an antisymmetric function used for phase correction, given by
Ψ(k) = +1 for 0 < k < M/2, and Ψ(k) = −1 for M/2 < k < M.
The action of Λ(l,k) is shown by the following formula,
Y_Λ(l,k) = Y(l,k) + Λ(l,k)
where Y_Λ(l,k) is the compensated spectrum. The phase spectrum ∠Y_Λ(l,k) is obtained by extracting the phase of the compensated spectrum:
∠Y_Λ(l,k) = arg(Y_Λ(l,k))
∠Y_Λ(l,k) is then combined with the magnitude spectrum |Y(l,k)| of the noisy speech to obtain the spectral expression of the enhanced speech:
S_Λ(l,k) = |Y(l,k)| · e^{j∠Y_Λ(l,k)}
where the exponential form of a complex number follows the Euler formula r·e^{jα}, with r the magnitude, α the phase, and j the imaginary unit;
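The traditional compensation of step 2-1 can be sketched numerically as follows; the frame length, the noise estimate (here simply |Y(l,k)|, as the text permits), and λ = 3.74 follow the description above, while the test data are arbitrary:

```python
import numpy as np

M = 8                                   # FFT length (illustrative)
lam = 3.74                              # conventional empirical compensation factor
# antisymmetric correction function Psi(k): +1 below M/2, -1 above
Psi = np.zeros(M)
Psi[1:M//2] = 1.0
Psi[M//2+1:] = -1.0

rng = np.random.default_rng(1)
Y = np.fft.fft(rng.standard_normal(M))  # spectrum of one noisy frame, Y(l, k)
D_hat = np.abs(Y)                       # noise magnitude estimate; the text notes
                                        # |Y(l,k)| may be substituted here

Lam = lam * Psi * D_hat                 # compensation term Lambda(l, k)
Y_comp = Y + Lam                        # compensated spectrum Y_Lambda(l, k)
phase = np.angle(Y_comp)                # compensated phase, arg(Y_Lambda(l, k))
# enhanced spectrum: noisy magnitude combined with compensated phase
S_enh = np.abs(Y) * np.exp(1j * phase)
print(np.round(phase, 3))
```

Note that the magnitude of S_enh is exactly that of the noisy spectrum; only the phase is altered, which is the defining property of phase spectrum compensation.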
step 2-2, introducing the frame signal-to-noise ratio to optimize the phase compensation function, in which the fixed factor λ is replaced by a function of the frame signal-to-noise ratio SNR_l, where SNR_l is the signal-to-noise ratio of the l-th frame and C is an empirical value, typically set to 2.7. Λ(l,k) then decreases as SNR_l increases: when the current frame is a speech frame, the influence of the phase compensation term Λ(l,k) on the noisy speech is reduced and more speech detail is preserved.
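One hedged reading of the frame signal-to-noise ratio SNR_l used in step 2-2 is the per-frame ratio of clean-speech power to noise power in dB (both signals are available at training time); the exact definition in the patent's formula images may differ, so this is an illustrative assumption only:

```python
import numpy as np

def frame_snr_db(X_mag, D_mag):
    """Per-frame SNR: ratio of clean-speech power to noise power
    in each frame l, expressed in dB (assumed definition)."""
    sig = np.sum(X_mag ** 2, axis=1)     # clean power per frame, sum_k |X(l,k)|^2
    noise = np.sum(D_mag ** 2, axis=1)   # noise power per frame, sum_k |D(l,k)|^2
    return 10.0 * np.log10(sig / noise)

rng = np.random.default_rng(2)
X = np.abs(rng.standard_normal((4, 16))) * 3.0   # |X(l,k)|, 4 frames (synthetic)
D = np.abs(rng.standard_normal((4, 16)))         # |D(l,k)|
snr = frame_snr_db(X, D)
print(snr.shape)  # one SNR value per frame l
```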
The step 2 further comprises: step 2-3, in order to simplify the training target of the subsequent neural network, the optimized phase compensation function is abbreviated as the following formula,
Λ(l,k)=Ψ(k)×Q(l,k)
the above equation is a product of a new compensation factor and a noise magnitude spectrum estimation value, and may also be referred to as a new phase compensation factor for short, and is also one of training targets of a subsequent neural network. Q (l, k) is a new phase compensation factor, an antisymmetric function is removed, and the Q (l, k) is used as one of training targets of a following neural network. The formula for Q (l, k) is expressed as,
. Q (l, k) is a total of N frames, Q (l, k) is M long two-dimensional data per frame in N frames (the symbols of all bands (l, k) are of this dimension). In actual operation, N frames of data with length of M in each frame are brought into operation, namely Q (l, k), with which Λ (l, k) can be solved, and spectrum S of enhanced voice can be obtained according to step 2-1Λ(l, k), and then performing ISTFT to obtain the time domain waveform of the enhanced voice.
In the step 3, a design of a training target and a loss function for integrating phase spectrum compensation is provided, and the specific method is as follows:
step 3-1, for frequency-domain speech enhancement, the logarithmic power spectrum of the noisy speech and the logarithmic power spectrum of the clean speech are adopted as the training feature and training target of the neural network, respectively. The network maps the noisy log power spectrum S_n to T̂_n, an approximation of the clean-speech log power spectrum T_n; T̂_n is then combined with the phase α_n of the n-th frame of the noisy speech to obtain the time-domain waveform of the n-th frame of the enhanced speech;
step 3-2, on the basis of frequency-domain speech enhancement, the phase compensation algorithm is fused in as a new training target, which comprises two parts: the logarithmic power spectrum T_n of the clean speech and the phase compensation factor Q_n, where Q_n denotes the n-th frame of Q(l,k), i.e. Q_n = [Q(n,0), Q(n,1), …, Q(n,M−1)]. From the estimated phase compensation factor Q̂_n the phase compensation function is obtained, and from it the enhanced phase of the n-th frame, which can replace the noisy phase α_n to achieve a better enhancement effect;
step 3-3, for the training target fusing the phase spectrum compensation and provided in the last step, defining the loss function of the neural network as
where MSE is the combined mean square error of the log-power-spectrum and phase-spectrum-compensation terms, which jointly drive the parameter updates; M is the frame length, T(n,k) is the log power spectrum of the n-th frame of clean speech, T̂(n,k) is the approximation of the clean-speech log power spectrum obtained by training the network on the log power spectrum of the n-th frame of noisy speech, and Q̂(n,k) is the approximation of the phase compensation factor obtained by the network; β is a constant modulation parameter that balances the influence of the log power spectrum and the phase compensation on the network;
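Assuming the combined mean square error of step 3-3 is a β-weighted sum of the squared errors on the two targets (the exact formula appears only as an image in the source, so this weighting is an assumption), a minimal sketch is:

```python
import numpy as np

def joint_mse(T, T_hat, Q, Q_hat, beta=0.5, M=None):
    """Combined per-frame MSE: log-power-spectrum error plus a
    beta-weighted phase-compensation-factor error (assumed form)."""
    M = M or T.shape[-1]                 # frame length
    return (np.sum((T - T_hat) ** 2) + beta * np.sum((Q - Q_hat) ** 2)) / M

T = np.array([1.0, 2.0, 3.0])            # clean log power spectrum (toy frame)
Q = np.array([0.5, -0.5, 0.0])           # phase compensation factor (toy frame)
loss = joint_mse(T, T + 0.1, Q, Q - 0.2, beta=0.5)
print(round(loss, 4))
```

With a perfect prediction the loss is zero; β tunes how strongly phase-compensation errors penalize the network relative to log-power-spectrum errors, matching the balancing role described above.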
3-4, building a full convolution neural network model which is mainly divided into an input layer, an encoder layer, a decoder layer and an output layer;
and 3-5, taking the logarithmic power spectrum of the phase compensation factor combined with the pure voice as a training target of the full convolution neural network, and training the full convolution neural network model to obtain the trained full convolution neural network model.
The neural network is trained with the conventional back-propagation algorithm, using the combined mean square error of step 3-3 as the loss function and the Adam algorithm for optimization; the initial learning rate is set to 0.05, the learning-rate drop period to 10 with a drop factor of 0.2, the mini-batch size (batch size) to 128, and the number of iterations (epochs) to 100. The trained model is obtained after this training.
In the step 3-4, the input layer is a matrix of 5 × 257, the length of the context representing the input features is 5 frames, each frame has 257 data points, and the input data is composed of the log power spectrum of the noisy speech of the adjacent 5 frames, so that the correlation of the input data in the two dimensions of time and frequency is increased.
The encoder layer consists of two convolutional layers and two dropout layers; each convolutional layer has three hyper-parameters: convolution kernel size, stride, and padding mode. The parameters of convolutional layer 1 are set as follows: kernel size 5; stride 1; 'same' padding; 32 convolution kernels; ReLU (Rectified Linear Unit) activation function. Adopting the ReLU activation function greatly reduces the computational load. After convolutional layer 1, the dropout ratio is set to 0.2. The parameters of convolutional layer 2 are set as follows: kernel size 7; 'same' padding; stride 1; 16 convolution kernels; ReLU activation; the dropout-layer parameter is again 0.2. The encoder first extracts shallow features of the noisy speech with convolutional layer 1, then extracts more abstract deep speech features with convolutional layer 2 for encoding.
The decoder layer is also composed of two convolutional layers and two dropout layers, and the network structure of the decoder layer is symmetrical to that of the encoder layer. The parameter settings of the convolution layer 3 are as follows: the convolution kernel size is 7; the filling mode is set to be a same mode; the moving step length is 1; the number of convolution kernels is 16; the activating function is a ReLU function; the parameters of the dropout layer are still set to 0.2. The convolutional layer 4 parameters are set as follows: the convolution kernel size is 5; the filling mode is a same mode; the moving step length is 1; the number of convolution kernels is 32; the activating function is a ReLU function; the parameters of the dropout layer are still set to 0.2. The decoder acts in reverse to the encoder to restore the features compressed by the encoder to produce clean speech features.
The output layer is a full connection layer, is directly connected with a decoder and is provided with 257 neurons, the activation function selects a linear function, and the output is 1 frame of voice characteristic data and 1 frame of phase spectrum compensation factors.
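The hyper-parameters of steps 3-4 can be collected into a framework-agnostic summary; the layer table below transcribes the stated values, and the check confirms the decoder mirrors the encoder as described:

```python
# Layer-by-layer summary of the FCNN described in steps 3-4
# (hyper-parameters transcribed from the text; this is a sketch, not a network).
fcnn = [
    {"layer": "input",   "shape": (5, 257)},  # 5 context frames x 257 points
    {"layer": "conv1",   "kernel": 5, "stride": 1, "pad": "same", "filters": 32, "act": "relu"},
    {"layer": "dropout", "rate": 0.2},
    {"layer": "conv2",   "kernel": 7, "stride": 1, "pad": "same", "filters": 16, "act": "relu"},
    {"layer": "dropout", "rate": 0.2},
    {"layer": "conv3",   "kernel": 7, "stride": 1, "pad": "same", "filters": 16, "act": "relu"},
    {"layer": "dropout", "rate": 0.2},
    {"layer": "conv4",   "kernel": 5, "stride": 1, "pad": "same", "filters": 32, "act": "relu"},
    {"layer": "dropout", "rate": 0.2},
    {"layer": "dense",   "units": 257, "act": "linear"},  # 1 frame of features + factors
]
# symmetry check: decoder (conv3, conv4) mirrors encoder (conv1, conv2)
enc = [(l["kernel"], l["filters"]) for l in fcnn if l["layer"] in ("conv1", "conv2")]
dec = [(l["kernel"], l["filters"]) for l in fcnn if l["layer"] in ("conv3", "conv4")]
print(enc, dec[::-1])
```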
In the step 4, the characteristic extraction is carried out on the voice with noise to be tested according to the method in the step 1, and the extracted characteristic parameters are input into the trained full convolution neural network model, so that the logarithmic power spectrum estimation value and the phase compensation factor of the voice signal can be obtained.
In the invention, the logarithmic power spectrum S of the noisy speech is input into the trained neural network model and propagated forward once, yielding the log-power-spectrum estimate and the phase compensation factor of the speech signal. Neural-network speech enhancement can be divided into a training part and an enhancement part. During training (which can be considered the laboratory environment), all the information of the noisy and clean speech is available and is used to train the model, i.e. the mapping S → (T, Q), which is the function the model implements. When enhancing speech (which can be considered the practical application), only the noisy speech information is needed, and the clean-speech information is estimated with the trained model obtained in the training stage; this step yields approximations (estimates) of the clean-speech information and of the compensation factor, from which the clean (enhanced) speech is restored. The estimates are substituted directly into the calculation of step 2-2.
In the step 5, the amplitude spectrum and the phase spectrum of the voice signal are reconstructed using the logarithmic power spectrum and the phase compensation factor (finally obtaining S_Λ(l,k) according to the method of step 2-1), and the final enhanced speech is obtained by the inverse short-time Fourier transform (ISTFT).
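The reconstruction of step 5 can be sketched as follows; the frame/hop lengths, the plain overlap-add synthesis, and the stand-in network outputs are illustrative assumptions:

```python
import numpy as np

def reconstruct(T_hat, phase, M=256, hop=128):
    """Rebuild a waveform from an estimated log power spectrum and a
    (compensated) phase spectrum via inverse FFT with overlap-add."""
    mag = np.exp(T_hat / 2.0)                 # |X| = exp(log|X|^2 / 2)
    spec = mag * np.exp(1j * phase)           # magnitude + phase -> complex spectrum
    frames = np.fft.ifft(spec, n=M, axis=1).real
    out = np.zeros(hop * (len(frames) - 1) + M)
    for i, f in enumerate(frames):            # overlap-add synthesis
        out[i * hop:i * hop + M] += f
    return out

rng = np.random.default_rng(3)
T_hat = rng.standard_normal((10, 256)) * 0.1  # stand-in for the network's estimate
phase = rng.uniform(-np.pi, np.pi, (10, 256)) # stand-in for the compensated phase
wav = reconstruct(T_hat, phase)
print(wav.shape)
```

In the algorithm itself, T_hat would be the network's log-power-spectrum estimate and the phase would come from the compensated spectrum S_Λ(l,k) of step 2-1.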
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
(1) aiming at the problem that the parameters of the phase spectrum compensation algorithm are fixed and cannot be dynamically adjusted, the invention introduces the frame signal-to-noise ratio and improves the phase compensation function, overcoming the defect that traditional phase compensation cannot follow changes in the noise; the improved function responds quickly to changes in the noise spectrum, is better suited to neural-network training and estimation, exploits the performance of the phase compensation algorithm to the maximum extent, and achieves a better enhancement effect;
(2) an improved phase compensation factor is introduced as one of training targets of the neural network, the defect that the phase is neglected in a traditional voice enhancement algorithm based on a frequency domain is overcome, and a better enhancement effect is achieved through simple and effective loss function design under the condition of the same data sample number.
In a word, the invention improves the noise elimination capability of the algorithm and simultaneously better ensures the speech intelligibility, thereby improving the overall effect of speech enhancement.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a block diagram of a full convolution neural network in accordance with the present invention.
Detailed Description
The technical scheme of the invention is explained in further detail below with reference to the drawings. The present embodiment is implemented on the premise of the technical solution of the invention; a detailed implementation and a specific operation process are given, but the scope of protection of the invention is not limited to the following embodiments.
The embodiment proposes a speech enhancement algorithm based on improved phase spectrum compensation and a full convolution neural network, as shown in fig. 1, which includes the following steps:
step 1, preprocessing the voice data of the training set to obtain the characteristic data of the voice with noise and the pure voice.
The speech used for training in the experiment was from the training set of TIMIT. 15 kinds of noise of NOISEX-92 were selected as the noise set. Under the conditions that signal-to-noise ratios are-5 dB, 0dB and 5dB, 100 pure voices and noise sets in the TIMIT training set are randomly mixed, 4500 voices containing noise are generated under each signal-to-noise ratio, and the voices are divided into a training set and a testing set according to the ratio of 9:1 to train a network model.
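The random mixing at −5, 0, and 5 dB can be sketched with the standard target-SNR scaling rule (file I/O and the TIMIT/NOISEX-92 corpora are omitted; the scaling rule is a common convention, not quoted from the patent):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the clean/noise power ratio equals snr_db,
    then add it to the clean speech."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(4)
clean = np.sin(2 * np.pi * 300 * np.arange(8000) / 16000)  # stand-in utterance
noise = rng.standard_normal(8000)                          # stand-in noise
for snr in (-5, 0, 5):                 # the three SNR conditions used in the text
    noisy = mix_at_snr(clean, noise, snr)
    d = noisy - clean
    got = 10 * np.log10(np.mean(clean ** 2) / np.mean(d ** 2))
    print(snr, round(got, 1))          # achieved SNR matches the target
```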
Preprocessing the voice with noise and the pure voice and extracting characteristic parameters of the voice with noise and the pure voice, wherein the method specifically comprises the following steps:
step 1-1, let y(n) denote the noisy speech signal, with y(n) = d(n) + x(n), where d(n) is the noise signal and x(n) is the clean signal; x(n) and d(n) are assumed statistically independent with zero mean. The noisy speech is transformed into the frequency domain by windowing M samples of y(n) with w(n) and performing an M-point FFT (fast Fourier transform), yielding the spectrum Y(l,k) of the noisy speech, where l is the frame index and k is the frequency index, k = 0, 1, 2, …, M−1.
The spectrum X(l, k) of the clean speech x(n) and the spectrum D(l, k) of the noise signal d(n) are obtained in the same way, namely:
window M samples of x(n) with w(n) and perform an M-point FFT to transform the clean speech into the frequency domain, giving the spectrum X(l, k) of the clean speech;
window M samples of d(n) with w(n) and perform an M-point FFT to transform the noise signal into the frequency domain, giving the spectrum D(l, k) of the noise signal.
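Step 1-1 amounts to a framed, windowed FFT; a minimal sketch follows (the Hamming window, frame length M = 512 and hop 256 are assumptions — the patent fixes only that each frame holds M samples):

```python
import numpy as np

def stft_frames(signal, M=512, hop=256):
    """Split a signal into frames, window each with w(n) (Hamming assumed),
    and take an M-point FFT, giving the spectrum Y(l, k) of step 1-1."""
    w = np.hamming(M)
    n_frames = 1 + (len(signal) - M) // hop
    Y = np.empty((n_frames, M), dtype=complex)
    for l in range(n_frames):
        frame = signal[l * hop : l * hop + M]
        Y[l] = np.fft.fft(w * frame, n=M)
    return Y
```

The same function applied to x(n) and d(n) yields X(l, k) and D(l, k).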
Step 1-2, representing the frequency spectrum Y (l, k) of the voice with noise on a polar coordinate, and dividing the frequency spectrum into a magnitude spectrum and a phase spectrum, namely
Y(l,k)=|Y(l,k)|ej∠Y(l,k)
where |Y(l, k)| is the short-time magnitude spectrum of the noisy speech y(n) and ∠Y(l, k) is its phase spectrum. Similarly, the short-time magnitude spectrum |X(l, k)| of the clean speech x(n) and the short-time magnitude spectrum |D(l, k)| of the noise signal d(n) can be obtained, namely:
the spectrum X(l, k) of the clean speech is represented in polar coordinates and divided into a magnitude spectrum and a phase spectrum, i.e.
X(l,k)=|X(l,k)|ej∠X(l,k)
where |X(l, k)| is the short-time magnitude spectrum of the clean speech x(n) and ∠X(l, k) is its phase spectrum; representing the spectrum D(l, k) of the noise signal in polar coordinates, it likewise divides into a magnitude spectrum and a phase spectrum, i.e.
D(l,k)=|D(l,k)|ej∠D(l,k)
where |D(l, k)| is the short-time magnitude spectrum of the noise signal d(n) and ∠D(l, k) is its phase spectrum.
Step 1-3, calculating the log power spectrum S (n) of the voice with noise and the log power spectrum T (n) of the pure voice by using the following formula
S(n)=[loge(|Y(n,1)|2),loge(|Y(n,2)|2),...,loge(|Y(n,k)|2),...,loge(|Y(n,M-1)|2)]
T(n)=[loge(|X(n,1)|2),loge(|X(n,2)|2),...,loge(|X(n,k)|2),...,loge(|X(n,M-1)|2)]
where |Y(n, 1)|² is the power of the 1st frequency band of the nth frame of the noisy speech obtained by the short-time Fourier transform, |Y(n, 2)|² is the power of its 2nd frequency band, |Y(n, k)|² the power of its kth frequency band, and |Y(n, M-1)|² the power of its (M-1)th frequency band; |X(n, 1)|², |X(n, 2)|², |X(n, k)|² and |X(n, M-1)|² are likewise the powers of the 1st, 2nd, kth and (M-1)th frequency bands of the nth frame of the clean speech obtained by the short-time Fourier transform.
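The log power spectra S(n) and T(n) of step 1-3 are elementwise natural logs of the squared magnitudes; a sketch (the eps floor is an added numerical safeguard, not part of the patent's formula):

```python
import numpy as np

def log_power_spectrum(Y, eps=1e-12):
    """Log power spectrum of step 1-3: log_e |Y(n, k)|^2 for each frame n
    and frequency bin k. eps guards against log(0)."""
    return np.log(np.abs(Y) ** 2 + eps)
```

Applied to the noisy and clean spectra, this gives the training feature S(n) and training target T(n).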
Step 2: combining the computed feature data of the noisy speech and the clean speech, the frame signal-to-noise ratio is introduced to optimize the compensation-factor formula of the phase compensation algorithm, and the phase compensation factor is calculated with the improved formula.
Introducing the frame signal-to-noise ratio improves the phase compensation function; the specific operations are as follows:
Step 2-1: the conventional phase compensation function can be expressed as
Λ(l, k) = λ·Ψ(k)·|D̂(l, k)|
where Λ(l, k) is the phase compensation function, λ is the phase compensation factor, and |D̂(l, k)| is the noise estimate, for which the magnitude spectrum |Y(l, k)| of the noisy speech can be substituted. λ is an empirical value that was originally a constant and is later made adaptive; when the model is trained, the noise estimate is replaced directly by the actual noise magnitude used, yielding the new compensation factor. Ψ(k) is an antisymmetric function used for phase correction, equal to +1 on the lower half of the frequency band and -1 on the upper half.
The action of Λ (l, k) is shown by the following formula,
YΛ(l,k)=Y(l,k)+Λ(l,k)
wherein YΛ(l, k) is the compensated spectrum; the phase spectrum ∠YΛ(l, k) shown in the following formula is obtained by extracting the phase of the compensated spectrum,
∠YΛ(l,k)=arg(YΛ(l,k))
The phase spectrum ∠YΛ(l, k) is then combined with the magnitude spectrum |Y(l, k)| of the noisy speech, giving the spectrum of the enhanced speech as
X̂(l, k) = |Y(l, k)|e^(j∠YΛ(l, k))
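The conventional PSC pipeline of step 2-1 can be sketched end to end; the closed form Λ(l, k) = λΨ(k)|D̂(l, k)| and the constant λ = 3.74 come from the PSC literature and are assumptions here, since the patent's equation images are not reproduced:

```python
import numpy as np

def psc_enhance_phase(Y, D_mag, lam=3.74):
    """Conventional phase spectrum compensation: add lam * Psi(k) * |D_hat|
    to the noisy spectrum, re-extract the phase, and recombine it with the
    original noisy magnitude. Psi(k) is the antisymmetric function: +1 on
    the lower half of the band, -1 on the upper half (DC and Nyquist bins
    are left uncompensated)."""
    M = Y.shape[1]
    psi = np.zeros(M)
    psi[1 : M // 2] = 1.0
    psi[M // 2 + 1 :] = -1.0
    Lam = lam * psi[None, :] * D_mag           # Λ(l, k)
    Y_comp = Y + Lam                           # Y_Λ(l, k) = Y(l, k) + Λ(l, k)
    phase = np.angle(Y_comp)                   # ∠Y_Λ(l, k)
    return np.abs(Y) * np.exp(1j * phase)      # |Y| with compensated phase
```

Note that only the phase changes: the magnitude of the output equals the noisy magnitude spectrum, as the text above describes.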
Step 2-2: the frame signal-to-noise ratio is introduced to optimize the phase compensation function, giving the optimized phase compensation function.
In the optimized function, c is an empirical value, typically set to 2.7, and Λ(l, k) decreases as the frame signal-to-noise ratio SNRl grows: when the current frame is a speech frame, the influence of the phase compensation function Λ(l, k) on the noisy speech is reduced, and more speech detail is preserved. SNRl is the signal-to-noise ratio of the lth frame.
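The frame signal-to-noise ratio SNRl itself can be computed from per-frame powers; how it enters the optimized compensation function is given by the patent's (unreproduced) formula, so only the SNR computation is sketched here:

```python
import numpy as np

def frame_snr_db(X_mag, D_mag, eps=1e-12):
    """Per-frame SNR used to scale the compensation in step 2-2:
    SNR_l = 10*log10( sum_k |X(l,k)|^2 / sum_k |D(l,k)|^2 ).
    X_mag, D_mag: magnitude spectra of shape (frames, bins)."""
    num = np.sum(X_mag ** 2, axis=1)
    den = np.sum(D_mag ** 2, axis=1) + eps
    return 10 * np.log10(num / den + eps)
```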
Step 2-3, in order to simplify the training target of the subsequent neural network, the optimized phase compensation function is abbreviated as the following formula,
Λ(l,k)=Ψ(k)×Q(l,k)
where Q(l, k) is the new compensation factor; the antisymmetric function is factored out, and Q(l, k) is taken as one of the training targets of the subsequent neural network. The calculation formula for Q(l, k) is expressed as,
and 3, building a full convolution neural network model, using the solved phase compensation factor and the log power spectrum of the pure voice as a training target of a Full Convolution Neural Network (FCNN), and training the full convolution neural network model.
The design of a training target and a loss function for integrating phase spectrum compensation is provided, and the specific method comprises the following steps:
Step 3-1: for frequency-domain speech enhancement, the log power spectrum of the noisy speech and the log power spectrum of the clean speech are used as the training feature and the training target of the neural network, respectively.
Here T̂n is the approximation of the clean-speech log power spectrum Tn obtained by training the network on the noisy-speech log power spectrum, and Sn is the log power spectrum of the noisy speech. T̂n is then combined with the phase αn of the nth frame of the noisy speech to obtain the time-domain waveform of the nth frame of the enhanced speech.
Step 3-2: on the basis of frequency-domain speech enhancement, the phase compensation algorithm is fused in as a new training target, which comprises two parts: the log power spectrum Tn of the clean speech and the phase compensation factor Qn.
After the estimated phase compensation factor Q̂n is adjusted, the phase compensation function is obtained, and the enhanced phase of the nth frame is derived from it; this enhanced phase replaces the noisy-speech phase αn, achieving an improved speech enhancement effect.
Step 3-3: for the training target fusing phase spectrum compensation proposed in the previous step, the loss function of the neural network is defined as
MSE = (1/M) Σ_{k=0}^{M-1} [ (T(n, k) - T̂(n, k))² + β(Q(n, k) - Q̂(n, k))² ]
where MSE is the joint mean square error of the log power spectrum and the phase spectrum compensation, so that both jointly influence the network parameters; M is the frame length; T(n, k) is the log power spectrum of the nth frame of the clean speech; T̂(n, k) is the approximation of the clean-speech log power spectrum obtained by training the network on the log power spectrum of the nth frame of the noisy speech; Q̂(n, k) is the approximation of the phase compensation factor obtained by neural network training; and β is a modulation parameter, a constant balancing the influence of the log power spectrum and the phase compensation on the network.
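The joint loss of step 3-3 can be sketched as follows; β = 0.1 is an illustrative value, and the per-frame averaging over M bins is an assumption about the unreproduced formula:

```python
import numpy as np

def joint_mse_loss(T, T_hat, Q, Q_hat, beta=0.1):
    """Joint MSE of the log power spectrum and the phase compensation factor
    for one frame, with beta balancing the two terms (a plausible reading
    of the patent's loss; the exact formula image is not reproduced)."""
    M = T.shape[-1]
    return float(np.sum((T - T_hat) ** 2 + beta * (Q - Q_hat) ** 2) / M)
```

With β = 0 this reduces to the plain log-power-spectrum MSE of step 3-1.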
Step 3-4, constructing a full convolution neural network model for training
The invention adopts a fully convolutional neural network (FCNN). Unlike a conventional convolutional neural network, the FCNN replaces the fully connected layers with additional convolutional layers and uses them to extract deeper, more abstract speech features. The FCNN model is broadly divided into an input layer, encoder layers, decoder layers, and an output layer.
The input layer of the model is a 5 × 257 matrix: the context length of the input features is 5 frames with 257 data points per frame, and the input is composed of the log power spectra of 5 adjacent frames of noisy speech, which increases the correlation of the input data in both the time and frequency dimensions.
The encoder layers of the model consist of two convolutional layers and two dropout layers. Each convolutional layer has three hyperparameters: kernel size, stride, and padding mode. Convolutional layer 1 is set as follows: kernel size 5; stride 1; 'same' padding; 32 kernels; ReLU (Rectified Linear Unit) activation, which greatly reduces the amount of computation. After convolutional layer 1, the dropout rate is set to 0.2. Convolutional layer 2 is set as follows: kernel size 7; 'same' padding; stride 1; 16 kernels; ReLU activation; the dropout rate is again 0.2. The encoder first extracts shallow features of the noisy speech through convolutional layer 1 and then extracts more abstract deep speech features through convolutional layer 2 for encoding.
The decoder layers of the model likewise consist of two convolutional layers and two dropout layers, with a network structure symmetric to that of the encoder. Convolutional layer 3 is set as follows: kernel size 7; 'same' padding; stride 1; 16 kernels; ReLU activation; dropout rate 0.2. Convolutional layer 4 is set as follows: kernel size 5; 'same' padding; stride 1; 32 kernels; ReLU activation; dropout rate 0.2. The decoder acts inversely to the encoder, restoring the features compressed by the encoder to produce clean speech features.
The output layer of the model is a fully connected layer directly connected to the decoder, containing 257 neurons with a linear activation function; the output is 1 frame of speech feature data and 1 frame of phase spectrum compensation factors.
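The arithmetic of one encoder convolution (kernel size 5, stride 1, 'same' padding, ReLU) can be sketched in plain numpy; this illustrates the layer configuration only and is not the patent's training code:

```python
import numpy as np

def conv1d_same_relu(x, kernels):
    """One FCNN convolutional layer as configured in step 3-4: stride 1,
    'same' padding, ReLU activation. x has shape (length, in_ch); kernels
    has shape (k, in_ch, out_ch) with odd k (5 or 7 above). A numpy sketch
    of the layer arithmetic, not a full training framework."""
    k, in_ch, out_ch = kernels.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))            # 'same' padding
    out = np.empty((x.shape[0], out_ch))
    for i in range(x.shape[0]):
        patch = xp[i:i + k]                          # (k, in_ch) receptive field
        out[i] = np.tensordot(patch, kernels, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)                      # ReLU
```

With 'same' padding and stride 1, the output length equals the input length, so the encoder/decoder stack preserves the 5 × 257 time-frequency layout.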
And 3-5, taking the logarithmic power spectrum of the phase compensation factor combined with the pure voice as a training target of the full convolution neural network, and training the full convolution neural network model to obtain the trained full convolution neural network model.
And 4, inputting the test voice into the trained model to obtain an estimated value of the logarithmic power spectrum and a phase compensation function.
Features of the noisy speech to be tested are extracted according to the method of step 1, and the extracted feature parameters are input into the trained full convolution neural network model to obtain the log power spectrum estimate and the phase compensation factor of the speech signal.
And 5, respectively reconstructing the amplitude spectrum and the phase spectrum of the voice signal by using the estimated value of the logarithmic power spectrum and the phase compensation function obtained in the previous step to obtain the final enhanced voice.
The magnitude spectrum and the phase spectrum of the speech signal are reconstructed using the log power spectrum and the phase compensation factor, and the final enhanced speech is obtained through the inverse short-time Fourier transform (ISTFT).
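The final resynthesis can be sketched as an inverse FFT per frame followed by overlap-add; the frame/hop sizes are assumptions, and window compensation (dividing by the summed squared window) is omitted for brevity:

```python
import numpy as np

def istft_overlap_add(spec, hop=256):
    """Resynthesize the enhanced waveform from the reconstructed spectrum
    (step 5) by inverse FFT and overlap-add. spec: (frames, M) complex."""
    n_frames, M = spec.shape
    out = np.zeros((n_frames - 1) * hop + M)
    for l in range(n_frames):
        frame = np.fft.ifft(spec[l]).real      # back to the time domain
        out[l * hop : l * hop + M] += frame    # overlap-add
    return out
```

The spectrum fed in here is the enhanced one, i.e. the magnitude recovered from the estimated log power spectrum combined with the compensated phase.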
Examples
To verify the effectiveness of the method, the traditional PSC speech enhancement algorithm, the traditional FCNN algorithm, and the proposed method were compared, testing algorithm performance under different types of interfering noise and different input signal-to-noise ratios. Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) were selected as the speech evaluation indices. PESQ is obtained by comparing the enhanced speech signal with the clean speech signal and takes values in [-0.5, 4.5]; a larger value indicates higher speech quality. STOI is likewise obtained by comparing the enhanced speech signal with the clean speech signal and takes values in [0, 1]; a larger value indicates higher speech intelligibility. The compared methods are as follows:
the method comprises the following steps: PSC
The second method comprises the following steps: FCNN algorithm
The third method comprises the following steps: the method of the invention
The three methods are respectively used for enhancing the noise-carrying voice with signal-to-noise ratios of-5 dB, 0dB and 5dB, and the noise types are factory2(fac), babble (bab) and buccaneer1 (buc). The results are shown in tables 1 to 3.
Table 1: -5 dB noise
Table 2: 0 dB noise
Table 3: 5 dB noise
As can be seen from the above results, compared with the conventional phase compensation algorithm (PSC), the method of the present invention can effectively suppress various types of noise in the speech signal, and better ensure the speech intelligibility. Compared with the FCNN algorithm, the method adds estimation of the phase compensation factor, and improves the effect under fac and buc noise conditions, so that the overall effect of speech enhancement is improved.
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can understand that the modifications or substitutions within the technical scope of the present invention are included in the scope of the present invention, and therefore, the scope of the present invention should be subject to the protection scope of the claims.
Claims (8)
1. A speech enhancement algorithm based on improved phase spectrum compensation and a full convolution neural network, comprising the steps of:
step 1, preprocessing the voice data of a training set to obtain characteristic data of noise-containing voice and pure voice;
step 2, combining the characteristic data of the voice with noise and the pure voice, introducing a compensation factor calculation formula of a frame signal-to-noise ratio optimization phase compensation algorithm, and calculating a phase compensation factor by using the improved formula;
step 3, building a full convolution neural network model, taking the phase compensation factor and the logarithmic power spectrum of the pure voice as a training target of the full convolution neural network, and training the full convolution neural network model;
step 4, inputting the test voice into the trained model to obtain an estimated value of a logarithmic power spectrum and a phase compensation function;
and 5, respectively reconstructing the amplitude spectrum and the phase spectrum of the voice signal by utilizing the estimated value of the logarithmic power spectrum and the phase compensation function to obtain the final enhanced voice.
2. The speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network as claimed in claim 1, wherein in step 1, the noisy speech and the clean speech are preprocessed and the feature parameters of the noisy speech and the clean speech are extracted, and the operations are as follows:
step 1-1, let y(n) denote the noisy speech signal, with y(n) = d(n) + x(n), where d(n) is the noise signal and x(n) is the clean signal; the noisy speech is transformed into the frequency domain by windowing M samples of y(n) with w(n) and performing an M-point FFT, giving the spectrum Y(l, k) of the noisy speech, where l is the frame index and k = 0, 1, 2, ..., M-1 indexes the frequency components; the spectrum X(l, k) of the clean speech x(n) and the spectrum D(l, k) of the noise signal d(n) are obtained in the same way;
step 1-2, representing the frequency spectrum Y (l, k) of the voice with noise on a polar coordinate, and dividing the frequency spectrum into a magnitude spectrum and a phase spectrum, namely
Y(l,k)=|Y(l,k)|ej∠Y(l,k)
wherein |Y(l, k)| is the short-time magnitude spectrum of the noisy speech y(n) and ∠Y(l, k) is its phase spectrum; similarly, the short-time magnitude spectrum |X(l, k)| of the clean speech x(n) and the short-time magnitude spectrum |D(l, k)| of the noise signal d(n) can be obtained;
step 1-3, calculating the log power spectrum S (n) of the noisy speech and the log power spectrum T (n) of the clean speech by using the following formula,
S(n)=[loge(|Y(n,1)|2),loge(|Y(n,2)|2),...,loge(|Y(n,k)|2),...,loge(|Y(n,M-1)|2)]
T(n)=[loge(|X(n,1)|2),loge(|X(n,2)|2),...,loge(|X(n,k)|2),...,loge(|X(n,M-1)|2)]
wherein |Y(n, 1)|² is the power of the 1st frequency band of the nth frame of the noisy speech obtained by the short-time Fourier transform, |Y(n, 2)|² is the power of its 2nd frequency band, |Y(n, k)|² the power of its kth frequency band, and |Y(n, M-1)|² the power of its (M-1)th frequency band; |X(n, 1)|², |X(n, 2)|², |X(n, k)|² and |X(n, M-1)|² are likewise the powers of the 1st, 2nd, kth and (M-1)th frequency bands of the nth frame of the clean speech obtained by the short-time Fourier transform.
3. The speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network according to claim 2, wherein in the step 2, the phase compensation function is optimized as follows:
step 2-1, the conventional phase compensation function is expressed as
wherein Λ(l, k) is the phase compensation function, λ is the phase compensation factor, and |D̂(l, k)| is the noise estimate, for which the magnitude spectrum |Y(l, k)| of the noisy speech can generally be substituted; λ is an empirical value, and Ψ(k) is an antisymmetric function used for phase correction, equal to +1 on the lower half of the frequency band and -1 on the upper half;
The action of Λ (l, k) is shown by the following formula,
YΛ(l,k)=Y(l,k)+Λ(l,k)
wherein YΛ(l, k) is the compensated spectrum; the phase spectrum ∠YΛ(l, k) shown in the following formula is obtained by extracting the phase of the compensated spectrum,
∠YΛ(l,k)=arg(YΛ(l,k))
The phase spectrum ∠YΛ(l, k) is combined with the magnitude spectrum |Y(l, k)| of the noisy speech to obtain the spectrum of the enhanced speech as X̂(l, k) = |Y(l, k)|e^(j∠YΛ(l, k));
Step 2-2, introducing frame signal-to-noise ratio to optimize a phase compensation function to obtain an optimized phase compensation function, namely
where c is an empirical value and SNRl is the signal-to-noise ratio of the lth frame.
4. The speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network as claimed in claim 3, wherein the step 2 further comprises a step 2-3 of simplifying the training goal of the subsequent neural network by simplifying the optimized phase compensation function as the following formula,
Λ(l,k)=Ψ(k)×Q(l,k)
wherein the calculation formula of Q (l, k) is expressed as,
5. the speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network as claimed in claim 4, wherein in the step 3, a training target and a loss function design for fusing the phase spectrum compensation are proposed, and the specific method is as follows:
step 3-1, for the voice enhancement of the frequency domain, adopting the logarithmic power spectrum of the voice with noise and the logarithmic power spectrum of the pure voice as the training characteristic and the training target of the neural network respectively, namely
wherein T̂n is the approximation of the clean-speech log power spectrum Tn obtained by training the network on the noisy-speech log power spectrum, and Sn is the log power spectrum of the noisy speech; the phase αn of the nth frame of the noisy speech is then combined to obtain the time-domain waveform of the nth frame of the enhanced speech, namely
Step 3-2, on the basis of voice enhancement of the frequency domain, a phase compensation algorithm is fused to serve as a new training target, and the new training target comprises a logarithmic power spectrum T of pure voicenAnd a phase compensation factor QnTwo parts, i.e.
Wherein the phase compensation factor is adjustedThen the phase compensation function is obtained, and the enhanced phase of the nth frame is obtained by the phase compensation function, so that the phase alpha of the voice with noise can be replacednAchieving the well-modified voice enhancement effect;
step 3-3, for the training target fusing the phase spectrum compensation and provided in the last step, defining the loss function of the neural network as
where MSE is the joint mean square error of the log power spectrum and the phase spectrum compensation, T(n, k) is the log power spectrum of the nth frame of the clean speech, T̂(n, k) is the approximation of the clean-speech log power spectrum obtained by training the network on the log power spectrum of the nth frame of the noisy speech, Q̂(n, k) is the approximation of the phase compensation factor obtained by neural network training, and β is a modulation parameter;
3-4, building a full convolution neural network model which is mainly divided into an input layer, an encoder layer, a decoder layer and an output layer;
and 3-5, taking the logarithmic power spectrum of the phase compensation factor combined with the pure voice as a training target of the full convolution neural network, and training the full convolution neural network model to obtain the trained full convolution neural network model.
6. The speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network as claimed in claim 5, wherein in the step 3-4, the input layer is a 5 x 257 matrix, the length of the context representing the input features is 5 frames, each frame has 257 data points, the input data is composed of the logarithmic power spectrum of the noisy speech of the adjacent 5 frames, and the correlation of the input data in the two dimensions of time and frequency is increased;
the encoder layer consists of two convolution layers and two dropout layers, wherein the convolution layers have three hyper-parameters which are respectively the size of a convolution kernel, a moving step length and a filling mode;
the decoder layer also comprises two convolution layers and two dropout layers, and the network structure of the decoder layer is symmetrical to that of the encoder layer;
the output layer is a full connection layer, is directly connected with a decoder and is provided with 257 neurons, the activation function selects a linear function, and the output is 1 frame of voice characteristic data and 1 frame of phase spectrum compensation factors.
7. The speech enhancement algorithm based on the improved phase spectrum compensation and the full convolution neural network as claimed in claim 6, wherein in step 4, the characteristic extraction is performed on the noisy speech to be tested according to the method in step 1, and the extracted characteristic parameters are input into the trained full convolution neural network model, so as to obtain the logarithm power spectrum estimation value and the phase compensation factor of the speech signal.
8. The speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network as claimed in claim 7, wherein in the step 5, the magnitude spectrum and the phase spectrum of the speech signal are reconstructed by using the logarithmic power spectrum and the phase compensation factor, and the final enhanced speech is obtained by performing the inverse short-time Fourier transform (ISTFT).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111534489.7A CN114242099A (en) | 2021-12-15 | 2021-12-15 | Speech enhancement algorithm based on improved phase spectrum compensation and full convolution neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114242099A true CN114242099A (en) | 2022-03-25 |