CN110808057A

CN110808057A - Voice enhancement method for generating confrontation network based on constraint naive

Info

Publication number: CN110808057A
Application number: CN201911051607.1A
Authority: CN
Inventors: 袁丛琳; 孙成立
Original assignee: Nanchang Hangkong University
Current assignee: Nanchang Hangkong University
Priority date: 2019-10-31
Filing date: 2019-10-31
Publication date: 2020-02-18

Abstract

The invention discloses a voice enhancement method for generating an confrontation network based on constraint naive, which comprises the following steps: 1) noisy data collection and labeling; 2) voice framing and windowing; 3) amplitude compression; 4) inputting constraint naive to generate confrontation network training; 5) amplitude decompression; 6) inverse short-time Fourier transform to generate enhanced speech. The invention has the advantages that: continuously enhancing the capability of generating a model generation sample by generating countermeasure learning between a generation model and a discrimination model in a countermeasure network, and finally obtaining the distribution of a clean voice sample; there is no assumption about the statistical distribution of speech or noise; the method of complex spectrum mapping is adopted, and phase information is added in the training sample. The invention skillfully solves the problem that the distribution of the voice and noise signals is difficult to estimate, is beneficial to improving the voice intelligibility and avoids phase distortion.

Description

Voice enhancement method for generating confrontation network based on constraint naive

Technical Field

The invention relates to the technical field of voice processing, in particular to a voice enhancement method for generating an confrontation network based on constraint naive.

Background

Voice has played an important role in the fields of mobile communication, multimedia technology, and the like as a main medium for human communication. Under the great background of the rise of artificial intelligence, the wide application of technologies such as voice recognition and voiceprint recognition also puts higher requirements on the quality of voice signals. However, in an actual voice capturing and dialogue communication scenario, a voice signal is often interfered by various noises, mainly including background noise, channel noise and interference noise. Speech enhancement is an effective technique to address noise pollution.

The traditional speech enhancement methods mainly have four types: (1) the spectral subtraction is to subtract the power spectrum of the noise-containing speech signal from the power spectrum of the noise-free speech signal by using the short-time stationarity of the speech to obtain the power spectrum estimation of the pure speech signal. This method is prone to the "musical noise" problem; (2) the wiener filter method is to estimate the spectral coefficient of speech from given noisy speech by a wiener filter under the condition of supposing that the speech and additive noise both obey Gaussian distribution. When the adjustment of the filter parameters reaches the limit or is in an unsteady noise environment, the effect of the wiener filtering method is poor; (3) the method is based on the minimum mean square error estimation (MMSE) of spectral amplitude, and estimates the probability distribution of spectral coefficients through statistical learning assuming that the speech amplitude spectrum satisfies a certain distribution, such as gaussian distribution, gamma distribution, etc. However, the assumed distribution and the true distribution are often not consistent; (4) the subspace method is to place clean speech in a low rank signal subspace, place noise signals in a noise subspace, and the signal subspace and the noise subspace are orthogonal to each other. The method obtains a pure voice signal by setting the noise subspace to zero and then filtering the signal subspace. This method does not take into account the prior knowledge of speech and noise, making it difficult to completely remove the noise subspace.

Disclosure of Invention

The invention aims to solve the problems that: the method for enhancing the voice of the countermeasure network based on constraint naive generation is provided, the problem that the distribution of voice and noise signals is difficult to estimate is solved skillfully, the improvement of the voice intelligibility is facilitated, and the phase distortion is avoided.

The technical scheme provided by the invention for solving the problems is as follows: a method for generating a voice enhancement of a countermeasure network based on constraint naive, the method comprising the steps of,

(1) noise data collection and labeling;

(2) voice frame windowing;

(3) compressing the amplitude;

(4) inputting constraint naive to generate confrontation network training;

(5) amplitude decompression;

(6) and performing inverse short-time Fourier transform to generate enhanced voice.

Preferably, the noise data collection and labeling in step (1) specifically includes the following steps:

(1.1) data collection: the method comprises the steps of adopting the voice of a NOIZE library as pure voice, adopting noise in a NOISEX-92 noise library as a noise signal, and enabling the sampling frequency to be 8 KHz;

(1.2) data tagging: each noise is superimposed on the clean speech with a signal-to-noise ratio of-5 dB, 0dB, 5dB, 10dB and 15dB, respectively, as a noisy speech data set.

Preferably, the step (2) of windowing the speech frames refers to framing the noisy speech by using a hamming window with a length of 512 and a frame shift of 50%, and the number of short-time fourier transform points is 1024.

Preferably, the amplitude compression in the step (3) is to perform amplitude compression on the complex spectrum concatenation vector by using a hyperbolic tangent function, and the value range is limited to [ -1,1], and the hyperbolic tangent function is defined as

Preferably, the generation of the confrontation network training by inputting the constraint naive in the step (4) can be divided into a network model initialization, a training discriminator, a training generator and an output training model, and specifically comprises the following steps:

(4.1) initializing a network model: initializing a generator and a discriminator; the generator G is realized through a convolution layer and a deconvolution layer, and the activation function selects the PReLU; the discriminator D is realized by a convolution layer, and an activation function selects LeakyReLU; adopting a zero filling strategy of 'same' and adopting BatchNormalization to normalize each layer; the optimizer selects RMSprop, and the learning rate is 0.0002;

(4.2) trainingAn exercise discriminator: compressing the complex spectrum training of the pure voice sample obtained in the step 3) to ensure that

Approaching to 1; compressing the complex spectrum training of the noisy speech sample obtained in the step 3) to enhance the speech complex spectrum

And is

Approaching to 0;

(4.3), training generator: compressing the complex spectrum of the pure voice sample and the noisy voice sample obtained in the step 3), training, freezing the discriminator and training the generator, so that the discriminator D can enhance the complex spectrum of the voice

And is

Approaching to 1;

(4.4) outputting a training model: and (4.1) repeating the steps (4.1) to (4.3) until the model is converged, and outputting the generator G and the discriminator D.

Preferably, the amplitude decompression in the step (5) is amplitude decompression of the enhanced complex spectrum concatenation vector by using an inverse hyperbolic tangent function, where the inverse hyperbolic tangent function is defined as:

compared with the prior art, the invention has the advantages that: continuously enhancing the capability of generating a model generation sample by generating countermeasure learning between a generation model and a discrimination model in a countermeasure network, and finally obtaining the distribution of a clean voice sample; there is no assumption about the statistical distribution of speech or noise; the method of complex spectrum mapping is adopted, and phase information is added in the training sample. The invention skillfully solves the problem that the distribution of the voice and noise signals is difficult to estimate, is beneficial to improving the voice intelligibility and avoids phase distortion.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention.

Fig. 1 is a schematic diagram of the operation of the present invention.

Fig. 2 is a schematic block diagram of the present invention constraint naive generated countermeasure network.

Detailed Description

The following detailed description of the embodiments of the present invention will be provided with reference to the accompanying drawings and examples, so that how to implement the embodiments of the present invention by using technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented.

The invention adopts a flow chart of a voice enhancement method based on a constraint naive generation countermeasure network (CN-GAN) as shown in figure 1 to realize voice denoising under the environment with low signal-to-noise ratio, and the specific implementation steps are as follows:

1) noisy data collection and labeling

(1.1) data collection: the embodiment of the invention adopts sp 01-sp 30 voices of a NOIZE library as pure voices, adopts babble noise, white noise, hfchannel noise and buccaneer1 noise in a NOISEX-92 noise library as noise signals, and has the sampling frequency of 8 KHz;

(1.2) data tagging: and (3) superposing the four noises in the step (1.1) to pure voice with signal-to-noise ratios of-5 dB, 0dB, 5dB, 10dB and 15dB respectively to serve as a noisy voice data set. The data set was divided into a training set and a test set on a 3:1 scale.

2) Speech framing windowing

The method comprises the steps of framing noisy speech by adopting a Hamming window with the length of 512 and the frame shift of 50%, and connecting a real part and an imaginary part of a complex spectrum in series to form a vector by using the point number of short-time Fourier transform (STFT) of 1024 to obtain the complex spectrum of the noisy speech, wherein the point number is used as a network training target.

3) Amplitude compression

Performing amplitude compression on the complex spectrum concatenation vector obtained in the step 2) by using a hyperbolic tangent function, and performing amplitude compression on the noisy speech complex spectrum shown in the figure 1

Real part Z of_rAnd an imaginary part Z_iIs limited to the range of [ -1,1 [)]Then Z is_rAnd Z_iIs used as input of CN-GAN, and X is calculated by CN-GAN_rAnd X_iIs estimated value of

And

the hyperbolic tangent function is defined as shown in formula (1):

4) input constraint naive generation confrontation network training

(4.1) network model initialization: the generator and the arbiter are initialized. The generator G is implemented by a convolutional layer and a deconvolution layer, and the activation function selects the PReLU. The discriminator D is implemented by a convolutional layer, and the activation function selects the leakyreu. The zero padding strategy of "same" is adopted, and the normalization of each layer is adopted. The optimizer selects RMSprop with a learning rate of 0.0002. Constraint naive generation of complex spectrum mapping confronts a network objective function as shown in equation (2):

in the formula, there is X_c＝[X_r'X_i']，Z_c＝[Z_r'Z_i']λ represents the tuning weight, E [. cndot.)]Representing a computational mathematical expectation.

(4.2) training the arbiter: compressing the complex spectrum training of the pure voice sample obtained in the step 3) to ensure that

And is

Approaching 0.

(4.3) training generator: compressing the complex spectrum of the pure voice sample and the noisy voice sample obtained in the step 3), training, freezing the discriminator and training the generator, so that the discriminator D can enhance the complex spectrum of the voice

And is

Approaching to 1;

5) Amplitude decompression

Using inverse hyperbolic tangent function to real part of enhanced complex spectrum concatenation vector obtained in step 4)

And imaginary part

Performing amplitude decompression to obtainAndthe inverse hyperbolic tangent function is defined as shown in equation (3):

6) inverse short-time Fourier transform to generate enhanced speech

And (3) performing inverse short-time Fourier transform (ISTFT) on the enhanced voice complex spectrum obtained in the step 5) to obtain a time domain waveform of the noise-reduced voice, and finishing the voice enhancement process.

And repeating the step 6) on all the noisy speeches in the test set to obtain an enhanced speech data set.

The foregoing is merely illustrative of the preferred embodiments of the present invention and is not to be construed as limiting the claims. The present invention is not limited to the above embodiments, and the specific structure thereof is allowed to vary. All changes which come within the scope of the invention as defined by the independent claims are intended to be embraced therein.

Claims

1. A voice enhancement method for generating an antagonistic network based on constraint naive is characterized in that: the method comprises the following steps of,

(1) noise data collection and labeling;

(2) voice frame windowing;

(3) compressing the amplitude;

(4) inputting constraint naive to generate confrontation network training;

(5) amplitude decompression;

2. The method of claim 1 for speech enhancement based on constraint naive generation countermeasure network, wherein: the noise data collection and marking in the step (1) specifically comprises the following steps:

3. The method of claim 1 for speech enhancement based on constraint naive generation countermeasure network, wherein: the step (2) of windowing the voice frames refers to framing the noisy voice by adopting a Hamming window with the length of 512 and the frame shift of 50%, and the number of short-time Fourier transform points is 1024.

4. The method of claim 1 for speech enhancement based on constraint naive generation countermeasure network, wherein: the amplitude compression in the step (3) refers to amplitude compression of the complex spectrum concatenation vector by using a hyperbolic tangent function, the value range is limited to [ -1,1], and the hyperbolic tangent function is defined as

5. The method of claim 1 for speech enhancement based on constraint naive generation countermeasure network, wherein: the generation of confrontation network training by inputting constraint naive in the step (4) can be divided into network model initialization, training discriminator, training generator and output training model, and specifically comprises the following steps:

(4.2), training a discriminator: compressing the complex spectrum training of the pure voice sample obtained in the step 3) to ensure that

And is

Approaching to 0;

And is

Approaching to 1;

6. The method of claim 1 for speech enhancement based on constraint naive generation countermeasure network, wherein: the amplitude decompression in the step (5) refers to amplitude decompression of the enhanced complex spectrum concatenation vector by using an inverse hyperbolic tangent function, wherein the inverse hyperbolic tangent function is defined as: