CN111653288B - Target-speaker speech enhancement method based on a conditional variational autoencoder - Google Patents


Info

Publication number
CN111653288B
Authority
CN
China
Prior art keywords
encoder
speech
voice
spectrum
noise
Prior art date
Legal status
Active
Application number
CN202010557116.0A
Other languages
Chinese (zh)
Other versions
CN111653288A (en)
Inventor
乐笑怀
卢晶
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202010557116.0A
Publication of CN111653288A
Application granted
Publication of CN111653288B


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10L 21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232: Processing in the frequency domain
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The invention discloses a target-speaker speech enhancement method based on a conditional variational autoencoder. The method comprises the following steps: (1) apply the short-time Fourier transform to clean speech data of the target speaker to obtain magnitude spectra; (2) train a conditional variational autoencoder as the speech model using the target speaker's clean-speech magnitude spectra and an identity encoding vector; (3) apply the short-time Fourier transform to the noisy speech signal to obtain its magnitude and phase spectra; (4) input the noisy magnitude spectrum and the target-speaker identity encoding vector into the speech model, fix the weights of the speech-model decoder, and jointly and iteratively optimize the speech model and a non-negative matrix factorization model to obtain magnitude-spectrum estimates of the speech and the noise; (5) combine the magnitude-spectrum estimate with the noisy-speech phase spectrum into a complex spectrum, then obtain the enhanced time-domain speech signal via the inverse short-time Fourier transform. The method can enhance the target speaker's speech under a variety of complex noises and is highly robust.

Description

Target-speaker speech enhancement method based on a conditional variational autoencoder
Technical Field
The invention belongs to the field of speech enhancement, and particularly relates to a target-speaker speech enhancement method based on a conditional variational autoencoder.
Background
When a microphone is used to collect the speech signal of a speaker in a real environment, various interfering signals such as background noise and room reverberation are collected at the same time. These interferences degrade the quality of the speech and, at low signal-to-noise ratios, severely degrade speech recognition accuracy. The technique of extracting the target speech from noise interference is called speech enhancement.
Spectral subtraction can be used to achieve speech enhancement (Boll, S.F. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech and Signal Processing, 27, 113-120.). In Chinese patent CN103594094A, the speech is transformed to the time-frequency domain by the short-time Fourier transform; an adaptive-threshold spectral subtraction then subtracts the estimated noise power spectrum from the power spectrum of the current frame of the speech signal to obtain the power spectrum of the enhanced signal, and the time-domain enhanced signal is finally obtained by the inverse short-time Fourier transform. However, because of the unreasonable assumptions it makes about speech and noise, this enhancement method significantly impairs speech quality.
Non-negative matrix factorization algorithms have also been used for speech enhancement (Wilson, K.W., Raj, B., Smaragdis, P., et al. Speech denoising using nonnegative matrix factorization with priors. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2008.). By applying non-negative matrix factorization separately to the short-time power spectra of speech and of noise, dictionaries of speech and noise can be obtained, and these dictionaries are then used during enhancement. Chinese patent CN104505100A uses a non-negative matrix factorization algorithm that combines spectral subtraction and the minimum mean square error for speech enhancement. However, non-negative matrix factorization models speech characteristics only linearly; it models the nonlinearity of speech poorly, which limits its performance.
Recently, various deep-learning-based generative models have been used for speech modeling. Among them, the variational autoencoder is a method that explicitly learns the data distribution and can model speech nonlinearly. The literature (S. Leglaive, L. Girin and R. Horaud, "A variance modeling framework based on variational autoencoders for speech enhancement," 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), Aalborg, 2018, pp. 1-6, doi: 10.1109/MLSP.2018.8516711.) uses a joint variational autoencoder and non-negative matrix factorization speech enhancement algorithm, in which the variational autoencoder model is trained in advance on clean-speech short-time power spectra and the non-negative matrix model is learned during enhancement; the algorithm enhances non-stationary noise well without damaging speech quality. However, because the variational autoencoder model is trained on generic clean speech, this enhancement model handles interference from other human voices poorly.
In practical applications the types of noise vary widely; besides non-human noise, extracting the target speaker's voice from interfering human voices is also of great significance.
Disclosure of Invention
In the prior art, when a variational autoencoder plus non-negative matrix factorization model is used for speech enhancement in an environment full of human-voice interference, the interfering voices are often retained, which degrades the enhancement. The invention provides a speech enhancement method based on a conditional variational autoencoder that effectively addresses the problem of human-voice interference and improves speech enhancement performance.
The invention adopts the following technical scheme:
A target-speaker speech enhancement method based on a conditional variational autoencoder, comprising the following steps:
Step 1: apply the short-time Fourier transform to clean speech data of the target speaker to obtain short-time magnitude spectra;
Step 2: construct an identity encoding vector for the target speaker, and train a conditional variational autoencoder as the speech model using this identity encoding vector and the short-time magnitude spectra obtained in step 1; the inputs of the conditional variational autoencoder are the target speaker's speech magnitude spectrum and identity encoding vector, and its output is the logarithm of the target speaker's speech magnitude spectrum;
Step 3: apply the short-time Fourier transform to the noisy speech signal to obtain its short-time magnitude spectrum, and retain the phase spectrum of the noisy speech signal;
Step 4: input the short-time magnitude spectrum of the noisy speech signal obtained in step 3 into the speech model, with the target-speaker identity encoding vector as the conditional term of the speech model, and fix the weights of the speech-model decoder; jointly and iteratively optimize the speech model and the non-negative matrix factorization model to obtain magnitude-spectrum estimates of the speech and the noise;
Step 5: combine the magnitude-spectrum estimate obtained in step 4 with the phase spectrum of the noisy speech signal retained in step 3 into a complex spectrum, and obtain the enhanced time-domain speech signal via the inverse short-time Fourier transform; a sketch of this pipeline follows.
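As an illustration only, the five steps map onto code roughly as follows. This is a minimal sketch assuming SciPy's STFT routines and the embodiment's 16 kHz / 1024-sample-window / 256-sample-hop settings given later; `enhance_magnitude` is a hypothetical placeholder standing in for the joint optimization of step 4, as are all other names here.

```python
# Minimal sketch of steps 1/3/5 (STFT analysis and synthesis). The joint
# CVAE-NMF optimization of step 4 is abstracted behind enhance_magnitude(),
# a hypothetical placeholder.
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, speaker_id_vec, fs=16000, n_fft=1024, hop=256):
    # Step 3: STFT of the noisy signal; keep magnitude and phase separately.
    _, _, X = stft(noisy, fs=fs, window="hann", nperseg=n_fft,
                   noverlap=n_fft - hop)
    mag, phase = np.abs(X), np.angle(X)
    # Step 4 (detailed later): joint CVAE-NMF iteration -> clean magnitude.
    mag_hat = enhance_magnitude(mag, speaker_id_vec)  # hypothetical placeholder
    # Step 5: recombine with the noisy phase and invert.
    _, s_hat = istft(mag_hat * np.exp(1j * phase), fs=fs, window="hann",
                     nperseg=n_fft, noverlap=n_fft - hop)
    return s_hat
```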
Further, in step 2, the conditional variational autoencoder uses deep neural networks as the encoder and the decoder; the encoder maps the speech magnitude spectrum to a random variable z, and the decoder maps the random variable z back to a clean-speech magnitude spectrum.
Further, in step 4, the specific steps of the joint iterative optimization of the speech model and the non-negative matrix factorization model are as follows:
1) The encoder and decoder of the conditional variational autoencoder can be expressed in the form:
$z_t \sim q_\phi(z_t \mid x_t, c)$
$x_t \sim p_\theta(x_t \mid z_t, c)$
where $x_t$ is the magnitude spectrum of the $t$-th frame of the input speech, $z_t$ is the latent variable output by the encoder for frame $t$, $c$ denotes the speaker identity vector, $\phi$ and $\theta$ denote the weights of the encoder and the decoder respectively, and $q_\phi$ and $p_\theta$ denote, respectively, the distribution of the latent variable generated by the encoder and the distribution of the magnitude-spectrum estimate generated by the decoder.
After the encoder and decoder are trained, the weights of the decoder $p_\theta(x_t \mid z_t, c)$ are fixed; during speech enhancement only the encoder weights are trained by backpropagation. The magnitude-spectrum estimate output by the speech model is $\sigma(z_t, c)$ and the corresponding power-spectrum estimate is $\sigma^2(z_t, c)$.
2) The non-negative matrix factorization may be expressed in the form:
$V = WH$
where $V \in \mathbb{R}_+^{F \times T}$ is the short-time noise power-spectrum estimate over $F$ frequency bins and $T$ frames, $\mathbb{R}_+$ denotes the positive real number domain, and a matrix factorization algorithm decomposes $V$ into two non-negative low-rank matrices $W \in \mathbb{R}_+^{F \times K}$ and $H \in \mathbb{R}_+^{K \times T}$, where $K$, the rank of the two matrices after decomposition, is much smaller than $F$ and $T$ (a stand-alone factorization sketch is given below);
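For orientation, a minimal stand-alone sketch of such an Itakura-Saito NMF factorization V ≈ WH follows. In the method itself, W and H are instead updated jointly with the speech model, as in step 3) below; the rank K = 10 matches the embodiment described later, everything else is illustrative.

```python
# Stand-alone sketch: factorize a non-negative power spectrogram V (F x T)
# into W (F x K) and H (K x T) with Itakura-Saito multiplicative updates.
import numpy as np

def is_nmf(V, K=10, n_iter=100, eps=1e-10):
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.uniform(0.1, 1.0, size=(F, K))
    H = rng.uniform(0.1, 1.0, size=(K, T))
    for _ in range(n_iter):
        Vh = W @ H + eps
        H *= ((W.T @ (V * Vh**-2)) / (W.T @ Vh**-1)) ** 0.5
        Vh = W @ H + eps
        W *= (((V * Vh**-2) @ H.T) / (Vh**-1 @ H.T)) ** 0.5
    return W, H
```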
3) During optimization, the noisy-speech magnitude spectrum $x_t$ and the target-speaker identity vector $c$ are input, and the non-negative matrix factorization parameters $W$, $H$ and the gain vector $a \in \mathbb{R}_+^{1 \times T}$ over the $T$ frames are initialized.
In each iteration, the following objective function is first optimized for the conditional variational autoencoder of step 1):
$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\big[\log p(x_t \mid z_t, c)\big] - D_{\mathrm{KL}}\big(q_\phi(z_t \mid x_t, c) \,\|\, p(z_t)\big)$
where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ denotes the K-L divergence between two distributions, $\mathbb{E}[\cdot]$ denotes expectation, and $p(z_t)$ is the probability density of the standard normal distribution;
the parameters W, H and a of the non-negative matrix factorization are then iterated using the iterative formula t
Figure BDA0002544713260000038
Figure BDA0002544713260000041
Figure BDA0002544713260000042
wherein ,
Figure BDA0002544713260000043
the f-th row t column element of (2) is represented by the formula +.>
Figure BDA0002544713260000044
Indicating (I)>
Figure BDA0002544713260000045
Figure BDA0002544713260000046
Represents the sum of q φ (z t |x t C) the sampled r sample; as indicated by matrix para-multiplication, T representing matrix transposition;
After a number of iterations, the resulting clean-speech estimate is expressed as:
$\hat{s}_{ft} = \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\!\left[\dfrac{a_t\,\sigma_f^2(z_t, c)}{a_t\,\sigma_f^2(z_t, c) + (WH)_{ft}}\right] x_{ft}$
where $x_{ft}$ and $\hat{s}_{ft}$ denote the $f$-th row, $t$-th column elements of the noisy-speech spectrum and of the clean-speech spectrum estimate respectively, and $(WH)_{ft}$ denotes the $f$-th row, $t$-th column element of the noise power-spectrum estimate.
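A compact sketch of one such joint iteration is given below, under the single-sample setting used later in the embodiment. `encoder`, `decoder_var` and `enc_opt` are hypothetical handles to the trained conditional variational autoencoder and its optimizer (the decoder's weights are assumed frozen), and the updates mirror the reconstructed formulas above.

```python
# Sketch of one joint iteration under the single-sample setting (r = 1).
# encoder/decoder_var are hypothetical handles to the trained CVAE; the
# decoder stays frozen and only the encoder receives gradient updates.
import numpy as np
import torch

def joint_step(P, c, W, H, a, encoder, decoder_var, enc_opt, eps=1e-10):
    """P = |X|^2 is the (F, T) noisy power spectrum; a is the (T,) gain."""
    t32 = lambda arr: torch.tensor(arr, dtype=torch.float32)
    # --- CVAE step: minimize IS reconstruction + KL w.r.t. encoder only ---
    mu, logvar = encoder(t32(P).T, c)              # per-frame mean / log-variance
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
    sig2 = decoder_var(z, c).T                     # sigma^2(z_t, c), shape (F, T)
    v_x = t32(a) * sig2 + t32(W @ H) + eps         # a_t sigma_f^2 + (WH)_ft
    ratio = t32(P) / v_x
    loss = (ratio - torch.log(ratio) - 1).sum() \
         + 0.5 * (mu**2 + logvar.exp() - logvar - 1).sum()
    enc_opt.zero_grad(); loss.backward(); enc_opt.step()
    # --- MM step: multiplicative updates with the sampled sigma^2 ---
    S2 = sig2.detach().numpy()
    Vx = a * S2 + W @ H + eps
    H *= ((W.T @ (P * Vx**-2)) / (W.T @ Vx**-1)) ** 0.5
    Vx = a * S2 + W @ H + eps
    W *= (((P * Vx**-2) @ H.T) / (Vx**-1 @ H.T)) ** 0.5
    Vx = a * S2 + W @ H + eps
    a *= ((S2 * P * Vx**-2).sum(axis=0) / (S2 * Vx**-1).sum(axis=0)) ** 0.5
    return W, H, a
```

After the last iteration, the clean-speech spectrum follows from the Wiener-like mask of the expression above: `S_hat = (a * S2) / (a * S2 + W @ H) * X`.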
Further, in step 3) the objective function is optimized using the following expression:
$\mathcal{L}(\phi) = -\sum_{f,t} d_{\mathrm{IS}}\big(|x_{ft}|^2,\; a_t\,\sigma_f^2(z_t, c) + v_{ft}\big) + \frac{1}{2}\sum_{t,d}\big(\log \tilde{\sigma}_d^2(x_t, c) - \tilde{\mu}_d^2(x_t, c) - \tilde{\sigma}_d^2(x_t, c)\big)$
where $d_{\mathrm{IS}}(\cdot, \cdot)$ denotes the Itakura-Saito divergence between two distributions, $v_{ft}$ is the $f$-th row, $t$-th column element of the noise short-time power-spectrum estimate $V$, and $\tilde{\mu}(x_t, c)$ and $\tilde{\sigma}^2(x_t, c)$ are the mean and variance vectors of the latent variable output by the encoder.
Compared with the prior art, the invention has the beneficial effect that speech enhancement can be performed in a variety of complex noise scenes; because target-speaker information is introduced into the training process, the method has a strong ability to suppress interference from non-target human voices.
Drawings
FIG. 1 is a process flow diagram of the target-speaker speech enhancement method based on a conditional variational autoencoder of the present invention.
FIG. 2 is a schematic diagram of the conditional variational autoencoder model employed in an embodiment of the present invention. The deep neural networks used in it are frame-independent fully connected networks; $|s_t|$ denotes the input clean-speech magnitude spectrum, $c$ denotes the one-hot identity vector of the speaker to whom the speech belongs, Embedding denotes the network layer that reduces the speaker identity vector to 10 dimensions, and $\sigma(z_t, c)$ denotes the magnitude spectrum of the output speech.
FIG. 3 compares the SDR values of speech enhanced by the prior-art variational autoencoder plus non-negative matrix factorization algorithm and by the method of the present invention under different noise types.
FIG. 4 compares the enhancement of target speech by the method of the present invention and by the existing algorithm based on the variational autoencoder plus non-negative matrix factorization model under a multi-speaker mixing condition: (a) is the short-time magnitude spectrum of the mixed speech, (b) that of the speech enhanced by the existing algorithm, and (c) that of the speech enhanced by the present method.
Detailed Description
The target-speaker speech enhancement method based on a conditional variational autoencoder of the present invention mainly comprises the following steps:
1. target person voice model training
1) Short-time Fourier transform of the target speaker's clean speech
Let the clean speech signal of the target speaker be $x(t)$. A short-time Fourier transform with an $N$-point FFT yields the complex spectrum $X = \{x_1, \ldots, x_T\}$ with $F = N/2 + 1$ frequency bins and $T$ frames, where $x_t \in \mathbb{C}^F$ and $|x_{ft}|$ denotes the magnitude of the $f$-th spectral component of the $t$-th frame.
2) Construction of the target-speaker identity vector
Given clean speech data from M speakers, each speaker's identity is marked as an M-dimensional one-hot vector: if a target speaker occupies the i-th position in the data set, the i-th component of the identity vector is 1 and all other components are 0, as in the snippet below.
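For example, with numpy (M = 629 matches the embodiment described later; the index is illustrative):

```python
import numpy as np

M, i = 629, 42          # number of speakers; position of the target speaker
c = np.zeros(M)
c[i] = 1.0              # M-dimensional one-hot identity vector
```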
3) Training of the conditional variational autoencoder
The conditional variational autoencoder model consists of an encoder and a decoder. The goal of the encoder is to map the speech magnitude spectrum $|x_t|$ to a random variable $z_t$, and the goal of the decoder is to map this random variable back to the speech magnitude spectrum; $z_t$ is generally assumed to follow a Gaussian distribution.
The models of the encoder and decoder can thus be expressed as:
$p(z_t) = \mathcal{N}(z_t;\, \mathbf{0}, \mathbf{I})$ (1)
$z_t \sim q_\phi(z_t \mid x_t, c)$ (2)
$x_t \sim p_\theta(x_t \mid z_t, c)$ (3)
where $c$ is the conditional term, i.e., the target-speaker identity vector from step 2); FIG. 2 shows how it is coupled to the encoder and decoder. In this embodiment, a neural network first reduces the M-dimensional identity vector to 10 dimensions, and this reduced output is concatenated with every hidden-layer output of the encoder and the decoder, as in the sketch below.
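A sketch of how such a conditional model could be written in PyTorch follows, using the dimensions of the embodiment (513 spectrum bins, one 512-unit tanh hidden layer, a 64-dimensional latent variable, a 10-dimensional speaker embedding). The class and layer names are illustrative, not the patent's own code.

```python
# Illustrative conditional VAE: the 10-dim speaker embedding is concatenated
# with the input of every layer of both the encoder and the decoder.
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, n_freq=513, n_hidden=512, n_latent=64,
                 n_spk=629, n_emb=10):
        super().__init__()
        self.emb = nn.Linear(n_spk, n_emb, bias=False)      # one-hot -> 10-dim
        self.enc_h = nn.Linear(n_freq + n_emb, n_hidden)
        self.enc_mu = nn.Linear(n_hidden + n_emb, n_latent)
        self.enc_logvar = nn.Linear(n_hidden + n_emb, n_latent)
        self.dec_h = nn.Linear(n_latent + n_emb, n_hidden)
        self.dec_out = nn.Linear(n_hidden + n_emb, n_freq)  # log magnitude

    def forward(self, mag, spk_onehot):
        e = self.emb(spk_onehot).expand(mag.shape[0], -1)
        h = torch.tanh(self.enc_h(torch.cat([mag, e], dim=-1)))
        mu = self.enc_mu(torch.cat([h, e], dim=-1))
        logvar = self.enc_logvar(torch.cat([h, e], dim=-1))
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # resampling
        h = torch.tanh(self.dec_h(torch.cat([z, e], dim=-1)))
        return self.dec_out(torch.cat([h, e], dim=-1)), mu, logvar
```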
The training goal of the conditional variational autoencoder is to maximize the likelihood of the decoder output; that is, the closer the speech spectrum output by the decoder is to the true speech spectrum, the better. Its objective function can therefore be written as the log-likelihood:
$\mathcal{L}(\theta) = \sum_t \log p_\theta(x_t \mid c)$ (4)
By variational inference, this objective can be decomposed into the following easier-to-compute form:
$\log p_\theta(x_t \mid c) \geq -D_{\mathrm{KL}}\big(q_\phi(z_t \mid x_t, c) \,\|\, p(z_t)\big) + \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\big[\log p_\theta(x_t \mid z_t, c)\big]$ (5)
where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ denotes the Kullback-Leibler (K-L) divergence between two distributions. The first term describes the K-L divergence between the latent variable output by the encoder and the standard normal distribution; specifically, the encoder outputs the mean and variance of $z_t$, and $z_t$ is then obtained using the resampling method shown in FIG. 2 and input to the decoder. The second term is the likelihood of inputting $x_t$ to the encoder to obtain the latent variable $z_t$ and inputting $z_t$ to the decoder to recover $x_t$; in the neural network it is realized by feeding $x$ into the network and making its output as close as possible to $x$, specifically by computing the Itakura-Saito (I-S) divergence of the network input and output:
$d_{\mathrm{IS}}(x, y) = \dfrac{x}{y} - \log\dfrac{x}{y} - 1$ (6)
The objective function can therefore be rewritten as:
$\mathcal{L}(\phi, \theta) = -\sum_{f,t} d_{\mathrm{IS}}\big(|x_{ft}|^2,\, \sigma_f^2(z_t, c)\big) + \frac{1}{2}\sum_{t,d}\big(\log \tilde{\sigma}_d^2(x_t, c) - \tilde{\mu}_d^2(x_t, c) - \tilde{\sigma}_d^2(x_t, c)\big)$ (7)
where $\tilde{\mu}(x_t, c)$ and $\tilde{\sigma}^2(x_t, c)$ are the mean and variance of $z_t$ output by the encoder network. Optimizing this model yields the target-speaker speech model. Speech is generally assumed to satisfy the following complex Gaussian distribution, which can be used as the speech model:
$s_{ft} \sim \mathcal{N}_c\big(0,\, \sigma_f^2(z_t, c)\big)$ (8)
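A hedged sketch of one training step built on the illustrative CVAE class above; the pairing of decoder output and variance follows the convention that the decoder emits the log magnitude spectrum, so its exponentiated square serves as the power-spectrum variance.

```python
# One illustrative training step minimizing the negative of objective (7).
import torch

def train_step(model, optimizer, mag, spk_onehot, eps=1e-10):
    log_mag_hat, mu, logvar = model(mag, spk_onehot)
    sig2 = torch.exp(log_mag_hat) ** 2 + eps       # sigma_f^2(z_t, c)
    ratio = (mag ** 2 + eps) / sig2
    is_div = (ratio - torch.log(ratio) - 1).sum()  # Itakura-Saito term of (7)
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum()  # KL(q || N(0, I))
    loss = is_div + kl
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return float(loss)
```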
2. speech enhancement using iterative algorithms
1) Short-time Fourier transform of the noisy speech signal
Let the noisy speech signal be $x(t)$; a short-time Fourier transform with an $N$-point FFT yields the complex spectrum $X = \{x_1, \ldots, x_T\}$.
2) Modeling noise with a non-negative matrix factorization model
Similarly to the speech model (8), the noise model can be described by the following distribution:
$n_{ft} \sim \mathcal{N}_c\big(0,\, (WH)_{ft}\big)$ (9)
where the matrix $V \in \mathbb{R}_+^{F \times T}$ is decomposed into the product of the following two low-rank matrices:
$V = WH$ (10)
Assuming the noise is additive, the noisy speech signal $x_{ft}$ can be expressed as:
$x_{ft} = \sqrt{a_t}\, s_{ft} + n_{ft}$ (11)
where $x_{ft}$, $s_{ft}$ and $n_{ft}$ denote the noisy-speech spectrum, the clean-speech spectrum and the noise spectrum respectively, and $a_t$ denotes the $t$-th element of the gain vector.
3) Iterative optimization
The parameters optimized in this embodiment are the noise-model parameters {W, H, a} and the speech-model parameters; the optimization objective is to maximize the likelihood of the noisy speech in (11), which can typically be done with the expectation-maximization (E-M) algorithm. For the speech model, it suffices to maximize the likelihood of the noisy speech, with the following objective function:
$\mathcal{L}(\phi) = -\sum_{f,t} d_{\mathrm{IS}}\big(|x_{ft}|^2,\; a_t\,\sigma_f^2(z_t, c) + v_{ft}\big) - \sum_t D_{\mathrm{KL}}\big(q_\phi(z_t \mid x_t, c) \,\|\, p(z_t)\big)$ (12)
Like the training objective (7), it computes the I-S divergence, here between the noisy speech power spectrum $|x_{ft}|^2$ and the estimated noisy speech power spectrum $a_t\,\sigma_f^2(z_t, c) + v_{ft}$; the optimization fixes the decoder weights $\theta$ and optimizes only $\phi$.
Since the trained decoder is already a suitable speech generation model, this embodiment fixes the decoder weights and optimizes only the encoder weights when optimizing the objective function. The noise model can be optimized with the Majorization-Minimization (MM) algorithm, specifically via the following iterative equations:
$H \leftarrow H \odot \left[\dfrac{W^{\mathsf T}\big(|X|^{\odot 2} \odot \sum_r (\hat{V}_x^{(r)})^{\odot -2}\big)}{W^{\mathsf T} \sum_r (\hat{V}_x^{(r)})^{\odot -1}}\right]^{\odot 1/2}$ (13)
$W \leftarrow W \odot \left[\dfrac{\big(|X|^{\odot 2} \odot \sum_r (\hat{V}_x^{(r)})^{\odot -2}\big) H^{\mathsf T}}{\big(\sum_r (\hat{V}_x^{(r)})^{\odot -1}\big) H^{\mathsf T}}\right]^{\odot 1/2}$ (14)
$a \leftarrow a \odot \left[\dfrac{\mathbf{1}^{\mathsf T}\big(|X|^{\odot 2} \odot \sum_r \hat{V}_s^{(r)} \odot (\hat{V}_x^{(r)})^{\odot -2}\big)}{\mathbf{1}^{\mathsf T} \sum_r \big(\hat{V}_s^{(r)} \odot (\hat{V}_x^{(r)})^{\odot -1}\big)}\right]^{\odot 1/2}$ (15)
where the $f$-th row, $t$-th column element of $\hat{V}_x^{(r)}$ can be represented by $a_t\,\sigma_f^2(z_t^{(r)}, c) + (WH)_{ft}$, the elements of $\hat{V}_s^{(r)}$ are $\sigma_f^2(z_t^{(r)}, c)$, $z_t^{(r)}$ denotes the $r$-th sample drawn from $q_\phi(z_t \mid x_t, c)$, and $\odot$ denotes element-wise (Hadamard) multiplication and exponentiation.
Through the above iterations, the final goal of this embodiment is to compute the clean-speech estimate, which can be expressed as the following expectation:
$\hat{s}_{ft} = \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\!\left[\dfrac{a_t\,\sigma_f^2(z_t, c)}{a_t\,\sigma_f^2(z_t, c) + (WH)_{ft}}\right] x_{ft}$ (16)
Examples
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings.
1. Training and testing samples and objective evaluation criteria
The training data and test samples of this embodiment are drawn from the TIMIT speech corpus, which contains speech data from 629 speakers. For each speaker, 7/8 of the data is taken as training data, and a 629-dimensional identity vector is constructed for each speaker. The test samples are formed by randomly mixing the remaining 1/8 of the TIMIT data with noises from 17 different scenes of the DEMAND noise corpus at signal-to-noise ratios between -5 dB and 5 dB, 850 samples in total.
The conditional variational autoencoder adopted in this embodiment is shown in FIG. 2. The encoder and the decoder are both frame-independent fully connected deep neural networks with hidden-layer dimension 512 and tanh activation functions. The encoder input is the short-time magnitude spectrum of clean speech, and its outputs are the mean and variance of $z_t$, which are combined into $z_t$ by the resampling method and fed to the decoder; the decoder output is the logarithm of the clean-speech short-time magnitude spectrum. The speaker identity vector is first reduced to 10 dimensions by a dimension-reduction (Embedding) layer and concatenated with the hidden-layer outputs of the encoder and decoder. During training, the clean-speech short-time magnitude spectra are shuffled along the time-frame direction before being fed into the network.
In this embodiment, the SDR (Signal-to-Distortion Ratio) is used as the objective index of enhancement performance. It describes the ratio of the target speech to the residual distortion in the enhanced speech and is calculated as follows:
$\mathrm{SDR} = 10 \log_{10} \dfrac{\|s_{\mathrm{target}}\|^2}{\|\hat{s} - s_{\mathrm{target}}\|^2}$ (17)
where $\hat{s}$ and $s$ denote the enhanced speech signal and the clean speech signal respectively, and $s_{\mathrm{target}} = \dfrac{\langle \hat{s}, s \rangle}{\|s\|^2}\, s$ is the energy-normalized projection of $\hat{s}$ onto $s$, on which the signal-to-noise ratio is then computed; the SDR can thus be regarded as an energy-scaled signal-to-noise ratio. In this embodiment, performance is evaluated by computing the SDR between the speech-enhanced test samples and the clean speech, as in the sketch below.
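A small numpy sketch of this computation (the standard energy-scaled projection; equal-length 1-D signals are assumed):

```python
import numpy as np

def sdr(s_hat, s):
    # Project the estimate onto the clean signal, then take the energy ratio.
    s_target = (np.dot(s_hat, s) / np.dot(s, s)) * s
    return 10.0 * np.log10(np.sum(s_target**2) / np.sum((s_hat - s_target)**2))
```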
2. Parameter setting
1) Short-time Fourier transform of the signal
In this embodiment, all signals are sampled at 16 kHz; the short-time Fourier transform uses a Hann window of length 1024 (64 ms) with a frame shift of 256 samples (16 ms).
2) Conditional variational autoencoder
The conditional variational autoencoder employed in this embodiment has input and output dimension 513, hidden-layer dimension 512, a latent variable $z_t$ of dimension 64, and a compressed speaker-identity-vector dimension of 10. Both the encoder and the decoder are fully connected networks with a single hidden layer. Training is optimized using the Adam optimizer with a learning rate of 0.001.
3) Non-negative matrix factorization
The rank of the non-negative matrix factorization used in this embodiment is 10.
4) Iteration parameters
During the iterations, the encoder of the conditional variational autoencoder is still optimized with the Adam optimizer at a learning rate of 0.001. The number of iterations is 200; each iteration performs one backpropagation training step of the conditional variational autoencoder and updates the noise-model parameters with equations (13), (14) and (15), where the number of samples is r = 1, i.e., a single output of the conditional variational autoencoder is substituted directly into the iteration. When the final speech estimate is obtained, the expectation is likewise computed from a single sample, i.e., equation (16) is rewritten as:
$\hat{s}_{ft} = \dfrac{a_t\,\sigma_f^2(z_t, c)}{a_t\,\sigma_f^2(z_t, c) + (WH)_{ft}}\, x_{ft}$ (18)
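In code, with the arrays of the earlier joint-iteration sketch, equation (18) is a single element-wise mask applied to the noisy complex spectrum:

```python
# Single-sample estimate (18): Wiener-like mask on the noisy complex STFT X.
# S2 = sigma^2(z_t, c) from the final encoder/decoder pass (joint-step sketch).
S_hat = (a * S2) / (a * S2 + W @ H) * X
```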
3. Specific implementation flow of the method
Referring to FIG. 1, the implementation of the method is broadly divided into a training phase and an enhancement phase.
In the training stage, the short-time Fourier transform magnitude spectra of the target speaker's clean speech, together with the speaker's identity vector, are input into the network for training. In the enhancement stage, the noisy speech of a given speaker is input; its short-time Fourier transform yields the magnitude spectrum; W and H are initialized with random numbers uniformly distributed between 0 and 1, and a is initialized to all ones. The noisy magnitude spectrum and the speaker's identity vector are input into the conditional variational autoencoder, the decoder weights are fixed, and iteration begins: 200 iterations are completed using equations (12), (13), (14) and (15). After convergence, the resulting conditional variational autoencoder, W, H and a are substituted into (18) to obtain the clean-speech time-frequency spectrum estimate, and the inverse short-time Fourier transform of that spectrum finally yields the enhanced speech.
To demonstrate the performance improvement of the present invention over existing methods, this embodiment is compared with the variational autoencoder plus non-negative matrix factorization algorithm of the literature (S. Leglaive, X. Alameda-Pineda, L. Girin and R. Horaud, "A Recurrent Variational Autoencoder for Speech Enhancement," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 371-375, doi: 10.1109/ICASSP40776.2020.9053164.), which omits the target-speaker condition term of the present invention. FIG. 3 shows a line-graph comparison of the enhancement performance of the two methods on the test samples described in point 1: the horizontal axis covers the 17 noise types, and the vertical axis is the average SDR of the enhanced speech under each noise condition. The pentagram curve (CVAE) represents the enhancement performance of the conditional variational autoencoder plus non-negative matrix factorization algorithm of the present invention, and the dot curve (VAE (amp)) represents that of the existing variational autoencoder plus non-negative matrix factorization algorithm. The method of the present invention improves the enhancement performance in all noise scenes compared with the existing algorithm, and the improvement is most pronounced exactly where the existing algorithm is weakest (e.g., the "meeting" and "cafeteria" noises in FIG. 3, i.e., scenes such as meeting rooms and coffee shops filled with interfering speakers). FIG. 4 is an example of speech enhancement under a two-speaker mixing condition, where the target speech is female and the interfering speech is male: (a) is the short-time magnitude spectrum of the mixed speech, (b) that of the speech enhanced by the existing algorithm, and (c) that of the speech enhanced by the method of the present invention. The existing algorithm retains almost all of the interfering male speech, whereas the method of the present invention removes it much more effectively.

Claims (4)

1. A target-speaker speech enhancement method based on a conditional variational autoencoder, characterized by comprising the following steps:
Step 1: apply the short-time Fourier transform to clean speech data of the target speaker to obtain short-time magnitude spectra;
Step 2: construct an identity encoding vector for the target speaker, and train a conditional variational autoencoder as the speech model using this identity encoding vector and the short-time magnitude spectra obtained in step 1; the inputs of the conditional variational autoencoder are the target speaker's speech magnitude spectrum and identity encoding vector, and its output is the logarithm of the target speaker's speech magnitude spectrum;
Step 3: apply the short-time Fourier transform to the noisy speech signal to obtain its short-time magnitude spectrum, and retain the phase spectrum of the noisy speech signal;
Step 4: input the short-time magnitude spectrum of the noisy speech signal obtained in step 3 into the speech model, with the target-speaker identity encoding vector as the conditional term of the speech model, and fix the weights of the speech-model decoder; jointly and iteratively optimize the speech model and the non-negative matrix factorization model to obtain magnitude-spectrum estimates of the speech and the noise;
Step 5: combine the magnitude-spectrum estimate obtained in step 4 with the phase spectrum of the noisy speech signal retained in step 3 into a complex spectrum, and obtain the enhanced time-domain speech signal via the inverse short-time Fourier transform.
2. The target-speaker speech enhancement method based on a conditional variational autoencoder according to claim 1, wherein in step 2 the conditional variational autoencoder uses deep neural networks as the encoder and the decoder; the encoder maps the speech magnitude spectrum to a random variable z, and the decoder maps the random variable z back to a clean-speech magnitude spectrum.
3. The target-speaker speech enhancement method based on a conditional variational autoencoder according to claim 1, wherein in step 4 the specific steps of the joint iterative optimization of the speech model and the non-negative matrix factorization model are as follows:
1) The encoder and decoder of the conditional variational autoencoder are expressed as follows:
$z_t \sim q_\phi(z_t \mid x_t, c)$
$x_t \sim p_\theta(x_t \mid z_t, c)$
where $x_t$ is the magnitude spectrum of the $t$-th frame of the input speech, $z_t$ is the latent variable output by the encoder for frame $t$, $c$ denotes the speaker identity vector, $\phi$ and $\theta$ denote the weights of the encoder and the decoder respectively, and $q_\phi$ and $p_\theta$ denote, respectively, the distribution of the latent variable generated by the encoder and the distribution of the magnitude-spectrum estimate generated by the decoder;
after the encoder and decoder are trained, the weights of the decoder $p_\theta(x_t \mid z_t, c)$ are fixed; during speech enhancement only the encoder weights are trained by backpropagation; the magnitude-spectrum estimate output by the speech model is $\sigma(z_t, c)$ and the corresponding power-spectrum estimate is $\sigma^2(z_t, c)$;
2) The non-negative matrix factorization is expressed in the form:
$V = WH$
where $V \in \mathbb{R}_+^{F \times T}$ is the short-time noise power-spectrum estimate over $F$ frequency bins and $T$ frames, $\mathbb{R}_+$ denotes the positive real number domain, and a matrix factorization algorithm decomposes $V$ into two non-negative low-rank matrices $W \in \mathbb{R}_+^{F \times K}$ and $H \in \mathbb{R}_+^{K \times T}$, where $K$, the rank of the two matrices after decomposition, is much smaller than $F$ and $T$;
3) During optimization, the noisy-speech magnitude spectrum $x_t$ and the target-speaker identity vector $c$ are input, and the non-negative matrix factorization parameters $W$, $H$ and the gain vector $a \in \mathbb{R}_+^{1 \times T}$ over the $T$ frames are initialized;
in each iteration, the following objective function is first optimized for the conditional variational autoencoder of step 1):
$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\big[\log p(x_t \mid z_t, c)\big] - D_{\mathrm{KL}}\big(q_\phi(z_t \mid x_t, c) \,\|\, p(z_t)\big)$
where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ denotes the K-L divergence between two distributions, $\mathbb{E}[\cdot]$ denotes expectation, and $p(z_t)$ is the probability density of the standard normal distribution;
the non-negative matrix factorization parameters $W$, $H$ and $a_t$ are then updated using the iterative formulas:
$H \leftarrow H \odot \left[\dfrac{W^{\mathsf T}\big(|X|^{\odot 2} \odot \sum_r (\hat{V}_x^{(r)})^{\odot -2}\big)}{W^{\mathsf T} \sum_r (\hat{V}_x^{(r)})^{\odot -1}}\right]^{\odot 1/2}$
$W \leftarrow W \odot \left[\dfrac{\big(|X|^{\odot 2} \odot \sum_r (\hat{V}_x^{(r)})^{\odot -2}\big) H^{\mathsf T}}{\big(\sum_r (\hat{V}_x^{(r)})^{\odot -1}\big) H^{\mathsf T}}\right]^{\odot 1/2}$
$a \leftarrow a \odot \left[\dfrac{\mathbf{1}^{\mathsf T}\big(|X|^{\odot 2} \odot \sum_r \hat{V}_s^{(r)} \odot (\hat{V}_x^{(r)})^{\odot -2}\big)}{\mathbf{1}^{\mathsf T} \sum_r \big(\hat{V}_s^{(r)} \odot (\hat{V}_x^{(r)})^{\odot -1}\big)}\right]^{\odot 1/2}$
where $X$ is the noisy complex spectrum, the $f$-th row, $t$-th column element of $\hat{V}_x^{(r)}$ is given by $a_t\,\sigma_f^2(z_t^{(r)}, c) + (WH)_{ft}$, the elements of $\hat{V}_s^{(r)}$ are $\sigma_f^2(z_t^{(r)}, c)$, $z_t^{(r)}$ denotes the $r$-th sample drawn from $q_\phi(z_t \mid x_t, c)$, $\odot$ denotes element-wise (Hadamard) multiplication and exponentiation, and $^{\mathsf T}$ denotes matrix transposition;
after a number of iterations, the resulting clean-speech estimate is expressed as:
$\hat{s}_{ft} = \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\!\left[\dfrac{a_t\,\sigma_f^2(z_t, c)}{a_t\,\sigma_f^2(z_t, c) + (WH)_{ft}}\right] x_{ft}$
where $x_{ft}$ and $\hat{s}_{ft}$ denote the $f$-th row, $t$-th column elements of the noisy-speech spectrum and of the clean-speech spectrum estimate respectively, and $(WH)_{ft}$ denotes the $f$-th row, $t$-th column element of the noise power-spectrum estimate.
4. The target-speaker speech enhancement method based on a conditional variational autoencoder according to claim 3, characterized in that in step 3) the objective function is optimized using the following expression:
$\mathcal{L}(\phi) = -\sum_{f,t} d_{\mathrm{IS}}\big(|x_{ft}|^2,\; a_t\,\sigma_f^2(z_t, c) + v_{ft}\big) + \frac{1}{2}\sum_{t,d}\big(\log \tilde{\sigma}_d^2(x_t, c) - \tilde{\mu}_d^2(x_t, c) - \tilde{\sigma}_d^2(x_t, c)\big)$
where $d_{\mathrm{IS}}(\cdot, \cdot)$ denotes the Itakura-Saito divergence between two distributions, $v_{ft}$ is the $f$-th row, $t$-th column element of the noise short-time power-spectrum estimate $V$, and $\tilde{\mu}(x_t, c)$ and $\tilde{\sigma}^2(x_t, c)$ are the mean and variance vectors of the latent variable output by the encoder.
CN202010557116.0A 2020-06-18 2020-06-18 Target-speaker speech enhancement method based on a conditional variational autoencoder Active CN111653288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010557116.0A CN111653288B (en) 2020-06-18 2020-06-18 Target-speaker speech enhancement method based on a conditional variational autoencoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010557116.0A CN111653288B (en) 2020-06-18 2020-06-18 Target-speaker speech enhancement method based on a conditional variational autoencoder

Publications (2)

Publication Number Publication Date
CN111653288A CN111653288A (en) 2020-09-11
CN111653288B true CN111653288B (en) 2023-05-09

Family

ID=72351639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010557116.0A Active CN111653288B (en) 2020-06-18 2020-06-18 Target-speaker speech enhancement method based on a conditional variational autoencoder

Country Status (1)

Country Link
CN (1) CN111653288B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112509593B (en) * 2020-11-17 2024-03-08 北京清微智能科技有限公司 Speech enhancement network model, single-channel speech enhancement method and system
US20220199102A1 (en) * 2020-12-18 2022-06-23 International Business Machines Corporation Speaker-specific voice amplification
CN112767959B (en) * 2020-12-31 2023-10-17 恒安嘉新(北京)科技股份公司 Voice enhancement method, device, equipment and medium
CN113571080A (en) * 2021-02-08 2021-10-29 腾讯科技(深圳)有限公司 Voice enhancement method, device, equipment and storage medium
CN113035217B (en) * 2021-03-01 2023-11-10 武汉大学 Voice enhancement method based on voiceprint embedding under low signal-to-noise ratio condition
CN113707167A (en) * 2021-08-31 2021-11-26 北京地平线信息技术有限公司 Training method and training device for residual echo suppression model
CN113823305A (en) * 2021-09-03 2021-12-21 深圳市芒果未来科技有限公司 Method and system for suppressing noise of metronome in audio
CN113936681B (en) * 2021-10-13 2024-04-09 东南大学 Speech enhancement method based on mask mapping and mixed cavity convolution network
CN114999508B (en) * 2022-07-29 2022-11-08 之江实验室 Universal voice enhancement method and device by utilizing multi-source auxiliary information
CN115116448B (en) * 2022-08-29 2022-11-15 四川启睿克科技有限公司 Voice extraction method, neural network model training method, device and storage medium
CN115588436A (en) * 2022-09-29 2023-01-10 沈阳新松机器人自动化股份有限公司 Voice enhancement method for generating countermeasure network based on variational self-encoder
CN115376501B (en) * 2022-10-26 2023-02-14 深圳市北科瑞讯信息技术有限公司 Voice enhancement method and device, storage medium and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9582753B2 (en) * 2014-07-30 2017-02-28 Mitsubishi Electric Research Laboratories, Inc. Neural networks for transforming signals
CN104751855A (en) * 2014-11-25 2015-07-01 北京理工大学 Speech enhancement method in music background based on non-negative matrix factorization
CN107749302A (en) * 2017-10-27 2018-03-02 广州酷狗计算机科技有限公司 Audio-frequency processing method, device, storage medium and terminal
CN107967920A (en) * 2017-11-23 2018-04-27 哈尔滨理工大学 A kind of improved own coding neutral net voice enhancement algorithm
CN110211575B (en) * 2019-06-13 2021-06-04 思必驰科技股份有限公司 Voice noise adding method and system for data enhancement

Also Published As

Publication number Publication date
CN111653288A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
CN111653288B (en) Target-speaker speech enhancement method based on a conditional variational autoencoder
CN109841226B (en) Single-channel real-time noise reduction method based on convolution recurrent neural network
CN107452389B (en) Universal single-track real-time noise reduction method
WO2020042706A1 (en) Deep learning-based acoustic echo cancellation method
CN110390950B (en) End-to-end voice enhancement method based on generation countermeasure network
CN110718232B (en) Speech enhancement method for generating countermeasure network based on two-dimensional spectrogram and condition
US6202047B1 (en) Method and apparatus for speech recognition using second order statistics and linear estimation of cepstral coefficients
CN110148420A (en) A kind of audio recognition method suitable under noise circumstance
CN110767244B (en) Speech enhancement method
Zhao et al. Late reverberation suppression using recurrent neural networks with long short-term memory
CN111968666B (en) Hearing aid voice enhancement method based on depth domain self-adaptive network
Mirsamadi et al. Causal speech enhancement combining data-driven learning and suppression rule estimation.
CN112735456A (en) Speech enhancement method based on DNN-CLSTM network
CN110660406A (en) Real-time voice noise reduction method of double-microphone mobile phone in close-range conversation scene
CN111816200B (en) Multi-channel speech enhancement method based on time-frequency domain binary mask
CN110998723B (en) Signal processing device using neural network, signal processing method, and recording medium
CN110808057A (en) Voice enhancement method for generating confrontation network based on constraint naive
CN112735460A (en) Beam forming method and system based on time-frequency masking value estimation
WO2019014890A1 (en) Universal single channel real-time noise-reduction method
CN113936681A (en) Voice enhancement method based on mask mapping and mixed hole convolution network
CN107045874A (en) A kind of Non-linear Speech Enhancement Method based on correlation
CN111899750A (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
Faubel et al. On expectation maximization based channel and noise estimation beyond the vector Taylor series expansion
Chen Noise reduction of bird calls based on a combination of spectral subtraction, Wiener filtering, and Kalman filtering
CN113035217A (en) Voice enhancement method based on voiceprint embedding under low signal-to-noise ratio condition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant