CN111653288B - Target-speaker speech enhancement method based on a conditional variational autoencoder - Google Patents


Info

Publication number
CN111653288B
Authority
CN
China
Prior art keywords
encoder
speech
voice
spectrum
noise
Prior art date
Legal status
Active
Application number
CN202010557116.0A
Other languages
Chinese (zh)
Other versions
CN111653288A (en)
Inventor
乐笑怀
卢晶
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202010557116.0A
Publication of CN111653288A
Application granted
Publication of CN111653288B


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10L 21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232: Processing in the frequency domain
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The invention discloses a target-speaker speech enhancement method based on a conditional variational autoencoder. The method comprises the following steps: (1) apply the short-time Fourier transform to clean speech data of the target speaker to obtain magnitude spectra; (2) train a conditional variational autoencoder as the speech model using the target speaker's clean-speech magnitude spectra and an identity encoding vector; (3) apply the short-time Fourier transform to the noisy speech signal to obtain its magnitude and phase spectra; (4) input the noisy magnitude spectrum and the target-speaker identity encoding vector into the speech model, fix the weights of the speech-model decoder, and jointly and iteratively optimize the speech model and a non-negative matrix factorization model to obtain magnitude-spectrum estimates of the speech and the noise; (5) combine the magnitude-spectrum estimate with the noisy-speech phase spectrum into a complex spectrum, then obtain the enhanced time-domain speech signal via the inverse short-time Fourier transform. The method can enhance the target speaker's speech under a variety of complex noises and is highly robust.

Description

Target-speaker speech enhancement method based on a conditional variational autoencoder
Technical Field
The invention belongs to the field of speech enhancement, and particularly relates to a target-speaker speech enhancement method based on a conditional variational autoencoder.
Background
When a microphone is used to collect the speech signal of a speaker in a real environment, various interfering signals such as background noise and room reverberation are collected at the same time. These interferences degrade the quality of the speech and, at low signal-to-noise ratios, severely degrade speech recognition accuracy. The technique of extracting the target speech from noise interference is called speech enhancement.
Spectral subtraction can be used to achieve speech enhancement (Boll, S.F. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech and Signal Processing, 27, 113-120.). In Chinese patent CN103594094A, the speech is transformed to the time-frequency domain by the short-time Fourier transform; an adaptive-threshold spectral subtraction then subtracts the estimated noise power spectrum from the power spectrum of the current frame of the speech signal to obtain the power spectrum of the enhanced signal, and the time-domain enhanced signal is finally obtained by the inverse short-time Fourier transform. However, because of the unreasonable assumptions it makes about speech and noise, this enhancement method significantly impairs speech quality.
Non-negative matrix factorization algorithms have also been used for speech enhancement (Wilson, K.W., Raj, B., Smaragdis, P., et al. Speech denoising using nonnegative matrix factorization with priors. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2008.). By applying non-negative matrix factorization separately to the short-time power spectra of speech and of noise, dictionaries of speech and noise can be obtained, and these dictionaries are then used during enhancement. Chinese patent CN104505100A uses a non-negative matrix factorization algorithm that combines spectral subtraction and the minimum mean square error for speech enhancement. However, non-negative matrix factorization models speech characteristics only linearly; it models the nonlinearity of speech poorly, which limits its performance.
Recently, various deep-learning-based generative models have been used for speech modeling. Among them, the variational autoencoder is a method that explicitly learns the data distribution and can model speech nonlinearly. The literature (S. Leglaive, L. Girin and R. Horaud, "A variance modeling framework based on variational autoencoders for speech enhancement," 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), Aalborg, 2018, pp. 1-6, doi: 10.1109/MLSP.2018.8516711.) uses a joint variational autoencoder and non-negative matrix factorization speech enhancement algorithm, in which the variational autoencoder model is trained in advance on clean-speech short-time power spectra and the non-negative matrix model is learned during enhancement; the algorithm enhances non-stationary noise well without damaging speech quality. However, because the variational autoencoder model is trained on generic clean speech, this enhancement model handles interference from other human voices poorly.
In practical applications the types of noise vary widely; besides non-human noise, extracting the target speaker's voice from interfering human voices is also of great significance.
Disclosure of Invention
In the prior art, when a variational autoencoder plus non-negative matrix factorization model is used for speech enhancement in an environment full of human-voice interference, the interfering voices are often retained, which degrades the enhancement. The invention provides a speech enhancement method based on a conditional variational autoencoder that effectively addresses the problem of human-voice interference and improves speech enhancement performance.
The invention adopts the following technical scheme:
A target-speaker speech enhancement method based on a conditional variational autoencoder, comprising the following steps:
Step 1: apply the short-time Fourier transform to clean speech data of the target speaker to obtain short-time magnitude spectra;
Step 2: construct an identity encoding vector for the target speaker, and train a conditional variational autoencoder as the speech model using this identity encoding vector and the short-time magnitude spectra obtained in step 1; the inputs of the conditional variational autoencoder are the target speaker's speech magnitude spectrum and identity encoding vector, and its output is the logarithm of the target speaker's speech magnitude spectrum;
Step 3: apply the short-time Fourier transform to the noisy speech signal to obtain its short-time magnitude spectrum, and retain the phase spectrum of the noisy speech signal;
Step 4: input the short-time magnitude spectrum of the noisy speech signal obtained in step 3 into the speech model, with the target-speaker identity encoding vector as the conditional term of the speech model, and fix the weights of the speech-model decoder; jointly and iteratively optimize the speech model and the non-negative matrix factorization model to obtain magnitude-spectrum estimates of the speech and the noise;
Step 5: combine the magnitude-spectrum estimate obtained in step 4 with the phase spectrum of the noisy speech signal retained in step 3 into a complex spectrum, and obtain the enhanced time-domain speech signal via the inverse short-time Fourier transform; a sketch of this pipeline follows.
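As an illustration only, the five steps map onto code roughly as follows. This is a minimal sketch assuming SciPy's STFT routines and the embodiment's 16 kHz / 1024-sample-window / 256-sample-hop settings given later; `enhance_magnitude` is a hypothetical placeholder standing in for the joint optimization of step 4, as are all other names here.

```python
# Minimal sketch of steps 1/3/5 (STFT analysis and synthesis). The joint
# CVAE-NMF optimization of step 4 is abstracted behind enhance_magnitude(),
# a hypothetical placeholder.
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, speaker_id_vec, fs=16000, n_fft=1024, hop=256):
    # Step 3: STFT of the noisy signal; keep magnitude and phase separately.
    _, _, X = stft(noisy, fs=fs, window="hann", nperseg=n_fft,
                   noverlap=n_fft - hop)
    mag, phase = np.abs(X), np.angle(X)
    # Step 4 (detailed later): joint CVAE-NMF iteration -> clean magnitude.
    mag_hat = enhance_magnitude(mag, speaker_id_vec)  # hypothetical placeholder
    # Step 5: recombine with the noisy phase and invert.
    _, s_hat = istft(mag_hat * np.exp(1j * phase), fs=fs, window="hann",
                     nperseg=n_fft, noverlap=n_fft - hop)
    return s_hat
```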
Further, in step 2, the conditional variational autoencoder uses deep neural networks as the encoder and the decoder; the encoder maps the speech magnitude spectrum to a random variable z, and the decoder maps the random variable z back to a clean-speech magnitude spectrum.
Further, in step 4, the specific steps of the joint iterative optimization of the speech model and the non-negative matrix factorization model are as follows:
1) The encoder and decoder of the conditional variational autoencoder can be expressed in the form:
$z_t \sim q_\phi(z_t \mid x_t, c)$
$x_t \sim p_\theta(x_t \mid z_t, c)$
where $x_t$ is the magnitude spectrum of the $t$-th frame of the input speech, $z_t$ is the latent variable output by the encoder for frame $t$, $c$ denotes the speaker identity vector, $\phi$ and $\theta$ denote the weights of the encoder and the decoder respectively, and $q_\phi$ and $p_\theta$ denote, respectively, the distribution of the latent variable generated by the encoder and the distribution of the magnitude-spectrum estimate generated by the decoder.
After the encoder and decoder are trained, the weights of the decoder $p_\theta(x_t \mid z_t, c)$ are fixed; during speech enhancement only the encoder weights are trained by backpropagation. The magnitude-spectrum estimate output by the speech model is $\sigma(z_t, c)$ and the corresponding power-spectrum estimate is $\sigma^2(z_t, c)$.
2) The non-negative matrix factorization may be expressed in the form:
$V = WH$
where $V \in \mathbb{R}_+^{F \times T}$ is the short-time noise power-spectrum estimate over $F$ frequency bins and $T$ frames, $\mathbb{R}_+$ denotes the positive real number domain, and a matrix factorization algorithm decomposes $V$ into two non-negative low-rank matrices $W \in \mathbb{R}_+^{F \times K}$ and $H \in \mathbb{R}_+^{K \times T}$, where $K$, the rank of the two matrices after decomposition, is much smaller than $F$ and $T$ (a stand-alone factorization sketch is given below);
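For orientation, a minimal stand-alone sketch of such an Itakura-Saito NMF factorization V ≈ WH follows. In the method itself, W and H are instead updated jointly with the speech model, as in step 3) below; the rank K = 10 matches the embodiment described later, everything else is illustrative.

```python
# Stand-alone sketch: factorize a non-negative power spectrogram V (F x T)
# into W (F x K) and H (K x T) with Itakura-Saito multiplicative updates.
import numpy as np

def is_nmf(V, K=10, n_iter=100, eps=1e-10):
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.uniform(0.1, 1.0, size=(F, K))
    H = rng.uniform(0.1, 1.0, size=(K, T))
    for _ in range(n_iter):
        Vh = W @ H + eps
        H *= ((W.T @ (V * Vh**-2)) / (W.T @ Vh**-1)) ** 0.5
        Vh = W @ H + eps
        W *= (((V * Vh**-2) @ H.T) / (Vh**-1 @ H.T)) ** 0.5
    return W, H
```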
3) During optimization, the noisy-speech magnitude spectrum $x_t$ and the target-speaker identity vector $c$ are input, and the non-negative matrix factorization parameters $W$, $H$ and the gain vector $a \in \mathbb{R}_+^{1 \times T}$ over the $T$ frames are initialized.
In each iteration, the following objective function is first optimized for the conditional variational autoencoder of step 1):
$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\big[\log p(x_t \mid z_t, c)\big] - D_{\mathrm{KL}}\big(q_\phi(z_t \mid x_t, c) \,\|\, p(z_t)\big)$
where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ denotes the K-L divergence between two distributions, $\mathbb{E}[\cdot]$ denotes expectation, and $p(z_t)$ is the probability density of the standard normal distribution;
the parameters W, H and a of the non-negative matrix factorization are then iterated using the iterative formula t
Figure BDA0002544713260000038
Figure BDA0002544713260000041
Figure BDA0002544713260000042
wherein ,
Figure BDA0002544713260000043
the f-th row t column element of (2) is represented by the formula +.>
Figure BDA0002544713260000044
Indicating (I)>
Figure BDA0002544713260000045
Figure BDA0002544713260000046
Represents the sum of q φ (z t |x t C) the sampled r sample; as indicated by matrix para-multiplication, T representing matrix transposition;
After a number of iterations, the resulting clean-speech estimate is expressed as:
$\hat{s}_{ft} = \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\!\left[\dfrac{a_t\,\sigma_f^2(z_t, c)}{a_t\,\sigma_f^2(z_t, c) + (WH)_{ft}}\right] x_{ft}$
where $x_{ft}$ and $\hat{s}_{ft}$ denote the $f$-th row, $t$-th column elements of the noisy-speech spectrum and of the clean-speech spectrum estimate respectively, and $(WH)_{ft}$ denotes the $f$-th row, $t$-th column element of the noise power-spectrum estimate.
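A compact sketch of one such joint iteration is given below, under the single-sample setting used later in the embodiment. `encoder`, `decoder_var` and `enc_opt` are hypothetical handles to the trained conditional variational autoencoder and its optimizer (the decoder's weights are assumed frozen), and the updates mirror the reconstructed formulas above.

```python
# Sketch of one joint iteration under the single-sample setting (r = 1).
# encoder/decoder_var are hypothetical handles to the trained CVAE; the
# decoder stays frozen and only the encoder receives gradient updates.
import numpy as np
import torch

def joint_step(P, c, W, H, a, encoder, decoder_var, enc_opt, eps=1e-10):
    """P = |X|^2 is the (F, T) noisy power spectrum; a is the (T,) gain."""
    t32 = lambda arr: torch.tensor(arr, dtype=torch.float32)
    # --- CVAE step: minimize IS reconstruction + KL w.r.t. encoder only ---
    mu, logvar = encoder(t32(P).T, c)              # per-frame mean / log-variance
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
    sig2 = decoder_var(z, c).T                     # sigma^2(z_t, c), shape (F, T)
    v_x = t32(a) * sig2 + t32(W @ H) + eps         # a_t sigma_f^2 + (WH)_ft
    ratio = t32(P) / v_x
    loss = (ratio - torch.log(ratio) - 1).sum() \
         + 0.5 * (mu**2 + logvar.exp() - logvar - 1).sum()
    enc_opt.zero_grad(); loss.backward(); enc_opt.step()
    # --- MM step: multiplicative updates with the sampled sigma^2 ---
    S2 = sig2.detach().numpy()
    Vx = a * S2 + W @ H + eps
    H *= ((W.T @ (P * Vx**-2)) / (W.T @ Vx**-1)) ** 0.5
    Vx = a * S2 + W @ H + eps
    W *= (((P * Vx**-2) @ H.T) / (Vx**-1 @ H.T)) ** 0.5
    Vx = a * S2 + W @ H + eps
    a *= ((S2 * P * Vx**-2).sum(axis=0) / (S2 * Vx**-1).sum(axis=0)) ** 0.5
    return W, H, a
```

After the last iteration, the clean-speech spectrum follows from the Wiener-like mask of the expression above: `S_hat = (a * S2) / (a * S2 + W @ H) * X`.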
Further, in step 3) the objective function is optimized using the following expression:
$\mathcal{L}(\phi) = -\sum_{f,t} d_{\mathrm{IS}}\big(|x_{ft}|^2,\; a_t\,\sigma_f^2(z_t, c) + v_{ft}\big) + \frac{1}{2}\sum_{t,d}\big(\log \tilde{\sigma}_d^2(x_t, c) - \tilde{\mu}_d^2(x_t, c) - \tilde{\sigma}_d^2(x_t, c)\big)$
where $d_{\mathrm{IS}}(\cdot, \cdot)$ denotes the Itakura-Saito divergence between two distributions, $v_{ft}$ is the $f$-th row, $t$-th column element of the noise short-time power-spectrum estimate $V$, and $\tilde{\mu}(x_t, c)$ and $\tilde{\sigma}^2(x_t, c)$ are the mean and variance vectors of the latent variable output by the encoder.
Compared with the prior art, the invention has the beneficial effect that speech enhancement can be performed in a variety of complex noise scenes; because target-speaker information is introduced into the training process, the method has a strong ability to suppress interference from non-target human voices.
Drawings
FIG. 1 is a process flow diagram of the target-speaker speech enhancement method based on a conditional variational autoencoder of the present invention.
FIG. 2 is a schematic diagram of the conditional variational autoencoder model employed in an embodiment of the present invention. The deep neural networks used in it are frame-independent fully connected networks; $|s_t|$ denotes the input clean-speech magnitude spectrum, $c$ denotes the one-hot identity vector of the speaker to whom the speech belongs, Embedding denotes the network layer that reduces the speaker identity vector to 10 dimensions, and $\sigma(z_t, c)$ denotes the magnitude spectrum of the output speech.
FIG. 3 compares the SDR values of speech enhanced by the prior-art variational autoencoder plus non-negative matrix factorization algorithm and by the method of the present invention under different noise types.
FIG. 4 compares the enhancement of target speech by the method of the present invention and by the existing algorithm based on the variational autoencoder plus non-negative matrix factorization model under a multi-speaker mixing condition: (a) is the short-time magnitude spectrum of the mixed speech, (b) that of the speech enhanced by the existing algorithm, and (c) that of the speech enhanced by the present method.
Detailed Description
The target-speaker speech enhancement method based on a conditional variational autoencoder of the present invention mainly comprises the following steps:
1. target person voice model training
1) Short-time Fourier transform of the target speaker's clean speech
Let the clean speech signal of the target speaker be $x(t)$. A short-time Fourier transform with an $N$-point FFT yields the complex spectrum $X = \{x_1, \ldots, x_T\}$ with $F = N/2 + 1$ frequency bins and $T$ frames, where $x_t \in \mathbb{C}^F$ and $|x_{ft}|$ denotes the magnitude of the $f$-th spectral component of the $t$-th frame.
2) Construction of the target-speaker identity vector
Given clean speech data from M speakers, each speaker's identity is marked as an M-dimensional one-hot vector: if a target speaker occupies the i-th position in the data set, the i-th component of the identity vector is 1 and all other components are 0, as in the snippet below.
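For example, with numpy (M = 629 matches the embodiment described later; the index is illustrative):

```python
import numpy as np

M, i = 629, 42          # number of speakers; position of the target speaker
c = np.zeros(M)
c[i] = 1.0              # M-dimensional one-hot identity vector
```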
3) Training of the conditional variational autoencoder
The conditional variational autoencoder model consists of an encoder and a decoder. The goal of the encoder is to map the speech magnitude spectrum $|x_t|$ to a random variable $z_t$, and the goal of the decoder is to map this random variable back to the speech magnitude spectrum; $z_t$ is generally assumed to follow a Gaussian distribution.
The models of the encoder and decoder can thus be expressed as:
$p(z_t) = \mathcal{N}(z_t;\, \mathbf{0}, \mathbf{I})$ (1)
$z_t \sim q_\phi(z_t \mid x_t, c)$ (2)
$x_t \sim p_\theta(x_t \mid z_t, c)$ (3)
where $c$ is the conditional term, i.e., the target-speaker identity vector from step 2); FIG. 2 shows how it is coupled to the encoder and decoder. In this embodiment, a neural network first reduces the M-dimensional identity vector to 10 dimensions, and this reduced output is concatenated with every hidden-layer output of the encoder and the decoder, as in the sketch below.
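A sketch of how such a conditional model could be written in PyTorch follows, using the dimensions of the embodiment (513 spectrum bins, one 512-unit tanh hidden layer, a 64-dimensional latent variable, a 10-dimensional speaker embedding). The class and layer names are illustrative, not the patent's own code.

```python
# Illustrative conditional VAE: the 10-dim speaker embedding is concatenated
# with the input of every layer of both the encoder and the decoder.
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, n_freq=513, n_hidden=512, n_latent=64,
                 n_spk=629, n_emb=10):
        super().__init__()
        self.emb = nn.Linear(n_spk, n_emb, bias=False)      # one-hot -> 10-dim
        self.enc_h = nn.Linear(n_freq + n_emb, n_hidden)
        self.enc_mu = nn.Linear(n_hidden + n_emb, n_latent)
        self.enc_logvar = nn.Linear(n_hidden + n_emb, n_latent)
        self.dec_h = nn.Linear(n_latent + n_emb, n_hidden)
        self.dec_out = nn.Linear(n_hidden + n_emb, n_freq)  # log magnitude

    def forward(self, mag, spk_onehot):
        e = self.emb(spk_onehot).expand(mag.shape[0], -1)
        h = torch.tanh(self.enc_h(torch.cat([mag, e], dim=-1)))
        mu = self.enc_mu(torch.cat([h, e], dim=-1))
        logvar = self.enc_logvar(torch.cat([h, e], dim=-1))
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # resampling
        h = torch.tanh(self.dec_h(torch.cat([z, e], dim=-1)))
        return self.dec_out(torch.cat([h, e], dim=-1)), mu, logvar
```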
The training goal of the conditional variational autoencoder is to maximize the likelihood of the decoder output; that is, the closer the speech spectrum output by the decoder is to the true speech spectrum, the better. Its objective function can therefore be written as the log-likelihood:
$\mathcal{L}(\theta) = \sum_t \log p_\theta(x_t \mid c)$ (4)
By variational inference, this objective can be decomposed into the following easier-to-compute form:
$\log p_\theta(x_t \mid c) \geq -D_{\mathrm{KL}}\big(q_\phi(z_t \mid x_t, c) \,\|\, p(z_t)\big) + \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\big[\log p_\theta(x_t \mid z_t, c)\big]$ (5)
where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ denotes the Kullback-Leibler (K-L) divergence between two distributions. The first term describes the K-L divergence between the latent variable output by the encoder and the standard normal distribution; specifically, the encoder outputs the mean and variance of $z_t$, and $z_t$ is then obtained using the resampling method shown in FIG. 2 and input to the decoder. The second term is the likelihood of inputting $x_t$ to the encoder to obtain the latent variable $z_t$ and inputting $z_t$ to the decoder to recover $x_t$; in the neural network it is realized by feeding $x$ into the network and making its output as close as possible to $x$, specifically by computing the Itakura-Saito (I-S) divergence of the network input and output:
$d_{\mathrm{IS}}(x, y) = \dfrac{x}{y} - \log\dfrac{x}{y} - 1$ (6)
The objective function can therefore be rewritten as:
$\mathcal{L}(\phi, \theta) = -\sum_{f,t} d_{\mathrm{IS}}\big(|x_{ft}|^2,\, \sigma_f^2(z_t, c)\big) + \frac{1}{2}\sum_{t,d}\big(\log \tilde{\sigma}_d^2(x_t, c) - \tilde{\mu}_d^2(x_t, c) - \tilde{\sigma}_d^2(x_t, c)\big)$ (7)
where $\tilde{\mu}(x_t, c)$ and $\tilde{\sigma}^2(x_t, c)$ are the mean and variance of $z_t$ output by the encoder network. Optimizing this model yields the target-speaker speech model. Speech is generally assumed to satisfy the following complex Gaussian distribution, which can be used as the speech model:
$s_{ft} \sim \mathcal{N}_c\big(0,\, \sigma_f^2(z_t, c)\big)$ (8)
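A hedged sketch of one training step built on the illustrative CVAE class above; the pairing of decoder output and variance follows the convention that the decoder emits the log magnitude spectrum, so its exponentiated square serves as the power-spectrum variance.

```python
# One illustrative training step minimizing the negative of objective (7).
import torch

def train_step(model, optimizer, mag, spk_onehot, eps=1e-10):
    log_mag_hat, mu, logvar = model(mag, spk_onehot)
    sig2 = torch.exp(log_mag_hat) ** 2 + eps       # sigma_f^2(z_t, c)
    ratio = (mag ** 2 + eps) / sig2
    is_div = (ratio - torch.log(ratio) - 1).sum()  # Itakura-Saito term of (7)
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum()  # KL(q || N(0, I))
    loss = is_div + kl
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return float(loss)
```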
2. speech enhancement using iterative algorithms
1) Short-time Fourier transform of the noisy speech signal
Let the noisy speech signal be $x(t)$; a short-time Fourier transform with an $N$-point FFT yields the complex spectrum $X = \{x_1, \ldots, x_T\}$.
2) Modeling noise with a non-negative matrix factorization model
Similarly to the speech model (8), the noise model can be described by the following distribution:
$n_{ft} \sim \mathcal{N}_c\big(0,\, (WH)_{ft}\big)$ (9)
where the matrix $V \in \mathbb{R}_+^{F \times T}$ is decomposed into the product of the following two low-rank matrices:
$V = WH$ (10)
Assuming the noise is additive, the noisy speech signal $x_{ft}$ can be expressed as:
$x_{ft} = \sqrt{a_t}\, s_{ft} + n_{ft}$ (11)
where $x_{ft}$, $s_{ft}$ and $n_{ft}$ denote the noisy-speech spectrum, the clean-speech spectrum and the noise spectrum respectively, and $a_t$ denotes the $t$-th element of the gain vector.
3) Iterative optimization
The parameters optimized in this embodiment are the noise-model parameters {W, H, a} and the speech-model parameters; the optimization objective is to maximize the likelihood of the noisy speech in (11), which can typically be done with the expectation-maximization (E-M) algorithm. For the speech model, it suffices to maximize the likelihood of the noisy speech, with the following objective function:
$\mathcal{L}(\phi) = -\sum_{f,t} d_{\mathrm{IS}}\big(|x_{ft}|^2,\; a_t\,\sigma_f^2(z_t, c) + v_{ft}\big) - \sum_t D_{\mathrm{KL}}\big(q_\phi(z_t \mid x_t, c) \,\|\, p(z_t)\big)$ (12)
Like the training objective (7), it computes the I-S divergence, here between the noisy speech power spectrum $|x_{ft}|^2$ and the estimated noisy speech power spectrum $a_t\,\sigma_f^2(z_t, c) + v_{ft}$; the optimization fixes the decoder weights $\theta$ and optimizes only $\phi$.
Since the trained decoder is already a suitable speech generation model, this embodiment fixes the decoder weights and optimizes only the encoder weights when optimizing the objective function. The noise model can be optimized with the Majorization-Minimization (MM) algorithm, specifically via the following iterative equations:
$H \leftarrow H \odot \left[\dfrac{W^{\mathsf T}\big(|X|^{\odot 2} \odot \sum_r (\hat{V}_x^{(r)})^{\odot -2}\big)}{W^{\mathsf T} \sum_r (\hat{V}_x^{(r)})^{\odot -1}}\right]^{\odot 1/2}$ (13)
$W \leftarrow W \odot \left[\dfrac{\big(|X|^{\odot 2} \odot \sum_r (\hat{V}_x^{(r)})^{\odot -2}\big) H^{\mathsf T}}{\big(\sum_r (\hat{V}_x^{(r)})^{\odot -1}\big) H^{\mathsf T}}\right]^{\odot 1/2}$ (14)
$a \leftarrow a \odot \left[\dfrac{\mathbf{1}^{\mathsf T}\big(|X|^{\odot 2} \odot \sum_r \hat{V}_s^{(r)} \odot (\hat{V}_x^{(r)})^{\odot -2}\big)}{\mathbf{1}^{\mathsf T} \sum_r \big(\hat{V}_s^{(r)} \odot (\hat{V}_x^{(r)})^{\odot -1}\big)}\right]^{\odot 1/2}$ (15)
where the $f$-th row, $t$-th column element of $\hat{V}_x^{(r)}$ can be represented by $a_t\,\sigma_f^2(z_t^{(r)}, c) + (WH)_{ft}$, the elements of $\hat{V}_s^{(r)}$ are $\sigma_f^2(z_t^{(r)}, c)$, $z_t^{(r)}$ denotes the $r$-th sample drawn from $q_\phi(z_t \mid x_t, c)$, and $\odot$ denotes element-wise (Hadamard) multiplication and exponentiation.
Through the above iterations, the final goal of this embodiment is to compute the clean-speech estimate, which can be expressed as the following expectation:
$\hat{s}_{ft} = \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\!\left[\dfrac{a_t\,\sigma_f^2(z_t, c)}{a_t\,\sigma_f^2(z_t, c) + (WH)_{ft}}\right] x_{ft}$ (16)
Examples
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings.
1. Training and testing samples and objective evaluation criteria
The training data and test samples of this embodiment are drawn from the TIMIT speech corpus, which contains speech data from 629 speakers. For each speaker, 7/8 of the data is taken as training data, and a 629-dimensional identity vector is constructed for each speaker. The test samples are formed by randomly mixing the remaining 1/8 of the TIMIT data with noises from 17 different scenes of the DEMAND noise corpus at signal-to-noise ratios between -5 dB and 5 dB, 850 samples in total.
The conditional variational autoencoder adopted in this embodiment is shown in FIG. 2. The encoder and the decoder are both frame-independent fully connected deep neural networks with hidden-layer dimension 512 and tanh activation functions. The encoder input is the short-time magnitude spectrum of clean speech, and its outputs are the mean and variance of $z_t$, which are combined into $z_t$ by the resampling method and fed to the decoder; the decoder output is the logarithm of the clean-speech short-time magnitude spectrum. The speaker identity vector is first reduced to 10 dimensions by a dimension-reduction (Embedding) layer and concatenated with the hidden-layer outputs of the encoder and decoder. During training, the clean-speech short-time magnitude spectra are shuffled along the time-frame direction before being fed into the network.
In this embodiment, the SDR (Signal-to-Distortion Ratio) is used as the objective index of enhancement performance. It describes the ratio of the target speech to the residual distortion in the enhanced speech and is calculated as follows:
$\mathrm{SDR} = 10 \log_{10} \dfrac{\|s_{\mathrm{target}}\|^2}{\|\hat{s} - s_{\mathrm{target}}\|^2}$ (17)
where $\hat{s}$ and $s$ denote the enhanced speech signal and the clean speech signal respectively, and $s_{\mathrm{target}} = \dfrac{\langle \hat{s}, s \rangle}{\|s\|^2}\, s$ is the energy-normalized projection of $\hat{s}$ onto $s$, on which the signal-to-noise ratio is then computed; the SDR can thus be regarded as an energy-scaled signal-to-noise ratio. In this embodiment, performance is evaluated by computing the SDR between the speech-enhanced test samples and the clean speech, as in the sketch below.
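A small numpy sketch of this computation (the standard energy-scaled projection; equal-length 1-D signals are assumed):

```python
import numpy as np

def sdr(s_hat, s):
    # Project the estimate onto the clean signal, then take the energy ratio.
    s_target = (np.dot(s_hat, s) / np.dot(s, s)) * s
    return 10.0 * np.log10(np.sum(s_target**2) / np.sum((s_hat - s_target)**2))
```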
2. Parameter setting
1) Short-time Fourier transform of the signal
In this embodiment, all signals are sampled at 16 kHz; the short-time Fourier transform uses a Hann window of length 1024 (64 ms) with a frame shift of 256 samples (16 ms).
2) Conditional variational autoencoder
The conditional variational autoencoder employed in this embodiment has input and output dimension 513, hidden-layer dimension 512, a latent variable $z_t$ of dimension 64, and a compressed speaker-identity-vector dimension of 10. Both the encoder and the decoder are fully connected networks with a single hidden layer. Training is optimized using the Adam optimizer with a learning rate of 0.001.
3) Non-negative matrix factorization
The rank of the non-negative matrix factorization used in this embodiment is 10.
4) Iteration parameters
During the iterations, the encoder of the conditional variational autoencoder is still optimized with the Adam optimizer at a learning rate of 0.001. The number of iterations is 200; each iteration performs one backpropagation training step of the conditional variational autoencoder and updates the noise-model parameters with equations (13), (14) and (15), where the number of samples is r = 1, i.e., a single output of the conditional variational autoencoder is substituted directly into the iteration. When the final speech estimate is obtained, the expectation is likewise computed from a single sample, i.e., equation (16) is rewritten as:
$\hat{s}_{ft} = \dfrac{a_t\,\sigma_f^2(z_t, c)}{a_t\,\sigma_f^2(z_t, c) + (WH)_{ft}}\, x_{ft}$ (18)
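In code, with the arrays of the earlier joint-iteration sketch, equation (18) is a single element-wise mask applied to the noisy complex spectrum:

```python
# Single-sample estimate (18): Wiener-like mask on the noisy complex STFT X.
# S2 = sigma^2(z_t, c) from the final encoder/decoder pass (joint-step sketch).
S_hat = (a * S2) / (a * S2 + W @ H) * X
```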
3. Specific implementation flow of the method
Referring to FIG. 1, the implementation of the method is broadly divided into a training phase and an enhancement phase.
In the training stage, the short-time Fourier transform magnitude spectra of the target speaker's clean speech, together with the speaker's identity vector, are input into the network for training. In the enhancement stage, the noisy speech of a given speaker is input; its short-time Fourier transform yields the magnitude spectrum; W and H are initialized with random numbers uniformly distributed between 0 and 1, and a is initialized to all ones. The noisy magnitude spectrum and the speaker's identity vector are input into the conditional variational autoencoder, the decoder weights are fixed, and iteration begins: 200 iterations are completed using equations (12), (13), (14) and (15). After convergence, the resulting conditional variational autoencoder, W, H and a are substituted into (18) to obtain the clean-speech time-frequency spectrum estimate, and the inverse short-time Fourier transform of that spectrum finally yields the enhanced speech.
To demonstrate the performance improvement of the present invention over existing methods, this embodiment is compared with the variational autoencoder plus non-negative matrix factorization algorithm of the literature (S. Leglaive, X. Alameda-Pineda, L. Girin and R. Horaud, "A Recurrent Variational Autoencoder for Speech Enhancement," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 371-375, doi: 10.1109/ICASSP40776.2020.9053164.), which omits the target-speaker condition term of the present invention. FIG. 3 shows a line-graph comparison of the enhancement performance of the two methods on the test samples described in point 1: the horizontal axis covers the 17 noise types, and the vertical axis is the average SDR of the enhanced speech under each noise condition. The pentagram curve (CVAE) represents the enhancement performance of the conditional variational autoencoder plus non-negative matrix factorization algorithm of the present invention, and the dot curve (VAE (amp)) represents that of the existing variational autoencoder plus non-negative matrix factorization algorithm. The method of the present invention improves the enhancement performance in all noise scenes compared with the existing algorithm, and the improvement is most pronounced exactly where the existing algorithm is weakest (e.g., the "meeting" and "cafeteria" noises in FIG. 3, i.e., scenes such as meeting rooms and coffee shops filled with interfering speakers). FIG. 4 is an example of speech enhancement under a two-speaker mixing condition, where the target speech is female and the interfering speech is male: (a) is the short-time magnitude spectrum of the mixed speech, (b) that of the speech enhanced by the existing algorithm, and (c) that of the speech enhanced by the method of the present invention. The existing algorithm retains almost all of the interfering male speech, whereas the method of the present invention removes it much more effectively.

Claims (4)

1. A target-speaker speech enhancement method based on a conditional variational autoencoder, characterized by comprising the following steps:
Step 1: apply the short-time Fourier transform to clean speech data of the target speaker to obtain short-time magnitude spectra;
Step 2: construct an identity encoding vector for the target speaker, and train a conditional variational autoencoder as the speech model using this identity encoding vector and the short-time magnitude spectra obtained in step 1; the inputs of the conditional variational autoencoder are the target speaker's speech magnitude spectrum and identity encoding vector, and its output is the logarithm of the target speaker's speech magnitude spectrum;
Step 3: apply the short-time Fourier transform to the noisy speech signal to obtain its short-time magnitude spectrum, and retain the phase spectrum of the noisy speech signal;
Step 4: input the short-time magnitude spectrum of the noisy speech signal obtained in step 3 into the speech model, with the target-speaker identity encoding vector as the conditional term of the speech model, and fix the weights of the speech-model decoder; jointly and iteratively optimize the speech model and the non-negative matrix factorization model to obtain magnitude-spectrum estimates of the speech and the noise;
Step 5: combine the magnitude-spectrum estimate obtained in step 4 with the phase spectrum of the noisy speech signal retained in step 3 into a complex spectrum, and obtain the enhanced time-domain speech signal via the inverse short-time Fourier transform.
2. The target-speaker speech enhancement method based on a conditional variational autoencoder according to claim 1, wherein in step 2 the conditional variational autoencoder uses deep neural networks as the encoder and the decoder; the encoder maps the speech magnitude spectrum to a random variable z, and the decoder maps the random variable z back to a clean-speech magnitude spectrum.
3. The target-speaker speech enhancement method based on a conditional variational autoencoder according to claim 1, wherein in step 4 the specific steps of the joint iterative optimization of the speech model and the non-negative matrix factorization model are as follows:
1) The encoder and decoder of the conditional variational autoencoder are expressed as follows:
$z_t \sim q_\phi(z_t \mid x_t, c)$
$x_t \sim p_\theta(x_t \mid z_t, c)$
where $x_t$ is the magnitude spectrum of the $t$-th frame of the input speech, $z_t$ is the latent variable output by the encoder for frame $t$, $c$ denotes the speaker identity vector, $\phi$ and $\theta$ denote the weights of the encoder and the decoder respectively, and $q_\phi$ and $p_\theta$ denote, respectively, the distribution of the latent variable generated by the encoder and the distribution of the magnitude-spectrum estimate generated by the decoder;
after the encoder and decoder are trained, the weights of the decoder $p_\theta(x_t \mid z_t, c)$ are fixed; during speech enhancement only the encoder weights are trained by backpropagation; the magnitude-spectrum estimate output by the speech model is $\sigma(z_t, c)$ and the corresponding power-spectrum estimate is $\sigma^2(z_t, c)$;
2) The non-negative matrix factorization is expressed in the form:
$V = WH$
where $V \in \mathbb{R}_+^{F \times T}$ is the short-time noise power-spectrum estimate over $F$ frequency bins and $T$ frames, $\mathbb{R}_+$ denotes the positive real number domain, and a matrix factorization algorithm decomposes $V$ into two non-negative low-rank matrices $W \in \mathbb{R}_+^{F \times K}$ and $H \in \mathbb{R}_+^{K \times T}$, where $K$, the rank of the two matrices after decomposition, is much smaller than $F$ and $T$;
3) During optimization, the noisy-speech magnitude spectrum $x_t$ and the target-speaker identity vector $c$ are input, and the non-negative matrix factorization parameters $W$, $H$ and the gain vector $a \in \mathbb{R}_+^{1 \times T}$ over the $T$ frames are initialized;
in each iteration, the following objective function is first optimized for the conditional variational autoencoder of step 1):
$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\big[\log p(x_t \mid z_t, c)\big] - D_{\mathrm{KL}}\big(q_\phi(z_t \mid x_t, c) \,\|\, p(z_t)\big)$
where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ denotes the K-L divergence between two distributions, $\mathbb{E}[\cdot]$ denotes expectation, and $p(z_t)$ is the probability density of the standard normal distribution;
the non-negative matrix factorization parameters $W$, $H$ and $a_t$ are then updated using the iterative formulas:
$H \leftarrow H \odot \left[\dfrac{W^{\mathsf T}\big(|X|^{\odot 2} \odot \sum_r (\hat{V}_x^{(r)})^{\odot -2}\big)}{W^{\mathsf T} \sum_r (\hat{V}_x^{(r)})^{\odot -1}}\right]^{\odot 1/2}$
$W \leftarrow W \odot \left[\dfrac{\big(|X|^{\odot 2} \odot \sum_r (\hat{V}_x^{(r)})^{\odot -2}\big) H^{\mathsf T}}{\big(\sum_r (\hat{V}_x^{(r)})^{\odot -1}\big) H^{\mathsf T}}\right]^{\odot 1/2}$
$a \leftarrow a \odot \left[\dfrac{\mathbf{1}^{\mathsf T}\big(|X|^{\odot 2} \odot \sum_r \hat{V}_s^{(r)} \odot (\hat{V}_x^{(r)})^{\odot -2}\big)}{\mathbf{1}^{\mathsf T} \sum_r \big(\hat{V}_s^{(r)} \odot (\hat{V}_x^{(r)})^{\odot -1}\big)}\right]^{\odot 1/2}$
where $X$ is the noisy complex spectrum, the $f$-th row, $t$-th column element of $\hat{V}_x^{(r)}$ is given by $a_t\,\sigma_f^2(z_t^{(r)}, c) + (WH)_{ft}$, the elements of $\hat{V}_s^{(r)}$ are $\sigma_f^2(z_t^{(r)}, c)$, $z_t^{(r)}$ denotes the $r$-th sample drawn from $q_\phi(z_t \mid x_t, c)$, $\odot$ denotes element-wise (Hadamard) multiplication and exponentiation, and $^{\mathsf T}$ denotes matrix transposition;
after a number of iterations, the resulting clean-speech estimate is expressed as:
$\hat{s}_{ft} = \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\!\left[\dfrac{a_t\,\sigma_f^2(z_t, c)}{a_t\,\sigma_f^2(z_t, c) + (WH)_{ft}}\right] x_{ft}$
where $x_{ft}$ and $\hat{s}_{ft}$ denote the $f$-th row, $t$-th column elements of the noisy-speech spectrum and of the clean-speech spectrum estimate respectively, and $(WH)_{ft}$ denotes the $f$-th row, $t$-th column element of the noise power-spectrum estimate.
4. The target-speaker speech enhancement method based on a conditional variational autoencoder according to claim 3, characterized in that in step 3) the objective function is optimized using the following expression:
$\mathcal{L}(\phi) = -\sum_{f,t} d_{\mathrm{IS}}\big(|x_{ft}|^2,\; a_t\,\sigma_f^2(z_t, c) + v_{ft}\big) + \frac{1}{2}\sum_{t,d}\big(\log \tilde{\sigma}_d^2(x_t, c) - \tilde{\mu}_d^2(x_t, c) - \tilde{\sigma}_d^2(x_t, c)\big)$
where $d_{\mathrm{IS}}(\cdot, \cdot)$ denotes the Itakura-Saito divergence between two distributions, $v_{ft}$ is the $f$-th row, $t$-th column element of the noise short-time power-spectrum estimate $V$, and $\tilde{\mu}(x_t, c)$ and $\tilde{\sigma}^2(x_t, c)$ are the mean and variance vectors of the latent variable output by the encoder.
CN202010557116.0A 2020-06-18 2020-06-18 Target-speaker speech enhancement method based on a conditional variational autoencoder Active CN111653288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010557116.0A CN111653288B (en) 2020-06-18 2020-06-18 Target-speaker speech enhancement method based on a conditional variational autoencoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010557116.0A CN111653288B (en) 2020-06-18 2020-06-18 Target-speaker speech enhancement method based on a conditional variational autoencoder

Publications (2)

Publication Number Publication Date
CN111653288A CN111653288A (en) 2020-09-11
CN111653288B true CN111653288B (en) 2023-05-09

Family

ID=72351639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010557116.0A Active CN111653288B (en) 2020-06-18 2020-06-18 Target-speaker speech enhancement method based on a conditional variational autoencoder

Country Status (1)

Country Link
CN (1) CN111653288B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112509593B (en) * 2020-11-17 2024-03-08 北京清微智能科技有限公司 Speech enhancement network model, single-channel speech enhancement method and system
US20220199102A1 (en) * 2020-12-18 2022-06-23 International Business Machines Corporation Speaker-specific voice amplification
CN112767959B (en) * 2020-12-31 2023-10-17 恒安嘉新(北京)科技股份公司 Voice enhancement method, device, equipment and medium
CN113571080A (en) * 2021-02-08 2021-10-29 腾讯科技(深圳)有限公司 Voice enhancement method, device, equipment and storage medium
CN113035217B (en) * 2021-03-01 2023-11-10 武汉大学 Voice enhancement method based on voiceprint embedding under low signal-to-noise ratio condition
CN113707167A (en) * 2021-08-31 2021-11-26 北京地平线信息技术有限公司 Training method and training device for residual echo suppression model
CN113823305A (en) * 2021-09-03 2021-12-21 深圳市芒果未来科技有限公司 Method and system for suppressing noise of metronome in audio
CN113936681B (en) * 2021-10-13 2024-04-09 东南大学 Speech enhancement method based on mask mapping and mixed cavity convolution network
CN114999508B (en) * 2022-07-29 2022-11-08 之江实验室 Universal voice enhancement method and device by utilizing multi-source auxiliary information
CN115116448B (en) * 2022-08-29 2022-11-15 四川启睿克科技有限公司 Voice extraction method, neural network model training method, device and storage medium
CN115588436A (en) * 2022-09-29 2023-01-10 沈阳新松机器人自动化股份有限公司 Voice enhancement method for generating countermeasure network based on variational self-encoder
CN115376501B (en) * 2022-10-26 2023-02-14 深圳市北科瑞讯信息技术有限公司 Voice enhancement method and device, storage medium and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9582753B2 (en) * 2014-07-30 2017-02-28 Mitsubishi Electric Research Laboratories, Inc. Neural networks for transforming signals
CN104751855A (en) * 2014-11-25 2015-07-01 北京理工大学 Speech enhancement method in music background based on non-negative matrix factorization
CN107749302A (en) * 2017-10-27 2018-03-02 广州酷狗计算机科技有限公司 Audio-frequency processing method, device, storage medium and terminal
CN107967920A (en) * 2017-11-23 2018-04-27 哈尔滨理工大学 A kind of improved own coding neutral net voice enhancement algorithm
CN110211575B (en) * 2019-06-13 2021-06-04 思必驰科技股份有限公司 Voice noise adding method and system for data enhancement

Also Published As

Publication number Publication date
CN111653288A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
CN111653288B (en) Target-speaker speech enhancement method based on a conditional variational autoencoder
CN109841226B (en) Single-channel real-time noise reduction method based on convolution recurrent neural network
CN107452389B (en) Universal single-track real-time noise reduction method
WO2020042706A1 (en) Deep learning-based acoustic echo cancellation method
CN110390950B (en) End-to-end voice enhancement method based on generation countermeasure network
CN110718232B (en) Speech enhancement method for generating countermeasure network based on two-dimensional spectrogram and condition
US6202047B1 (en) Method and apparatus for speech recognition using second order statistics and linear estimation of cepstral coefficients
CN110148420A (en) A kind of audio recognition method suitable under noise circumstance
CN110767244B (en) Speech enhancement method
Zhao et al. Late reverberation suppression using recurrent neural networks with long short-term memory
CN111968666B (en) Hearing aid voice enhancement method based on depth domain self-adaptive network
Mirsamadi et al. Causal speech enhancement combining data-driven learning and suppression rule estimation.
CN112735456A (en) Speech enhancement method based on DNN-CLSTM network
CN110660406A (en) Real-time voice noise reduction method of double-microphone mobile phone in close-range conversation scene
CN111816200B (en) Multi-channel speech enhancement method based on time-frequency domain binary mask
CN110998723B (en) Signal processing device using neural network, signal processing method, and recording medium
CN110808057A (en) Voice enhancement method for generating confrontation network based on constraint naive
CN112735460A (en) Beam forming method and system based on time-frequency masking value estimation
WO2019014890A1 (en) Universal single channel real-time noise-reduction method
CN113936681A (en) Voice enhancement method based on mask mapping and mixed hole convolution network
CN107045874A (en) A kind of Non-linear Speech Enhancement Method based on correlation
CN111899750A (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
Faubel et al. On expectation maximization based channel and noise estimation beyond the vector Taylor series expansion
Chen Noise reduction of bird calls based on a combination of spectral subtraction, Wiener filtering, and Kalman filtering
CN113035217A (en) Voice enhancement method based on voiceprint embedding under low signal-to-noise ratio condition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant