CN111653288B - Target person voice enhancement method based on conditional variation self-encoder - Google Patents
- Publication number: CN111653288B
- Application number: CN202010557116.0A
- Authority: CN (China)
- Legal status: Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
Abstract
The invention discloses a target speaker speech enhancement method based on a conditional variational autoencoder. The method comprises the following steps: (1) performing a short-time Fourier transform on clean speech data of the target speaker to obtain a magnitude spectrum; (2) training a conditional variational autoencoder as a speech model using the target speaker's clean speech magnitude spectrum and an identity encoding vector; (3) performing a short-time Fourier transform on the noisy speech signal to obtain a magnitude spectrum and a phase spectrum; (4) inputting the noisy speech magnitude spectrum and the target speaker identity encoding vector into the speech model, fixing the weights of the speech model's decoder, and carrying out joint iterative optimization of the speech model and a non-negative matrix factorization model to obtain magnitude spectrum estimates of the speech and the noise; (5) combining the magnitude spectrum estimates with the noisy speech phase spectrum into a complex spectrum, then obtaining the enhanced time-domain speech signal through an inverse short-time Fourier transform. The method can enhance the target speaker's speech under a variety of complex noises and has high robustness.
Description
Technical Field
The invention belongs to the field of speech enhancement, and particularly relates to a target speaker speech enhancement method based on a conditional variational autoencoder.
Background
When a microphone is used to collect a speaker's speech signal in a real environment, various interfering signals — background noise, room reverberation and the like — are captured at the same time. These disturbances degrade speech quality and, at low signal-to-noise ratios, severely degrade speech recognition accuracy. The technique of extracting the target speech from noise interference is called speech enhancement.
Spectral subtraction can be used to achieve speech enhancement (Boll, S.F. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech and Signal Processing, 27, 113-120.). In Chinese patent CN103594094A, speech is transformed to the time-frequency domain by the short-time Fourier transform; an estimated noise power spectrum is then subtracted from the power spectrum of the current frame's speech signal using adaptive-threshold spectral subtraction to obtain the power spectrum of the enhanced signal; finally the time-domain enhanced signal is obtained by the inverse short-time Fourier transform. However, this enhancement method causes considerable impairment of speech quality due to the unreasonable assumptions it makes about speech and noise.
Non-negative matrix factorization algorithms have also been used for speech enhancement (Wilson K.W., Raj B., Smaragdis P., et al. Speech denoising using nonnegative matrix factorization with priors [C]. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2008.). By separately applying non-negative matrix factorization to the short-time power spectra of speech and noise, dictionaries of speech and noise can be learned and then used during enhancement. Chinese patent CN104505100A uses a non-negative matrix factorization algorithm that combines spectral subtraction and minimum mean square error for speech enhancement. However, non-negative matrix factorization models only the linear structure of speech characteristics; its poor ability to model the non-linearity of speech limits its performance.
Recently, various deep-learning-based generative models have been used for speech modeling. Among them, the variational autoencoder is a method that explicitly learns the data distribution and can model speech non-linearly. The literature (S. Leglaive, L. Girin and R. Horaud, "A variance modeling framework based on variational autoencoders for speech enhancement," 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), Aalborg, 2018, pp. 1-6, doi: 10.1109/MLSP.2018.8516711.) uses a joint variational autoencoder and non-negative matrix factorization speech enhancement algorithm, in which the variational autoencoder model is trained in advance on clean speech short-time power spectra while the non-negative matrix model is learned during enhancement; the algorithm achieves a good enhancement effect on non-stationary noise without damaging speech quality. However, because the variational autoencoder model is trained on clean speech of arbitrary speakers, this enhancement model performs poorly against interfering human voices.
In practical applications noise types vary widely; besides non-human noise, extracting the target speaker's voice from interfering human voices is also of great significance.
Disclosure of Invention
In the prior art, when a variational autoencoder plus non-negative matrix factorization model is used for speech enhancement in an environment full of interfering human voices, the interfering voices are often retained, degrading the enhancement result. The invention provides a speech enhancement method based on a conditional variational autoencoder, which can effectively resist human-voice interference and improve speech enhancement performance.
The invention adopts the technical scheme that:
a method for target human speech enhancement based on a conditional variation self-encoder, comprising the steps of:
step 1, performing short-time Fourier transform on clear voice data of a target speaker to obtain a short-time amplitude spectrum;
step 2, constructing an identity coding vector of the target speaker, and training a condition variation self-encoder as a voice model by using the identity coding vector and the short-time amplitude spectrum obtained in the step 1; the input of the condition variation self-encoder is the voice amplitude spectrum of the target speaker and the identity coding vector thereof, and the output is the logarithm of the voice amplitude spectrum of the target speaker;
step 3, performing short-time Fourier transform on the noise-containing voice signal to obtain a short-time amplitude spectrum, and reserving a phase spectrum of the noise-containing voice signal;
step 4, inputting the short-time amplitude spectrum of the noisy speech signal obtained in the step 3 into the speech model, taking the target speaker identity coding vector as a speech model condition item, and fixing the weight of a decoder of the speech model; the method comprises the steps of performing joint iterative optimization on a voice model and a non-negative matrix factorization model to obtain amplitude spectrum estimation of voice and noise;
and 5, combining the amplitude spectrum estimation obtained in the step 4 and the phase spectrum of the noise-containing voice signal reserved in the step 3 into a complex spectrum, and obtaining an enhanced voice time domain signal through inverse short time Fourier transform.
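Step 5 amounts to re-attaching the noisy phase to the estimated magnitude before inversion. A minimal numpy sketch of that recombination (the toy spectrogram below is illustrative, not the embodiment's data):

```python
import numpy as np

def recombine(mag_est, noisy_phase):
    """Combine a magnitude-spectrum estimate with the noisy phase
    spectrum into a complex spectrum (step 5, before the inverse STFT)."""
    return mag_est * np.exp(1j * noisy_phase)

# Toy complex spectrogram standing in for the noisy-speech STFT.
rng = np.random.default_rng(0)
noisy = rng.standard_normal((5, 4)) + 1j * rng.standard_normal((5, 4))
phase = np.angle(noisy)

# If the "estimated" magnitude equals the noisy magnitude,
# recombination reproduces the noisy spectrum exactly.
rebuilt = recombine(np.abs(noisy), phase)
assert np.allclose(rebuilt, noisy)
```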
Further, in step 2, the conditional variational autoencoder uses deep neural networks as the encoder and the decoder; the encoder maps the speech magnitude spectrum to a random variable z, and the decoder maps the random variable z back to clean speech.
Further, in the step 4, the specific steps of joint iterative optimization of the speech model and the non-negative matrix factorization model are as follows:
1) The encoder and decoder of the conditional variational autoencoder can be expressed in the form:

z_t ~ q_φ(z_t | x_t, c)

x_t ~ p_θ(x_t | z_t, c)

where x_t is the magnitude spectrum of the t-th frame of the input speech, z_t is the hidden variable of the t-th frame output by the encoder, c denotes the speaker identity vector, φ and θ denote the weights of the encoder and decoder respectively, and q_φ and p_θ denote, respectively, the distribution of hidden variables generated by the encoder and the distribution of the speech magnitude spectrum estimate generated by the decoder;

after the encoder and decoder are trained, the weights of the decoder p_θ(x_t | z_t, c) are fixed, and only the encoder weights are trained by back-propagation during speech enhancement; the speech magnitude spectrum output by the speech model is estimated as σ(z_t, c) and the power spectrum estimate is σ²(z_t, c);
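As a concrete illustration of q_φ and the reparameterization step, here is a minimal numpy sketch of a one-hidden-layer conditional encoder; the random weights, layer sizes and toy identity embedding are illustrative stand-ins, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(42)
F, C, H, Z = 513, 10, 512, 64   # spectrum dim, identity dim, hidden dim, latent dim

# Illustrative random weights standing in for a trained encoder q_phi.
W1 = rng.standard_normal((F + C, H)) * 0.01
W_mu = rng.standard_normal((H, Z)) * 0.01
W_logvar = rng.standard_normal((H, Z)) * 0.01

def encode(x_t, c, eps=None):
    """Map a magnitude-spectrum frame x_t and identity vector c to a
    latent sample z_t via the reparameterization (resampling) trick."""
    h = np.tanh(np.concatenate([x_t, c]) @ W1)
    mu, logvar = h @ W_mu, h @ W_logvar
    if eps is None:
        eps = rng.standard_normal(Z)
    return mu + np.exp(0.5 * logvar) * eps, mu, logvar

x_t = np.abs(rng.standard_normal(F))   # toy magnitude-spectrum frame
c = np.eye(C)[3]                       # toy identity embedding
z, mu, logvar = encode(x_t, c, eps=np.zeros(Z))
assert z.shape == (Z,) and np.allclose(z, mu)  # zero noise -> posterior mean
```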
2) The non-negative matrix factorization can be expressed in the form:

V = WH

where V ∈ R₊^{F×T} denotes the short-time power spectrum estimate of the noise over F dimensions and T frames, R₊ denotes the non-negative real domain, and a matrix factorization algorithm decomposes V into two non-negative low-rank matrices W ∈ R₊^{F×K} and H ∈ R₊^{K×T}, where K, the rank of the two factor matrices, is much smaller than F and T;
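The factorization itself can be sketched with the standard multiplicative updates for the Itakura-Saito divergence (the toy spectrogram, rank and iteration count below are illustrative; the embodiment later fixes the rank at 10):

```python
import numpy as np

def is_div(V, Vhat):
    """Itakura-Saito divergence between two power spectrograms."""
    R = V / Vhat
    return float(np.sum(R - np.log(R) - 1.0))

def is_nmf(V, K, n_iter=100, seed=0):
    """Factor V ≈ WH with multiplicative updates for the IS divergence;
    returns the non-negative factors and the loss history."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.uniform(0.5, 1.5, (F, K))
    H = rng.uniform(0.5, 1.5, (K, T))
    losses = [is_div(V, W @ H)]
    for _ in range(n_iter):
        Vhat = W @ H
        W *= ((V * Vhat**-2) @ H.T) / (Vhat**-1 @ H.T)
        Vhat = W @ H
        H *= (W.T @ (V * Vhat**-2)) / (W.T @ Vhat**-1)
        losses.append(is_div(V, W @ H))
    return W, H, losses

rng = np.random.default_rng(1)
V = rng.uniform(0.1, 2.0, (20, 30))    # toy noise power spectrogram
W, H, losses = is_nmf(V, K=3)
assert losses[-1] < losses[0]          # the updates improve the fit
assert np.all(W > 0) and np.all(H > 0)
```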
3) During optimization, the noisy speech magnitude spectrum x_t and the target speaker identity vector c are input; the non-negative matrix factorization parameters W and H and the 1×T-dimensional gain vector a ∈ R₊^{1×T} are initialized. In each iteration, the following objective function is first optimized for the conditional variational autoencoder of step 1):

L(φ) = Σ_t { D_KL( q_φ(z_t | x_t, c) ‖ p(z_t) ) − E_{q_φ(z_t|x_t,c)}[ log p_θ(x_t | z_t, c) ] }

where D_KL(·‖·) denotes the K-L divergence between two distributions, E[·] denotes the expectation, and p(z_t) denotes the probability density of the standard normal distribution;
the parameters W, H and a of the non-negative matrix factorization are then iterated using the iterative formula t :
wherein ,the f-th row t column element of (2) is represented by the formula +.>Indicating (I)> Represents the sum of q φ (z t |x t C) the sampled r sample; as indicated by matrix para-multiplication, T representing matrix transposition;
after a plurality of iterations, the clear speech estimate obtained is expressed as:
wherein ,xft Andf-line t-column elements respectively representing noisy speech spectrum and clear speech spectrum estimates, (W) k H k ) ft Elements of f rows and t columns representing the noise power spectrum estimate.
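This final estimate is a Wiener-type soft mask applied to the noisy spectrum. A small numpy sketch, where all quantities are toy stand-ins for the trained model's outputs:

```python
import numpy as np

def wiener_estimate(x, speech_var, noise_var, gain):
    """Clean-speech spectrum estimate: scale each noisy coefficient by the
    ratio of (gained) speech power to total power, per time-frequency bin."""
    mask = gain * speech_var / (gain * speech_var + noise_var)
    return mask * x, mask

rng = np.random.default_rng(7)
F, T = 6, 4
x = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))  # noisy STFT
speech_var = rng.uniform(0.1, 1.0, (F, T))   # sigma^2_f(z_t, c), toy values
noise_var = rng.uniform(0.1, 1.0, (F, T))    # (WH)_ft, toy values
gain = rng.uniform(0.5, 2.0, (1, T))         # gain vector a_t

s_hat, mask = wiener_estimate(x, speech_var, noise_var, gain)
assert np.all((mask > 0) & (mask < 1))       # a soft mask strictly between 0 and 1
assert np.allclose(np.abs(s_hat), mask * np.abs(x))
```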
Further, in step 3), the objective function is optimized using the following equation:

L(φ) = Σ_t { Σ_f d_IS( |x_ft|² ; a_t σ_f²(z_t, c) + v_ft ) + D_KL( q_φ(z_t | x_t, c) ‖ p(z_t) ) }

where d_IS(x; y) = x/y − log(x/y) − 1 denotes the Itakura-Saito divergence between two distributions, v_ft denotes the f-th row, t-th column element of the noise short-time power spectrum estimate V, and μ_φ(x_t, c) and σ_φ²(x_t, c) denote the mean and variance vectors of the hidden variable output by the encoder.
Compared with the prior art, the invention has the following beneficial effects: the method can perform speech enhancement in a variety of complex noise scenes, and because target speaker information is introduced into the training process, it has a strong ability to eliminate interference from non-target human voices.
Drawings
FIG. 1 is a process flow diagram of the target speaker speech enhancement method based on a conditional variational autoencoder of the present invention.
FIG. 2 is a schematic diagram of the variational autoencoder model employed in an embodiment of the present invention. The deep neural network used is a frame-independent fully connected network; |s_t| denotes the input clean speech magnitude spectrum, c denotes the one-hot identity vector of the speaker to whom the speech belongs, Embedding denotes the network that reduces the speaker identity vector to 10 dimensions, and σ(z_t, c) denotes the magnitude spectrum of the output speech.
Fig. 3 compares the SDR values of speech enhanced by the prior variational autoencoder plus non-negative matrix factorization algorithm and by the method of the present invention under different noise types.
FIG. 4 compares the enhancement of target speech by the method of the present invention and by the existing algorithm based on the variational autoencoder plus non-negative matrix factorization model under a multi-speaker mixing condition: (a) is the mixed speech short-time magnitude spectrum, (b) is the short-time magnitude spectrum of speech enhanced by the existing algorithm, and (c) is the short-time magnitude spectrum of speech enhanced by the present method.
Detailed Description
The target speaker speech enhancement method based on a conditional variational autoencoder of the invention mainly comprises the following steps:
1. target person voice model training
1) Performing short-time Fourier transform on clear voice signal of target person
Let the clean speech signal of the target speaker be x(t). Performing a short-time Fourier transform with an N-point FFT yields the complex spectrum X = {x_1, ..., x_T} of F dimensions (F = N/2 + 1) and T frames, where |x_ft| denotes the magnitude of the f-th spectral component of the t-th frame.
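A minimal numpy sketch of this analysis step, using a Hann window; the window length and hop size here are illustrative (the embodiment's own parameters appear later):

```python
import numpy as np

def stft_mag(x, n_fft=8, hop=4):
    """Frame the signal, window each frame, and take an N-point real FFT;
    returns an (F, T) magnitude spectrogram with F = n_fft//2 + 1."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[t * hop: t * hop + n_fft] * window
                       for t in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=n_fft)).T   # (F, T)

x = np.sin(2 * np.pi * np.arange(64) / 8.0)  # toy sinusoid
X = stft_mag(x)
F, T = X.shape
assert F == 8 // 2 + 1          # F = N/2 + 1
assert T == 1 + (64 - 8) // 4   # number of frames
```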
2) Construction of the target speaker identity vector
Suppose there are clean speech data from M speakers; the identity of each speaker is marked as an M-dimensional one-hot vector. If a given target speaker occupies the i-th position in the data set, the i-th dimension of its identity vector is 1 and all other dimensions are 0.
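For example, such a one-hot identity vector can be built in a few lines (M and the speaker index below are illustrative):

```python
import numpy as np

def one_hot(i, M):
    """M-dimensional identity vector with a 1 at position i (0-based)."""
    c = np.zeros(M)
    c[i] = 1.0
    return c

c = one_hot(2, 5)
assert c.tolist() == [0.0, 0.0, 1.0, 0.0, 0.0]
assert c.sum() == 1.0
```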
3) Training of the conditional variational autoencoder
The conditional variational autoencoder model consists of an Encoder and a Decoder. The goal of the encoder is to map the speech magnitude spectrum |x_t| to a random variable z_t, while the goal of the decoder is to map this random variable back to the speech magnitude spectrum; the random variable is generally assumed to satisfy a Gaussian distribution.
The models of the encoder and decoder can thus be expressed as:

z_t ~ q_φ(z_t | x_t, c)    (2)

x_t ~ p_θ(x_t | z_t, c)    (3)
where c is the condition term, i.e. the target speaker identity vector of step 2); its coupling to the encoder and decoder can be seen in fig. 2. In this embodiment, a neural network first reduces the M-dimensional identity vector to 10 dimensions, and the reduced output is concatenated with each hidden-layer output of the encoder and decoder.
The training goal of the conditional variational autoencoder is to maximize the likelihood of the decoder output, i.e. the closer the speech spectrum output by the decoder is to the true speech spectrum, the better. Its objective function can therefore be written as the log-likelihood:

log p_θ(x_t | c)    (4)

By variational inference, this objective can be decomposed into the following easier-to-compute form, minimized over φ and θ:

L(φ, θ) = D_KL( q_φ(z_t | x_t, c) ‖ p(z_t) ) − E_{q_φ(z_t|x_t,c)}[ log p_θ(x_t | z_t, c) ]    (5)

where D_KL(·‖·) denotes the Kullback-Leibler (K-L) divergence between two distributions. The first term describes the K-L divergence between the hidden variable output by the encoder and the standard normal distribution; concretely, the encoder outputs the mean and variance of z_t, and z_t is then obtained with the resampling (reparameterization) method shown in fig. 2 and input to the decoder. The second term corresponds to feeding x_t into the encoder to obtain the hidden variable z_t, feeding z_t into the decoder to recover x_t, and requiring the network output to be as close as possible to the network input; concretely, it computes the Itakura-Saito (I-S) divergence between the network input and output:

d_IS(x; y) = x/y − log(x/y) − 1    (6)
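The I-S divergence of equation (6) is easy to state in code; a short numpy sketch:

```python
import numpy as np

def d_is(x, y):
    """Itakura-Saito divergence d_IS(x; y) = x/y - log(x/y) - 1,
    summed over all (frequency) components."""
    r = np.asarray(x) / np.asarray(y)
    return float(np.sum(r - np.log(r) - 1.0))

x = np.array([1.0, 2.0, 0.5])
assert d_is(x, x) == 0.0       # zero iff the two spectra match
assert d_is(x, 2 * x) > 0.0    # positive otherwise
```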
the objective function can therefore be rewritten as:
wherein Encoder network output z, respectively t Mean and variance of (c). And optimizing the model to obtain the target human voice model. It is generally assumed that speech satisfies the following complex gaussian distribution, which can be used as a speech model:
2. speech enhancement using iterative algorithms
1) Performing short-time Fourier transform on noise-containing voice signal
Let the noisy speech signal be x(t); performing a short-time Fourier transform with an N-point FFT yields the complex spectrum X = {x_1, ..., x_T}.
2) Modeling noise using non-negative matrix factorization model
Similarly to the speech model (8), the noise model can also be described in the following distribution form:

n_ft ~ N_c( 0, (WH)_ft )    (9)

where the matrix V ∈ R₊^{F×T} is decomposed into the product of the following two low-rank matrices:

V = WH    (10)
Assuming the noise is additive, the noisy speech signal x_ft can be expressed as:

x_ft = √(a_t) · s_ft + n_ft    (11)

where x_ft, s_ft and n_ft denote the noisy speech spectrum, the clean speech spectrum and the noise spectrum respectively, and a_t denotes the t-th element of the gain vector.
3) Iterative optimization
The parameters optimized in this embodiment are the noise model parameters {W, H, a} and the speech model (encoder) parameters; the optimization objective is to maximize the likelihood of the noisy speech under equation (11), which can typically be done in the style of the expectation-maximization (E-M) algorithm. For the speech model, maximizing the noisy speech likelihood gives the objective function:

L(φ) = Σ_t { Σ_f d_IS( |x_ft|² ; a_t σ_f²(z_t, c) + v_ft ) + D_KL( q_φ(z_t | x_t, c) ‖ p(z_t) ) }    (12)

Analogously to the training objective (7), the I-S divergence is computed between the noisy speech power spectrum |x_ft|² and the estimated noisy speech power spectrum a_t σ_f²(z_t, c) + v_ft; the optimization process fixes the decoder weights θ and optimizes only φ.
Since the decoder is already a suitable speech generation model, this embodiment fixes the decoder weights and optimizes only the encoder weights when optimizing the objective function. The optimization of the noise model can use the Majorization-Minimization (MM) algorithm, accomplished by the following iterative equations:

W ← W ⊙ [ (|X|^⊙2 ⊙ V_x^⊙−2) H^T ] ⊘ [ V_x^⊙−1 H^T ]    (13)

H ← H ⊙ [ W^T (|X|^⊙2 ⊙ V_x^⊙−2) ] ⊘ [ W^T V_x^⊙−1 ]    (14)

a_t ← a_t · [ Σ_f |x_ft|² σ_f²(z_t^(r), c) v_x,ft^−2 ] / [ Σ_f σ_f²(z_t^(r), c) v_x,ft^−1 ]    (15)

where the f-th row, t-th column element of V_x is v_x,ft = a_t σ_f²(z_t^(r), c) + (WH)_ft, ⊙ denotes element-wise multiplication (⊘ element-wise division, ^⊙p the element-wise power), and z_t^(r) denotes the r-th sample drawn from q_φ(z_t | x_t, c).
Through the above iterations, the final objective of this embodiment is to compute the clean speech estimate, which can be expressed as the following expectation:

ŝ_ft = E_{q_φ(z_t|x_t,c)}[ a_t σ_f²(z_t, c) / ( a_t σ_f²(z_t, c) + v_ft ) ] · x_ft    (16)
examples
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the accompanying drawings.
1. Training and testing samples and objective evaluation criteria
The training data and test samples of this embodiment are drawn from the TIMIT speech corpus, which contains speech data from 629 speakers. For each speaker, 7/8 of the data is taken as training data, and a 629-dimensional identity vector is constructed for each speaker. The test samples are formed by randomly mixing the remaining 1/8 of the TIMIT data with noises from 17 different scenes of the DEMAND noise corpus at signal-to-noise ratios from -5 dB to 5 dB, 850 utterances in total.
The conditional variational autoencoder adopted in this embodiment is shown in fig. 2; the encoder and decoder are both frame-independent fully connected deep neural networks with hidden-layer dimension 512 and tanh activation. The input of the encoder is the short-time magnitude spectrum of clean speech, and its outputs are the mean and variance of z_t, which are combined into z_t by the resampling (reparameterization) method; the result is input to the decoder, whose output is the logarithm of the clean speech short-time magnitude spectrum. The speaker identity vector is first reduced to 10 dimensions by a dimension-reduction layer (Embedding) and concatenated with the hidden-layer outputs of the encoder and decoder. During training, the clean speech short-time magnitude spectra are shuffled along the time-frame direction before being input to the network.
In this embodiment, the SDR (Signal-to-Distortion Ratio) is used as the objective evaluation index of enhancement performance. It describes the signal-to-noise ratio of the target speech relative to the residual distortion in the enhanced speech, and is calculated as follows:

SDR = 10 log₁₀ ( ‖s_target‖² / ‖ŝ − s_target‖² )    (17)

where ŝ and s denote the enhanced speech signal and the clean speech signal respectively, and s_target = (⟨ŝ, s⟩ / ‖s‖²) s is obtained by energy normalization of the clean signal against the enhanced signal; the signal-to-noise ratio is then computed, so the SDR can be regarded as an energy-scaled SNR. In this embodiment, performance is evaluated by computing the SDR between the enhanced test samples and the clean speech.
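A compact numpy sketch of this metric, where the projection step implements the energy normalization (the toy signals are illustrative):

```python
import numpy as np

def sdr(s_hat, s):
    """Signal-to-Distortion Ratio in dB: project the enhanced signal onto
    the clean signal, then compare target energy to residual energy."""
    s_target = (np.dot(s_hat, s) / np.dot(s, s)) * s
    e = s_hat - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / np.dot(e, e))

rng = np.random.default_rng(0)
s = rng.standard_normal(1000)
n = rng.standard_normal(1000)
n -= (np.dot(n, s) / np.dot(s, s)) * s       # make the error orthogonal to s
n *= np.linalg.norm(s) / np.linalg.norm(n)   # error energy = signal energy
s_hat = s + 0.1 * n                          # 20 dB of orthogonal distortion
assert abs(sdr(s_hat, s) - 20.0) < 1e-6
```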
2. Parameter setting
1) Short-time fourier transform of a signal
In this example, all signal sampling rates were 16kHz, a hanning window was used for short-time fourier transform, the window length was 1024 (64 ms), and the frame shift was 256 (16 ms).
2) Conditional variational autoencoder
The conditional variational autoencoder employed in this embodiment has input and output dimensions of 513, a hidden-layer dimension of 512, a hidden-variable z_t dimension of 64, and a compressed speaker identity vector dimension of 10. Both the encoder and the decoder are fully connected networks with only one hidden layer. Training was optimized using the Adam optimizer at a learning rate of 0.001.
3) Non-negative matrix factorization
The non-negative matrix factorization rank employed in this embodiment is 10.
4) Iteration parameters
In the iterative process, the encoder of the conditional variational autoencoder is still optimized using the Adam optimizer at a learning rate of 0.001. The number of iterations is 200; each iteration completes one back-propagation training pass of the conditional variational autoencoder, and the noise model parameters are iterated using equations (13) (14) (15), with the number of samples r = 1, i.e. a single output of the conditional variational autoencoder is substituted directly into the iteration. When finally obtaining the speech estimate, the expectation is computed with only one sample, i.e. equation (16) is rewritten as:

ŝ_ft = [ a_t σ_f²(z_t, c) / ( a_t σ_f²(z_t, c) + v_ft ) ] · x_ft    (18)
3. specific implementation flow of method
Referring to fig. 1, the implementation of the method is largely divided into a training phase and an enhancement phase.
In the training stage, the short-time Fourier transform magnitude spectra of clean speaker speech, together with the speaker's identity vector, are input to the network for training. In the enhancement stage, noisy speech of a given speaker is input; a short-time Fourier transform yields its magnitude spectrum; W and H are initialized with random numbers uniformly distributed between 0 and 1, and a is initialized to all ones. The noisy speech magnitude spectrum and the speaker's identity vector are input into the conditional variational autoencoder, the decoder weights are fixed, and the iteration begins. 200 iterations are completed using equations (12) (13) (14) (15). After the iterations converge, the resulting conditional variational autoencoder and W, H, a are substituted into equation (18) to obtain the clean speech time-frequency spectrum estimate; finally the enhanced speech is obtained by applying the inverse short-time Fourier transform to it.
To demonstrate the performance improvement of the present invention over prior methods, this embodiment is compared with the variational autoencoder plus non-negative matrix factorization algorithm of the literature (S. Leglaive, X. Alameda-Pineda, L. Girin and R. Horaud, "A Recurrent Variational Autoencoder for Speech Enhancement," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 371-375, doi: 10.1109/ICASSP40776.2020.9053164.), which omits the target speaker condition term of the present invention. FIG. 3 shows a line-graph comparison of the enhancement performance of the two methods on the test samples described in point 1. The horizontal axis lists the 17 noise types and the vertical axis gives the average SDR value of the enhanced speech under each noise condition for the two approaches. The pentagram-marked curve CVAE represents the enhancement performance of the conditional variational autoencoder plus non-negative matrix factorization algorithm of the present invention, and the dot-marked curve VAE (amp) represents that of the existing variational autoencoder plus non-negative matrix factorization algorithm. The method of the invention improves the enhancement performance over the existing algorithm across the noise scenes, and the improvement is most evident in the noise environments where the existing algorithm performs worst (e.g. the meeting and cafeteria noises in fig. 3, i.e. scenes filled with interference from other speakers).
FIG. 4 is an example of speech enhancement under a two-speaker mixing condition, where the target speech is a female voice and the interfering speech is a male voice: (a) is the mixed speech short-time magnitude spectrum, (b) is the short-time magnitude spectrum enhanced by the existing algorithm, and (c) is the short-time magnitude spectrum enhanced by the method of the invention. It can be seen that the existing algorithm retains almost all of the interfering male voice, while the method of the invention eliminates it much more thoroughly.
Claims (4)
1. A target speaker speech enhancement method based on a conditional variational autoencoder, characterized by comprising the following steps:
step 1, performing a short-time Fourier transform on clean speech data of the target speaker to obtain a short-time magnitude spectrum;
step 2, constructing an identity encoding vector for the target speaker, and training a conditional variational autoencoder as a speech model using the identity encoding vector and the short-time magnitude spectrum obtained in step 1; the input of the conditional variational autoencoder is the target speaker's speech magnitude spectrum together with the identity encoding vector, and the output is the logarithm of the target speaker's speech magnitude spectrum;
step 3, performing a short-time Fourier transform on the noisy speech signal to obtain a short-time magnitude spectrum, and retaining the phase spectrum of the noisy speech signal;
step 4, inputting the short-time magnitude spectrum of the noisy speech signal obtained in step 3 into the speech model with the target speaker identity encoding vector as the speech model's condition term, and fixing the weights of the speech model's decoder; then performing joint iterative optimization of the speech model and a non-negative matrix factorization model to obtain magnitude spectrum estimates of the speech and the noise;
step 5, combining the magnitude spectrum estimate obtained in step 4 with the phase spectrum of the noisy speech signal retained in step 3 into a complex spectrum, and obtaining the enhanced time-domain speech signal through an inverse short-time Fourier transform.
2. The method according to claim 1, wherein in step 2 the conditional variational auto-encoder uses deep neural networks as the encoder and the decoder; the encoder maps the speech magnitude spectrum to a random variable z, and the decoder maps the random variable z back to clean speech.
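As a concrete illustration of claim 2, a single-layer numpy sketch of the conditional encoder/decoder with the usual reparameterization trick. The dimensions (257 frequency bins, an 8-dimensional identity vector, a 16-dimensional latent z) and the one-layer "networks" are illustrative assumptions, not the patent's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    return x @ w + b

class CVAE:
    """Minimal sketch of the conditional VAE of claim 2: the encoder
    maps (magnitude spectrum, identity vector) to a latent z, the
    decoder maps (z, identity vector) back to a magnitude spectrum."""
    def __init__(self, f_dim=257, c_dim=8, z_dim=16):
        d = f_dim + c_dim
        self.w_mu = rng.normal(0, 0.01, (d, z_dim)); self.b_mu = np.zeros(z_dim)
        self.w_lv = rng.normal(0, 0.01, (d, z_dim)); self.b_lv = np.zeros(z_dim)
        self.w_dec = rng.normal(0, 0.01, (z_dim + c_dim, f_dim)); self.b_dec = np.zeros(f_dim)

    def encode(self, x, c):
        h = np.concatenate([x, c])
        return dense(h, self.w_mu, self.b_mu), dense(h, self.w_lv, self.b_lv)

    def decode(self, z, c):
        # decoder outputs the log magnitude spectrum; exp gives sigma(z, c)
        return np.exp(dense(np.concatenate([z, c]), self.w_dec, self.b_dec))

    def forward(self, x, c):
        mu, logvar = self.encode(x, c)
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # reparameterize
        sigma = self.decode(z, c)
        kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)  # KL(q || N(0, I))
        return sigma, kl
```

The decoder exponentiates its output because, per claim 1, the model is trained to predict the logarithm of the magnitude spectrum; exponentiation also guarantees a positive spectrum estimate.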
3. The method for target-speaker speech enhancement based on a conditional variational auto-encoder according to claim 1, wherein in step 4 the specific steps of the joint iterative optimization of the speech model and the non-negative matrix factorization model are as follows:
1) The encoder and decoder of the conditional variational auto-encoder are expressed as follows:

z_t ~ q_φ(z_t | x_t, c)

x_t ~ p_θ(x_t | z_t, c)

wherein x_t is the magnitude spectrum of the t-th frame of the input speech, z_t is the latent variable of the t-th frame output by the encoder, c denotes the speaker identity vector, φ and θ denote the weights of the encoder and the decoder respectively, and q_φ and p_θ respectively denote the distribution of latent variables generated by the encoder and the distribution of the speech magnitude-spectrum estimate generated by the decoder;
After training the encoder and decoder, the weights of the decoder p_θ(x_t | z_t, c) are fixed, and only the encoder weights are trained by back-propagation during speech enhancement; the speech magnitude-spectrum estimate output by the speech model is σ(z_t, c) and the power-spectrum estimate is σ²(z_t, c);
2) The non-negative matrix factorization is expressed in the form:

V = WH

wherein V ∈ R_+^(F×T) denotes the short-time power-spectrum estimate of the noise over F frequency bins and T frames, R_+ denoting the positive real number domain; a matrix-decomposition algorithm factorizes V into two non-negative low-rank matrices W ∈ R_+^(F×K) and H ∈ R_+^(K×T), wherein K, the rank of the two factor matrices, is much smaller than F and T;
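A runnable sketch of the V = WH noise model. The patent only states the factorization itself; the Itakura-Saito cost and its multiplicative updates below are an assumed choice, common for power spectrograms, not a detail given in the claim.

```python
import numpy as np

def is_divergence(V, V_hat):
    """Itakura-Saito divergence between two power spectrograms."""
    R = V / V_hat
    return float(np.sum(R - np.log(R) - 1.0))

def nmf_is(V, K=4, n_iter=100, eps=1e-12, seed=0):
    """Factorize V (F x T, strictly positive) into non-negative
    low-rank W (F x K) and H (K x T) with multiplicative updates
    under the Itakura-Saito divergence."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        Vh = W @ H
        W *= ((V / Vh**2) @ H.T) / ((1.0 / Vh) @ H.T + eps)
        Vh = W @ H
        H *= (W.T @ (V / Vh**2)) / (W.T @ (1.0 / Vh) + eps)
    return W, H
```

Multiplicative updates preserve non-negativity automatically, which is why they are the standard fit procedure whenever K is much smaller than F and T, as the claim requires.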
3) During the optimization process, the magnitude spectrum x_t of the noisy speech and the target-speaker identity vector c are input, and the non-negative matrix factorization parameters W and H and the gain vector a ∈ R_+^(1×T) of 1 dimension and T frames are initialized; in each iteration, the following objective function is first optimized for the conditional variational auto-encoder of step 1):

L(φ) = Σ_{t=1}^{T} ( D_KL( q_φ(z_t | x_t, c) ‖ p(z_t) ) − E_{q_φ(z_t | x_t, c)}[ log p(x_t | z_t, c) ] )

wherein D_KL(·‖·) denotes the K-L divergence between the two distributions, E[·] denotes the expectation, and p(z_t) denotes the probability density of the standard normal distribution;
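The per-frame objective of this step can be sketched numerically. The closed-form KL term below (diagonal-Gaussian encoder posterior against the standard normal prior) is exact and standard; the Itakura-Saito reconstruction term standing in for the expected log-likelihood is an assumption common in VAE-NMF enhancement rather than a detail stated in the claim.

```python
import numpy as np

def encoder_objective(mu, logvar, x_pow, sigma2, a_t, wh_t):
    """Per-frame objective: KL of the encoder posterior
    N(mu, diag(exp(logvar))) from the standard normal prior p(z_t),
    plus an Itakura-Saito fit of the noisy power spectrum x_pow to
    the model variance a_t * sigma2 + wh_t."""
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    v_hat = a_t * sigma2 + wh_t       # speech power + NMF noise power
    r = x_pow / v_hat
    is_div = np.sum(r - np.log(r) - 1.0)
    return kl + is_div
```

Both terms are non-negative and vanish exactly when the posterior matches the prior and the model variance matches the observed power, so the objective is zero only at a perfect fit.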
The non-negative matrix factorization parameters W, H and a_t are then updated using the following iterative formulas, with the model power spectrum V̂_{ft} = a_t σ_f²(z_t, c) + (WH)_{ft} averaged over R samples z_t^(r) drawn from q_φ(z_t | x_t, c):

W ← W ⊙ [ (|X|^⊙2 ⊙ V̂^⊙−2) H^T ] / [ V̂^⊙−1 H^T ]

H ← H ⊙ [ W^T (|X|^⊙2 ⊙ V̂^⊙−2) ] / [ W^T V̂^⊙−1 ]

a_t ← a_t · ( Σ_f |X_{ft}|² σ_f²(z_t, c) / V̂_{ft}² ) / ( Σ_f σ_f²(z_t, c) / V̂_{ft} )

wherein X is the complex spectrum of the noisy speech, whose element in the f-th row and t-th column is X_{ft}, and |X|^⊙2 denotes the matrix composed of the elements |X_{ft}|²; z_t^(r) denotes the r-th sample drawn from q_φ(z_t | x_t, c); ⊙ denotes element-wise matrix multiplication, ^⊙p an element-wise power, the divisions above being element-wise, and ^T denotes matrix transposition;
After a number of iterations, the resulting clean-speech estimate is expressed as:
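A numpy sketch of a Wiener-type clean-speech estimate built from the gained CVAE speech power a_t · σ²(z_t, c) and the NMF noise power (WH); this standard VAE-NMF form is an assumption for illustration, not necessarily the claim's exact expression.

```python
import numpy as np

def wiener_speech_estimate(X_mag, sigma2, a, WH):
    """Clean-magnitude estimate via a Wiener-type gain.
    Shapes: X_mag, sigma2, WH are (F, T); a is (T,)."""
    speech_pow = a[np.newaxis, :] * sigma2      # gained speech power spectrum
    noise_pow = WH                              # NMF noise power spectrum
    gain = speech_pow / (speech_pow + noise_pow + 1e-12)
    return gain * X_mag
```

The gain lies in [0, 1] per time-frequency bin: bins dominated by the speech model pass through, bins dominated by the NMF noise model are attenuated, which matches the division of roles between the two models in step 4 of claim 1.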
4. The method for target-speaker speech enhancement based on a conditional variational auto-encoder according to claim 3, wherein in step 3) the objective function is optimized using the following equation:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010557116.0A CN111653288B (en) | 2020-06-18 | 2020-06-18 | Target person voice enhancement method based on conditional variation self-encoder |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111653288A CN111653288A (en) | 2020-09-11 |
CN111653288B true CN111653288B (en) | 2023-05-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||