CN112927709B - Voice enhancement method based on time-frequency domain joint loss function - Google Patents
- Publication number
- CN112927709B (application CN202110155444.2A)
- Authority
- CN
- China
- Prior art keywords
- voice
- frequency domain
- data set
- amplitude spectrum
- clean
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention provides a speech enhancement method based on a time-frequency domain joint loss function. Clean speech and noise from open-source data sets are mixed into a noisy speech data set, which is converted by preprocessing into magnitude spectra, phase spectra and waveform data to construct a training set. A CNN network model is built and trained with the noisy-speech magnitude spectrum as input and the clean-speech magnitude spectrum as the label. The magnitude-spectrum estimate output by the model is combined with the noisy-speech phase spectrum and reconstructed by the inverse short-time Fourier transform into the time-domain waveform of the estimated speech; the frequency-domain loss is computed from the clean-speech magnitude spectrum and the magnitude-spectrum estimate; the time-domain loss is computed from the clean-speech waveform and the estimated-speech waveform; and the time-frequency joint loss constructed from the two losses guides the weight optimization of the CNN network model. The invention reduces the mismatch between the estimated magnitude spectrum and the phase spectrum and improves the speech enhancement effect.
Description
Technical Field
The invention relates to the field of voice enhancement, in particular to a voice enhancement method based on a time-frequency domain joint loss function.
Background
Voice is the most convenient mode of information interaction between people and machines. In any environment, however, voice communication is disturbed to some degree by ambient noise. Speech enhancement is an effective way to mitigate the influence of noise during voice interaction. Its purpose is to extract the clean speech signal from the background noise as far as possible, suppress environmental noise, and improve speech quality and intelligibility.
In recent years, with the rise of artificial intelligence, speech enhancement has developed rapidly and a variety of schemes have emerged. They fall mainly into two categories: conventional speech enhancement schemes and deep-learning-based speech enhancement schemes.
Conventional speech enhancement schemes mainly include spectral subtraction, statistical-model-based enhancement algorithms, and subspace enhancement algorithms. Spectral subtraction assumes that the noise is additive, subtracts an estimate of the noise spectrum from the spectrum of the noisy speech, and thereby obtains an estimate of the clean speech. The Wiener filter and the minimum mean-square error (MMSE) estimator are representative statistical-model-based algorithms; compared with spectral subtraction, the residual noise left by Wiener filtering resembles white noise and is therefore more comfortable to listen to. The MMSE algorithm exploits the perceptual importance of the short-time spectral amplitude of the speech signal and enhances the noisy speech with a minimum mean-square-error short-time spectral amplitude estimator. Subspace enhancement algorithms derive from linear algebra: in Euclidean space the clean signal is confined to a signal subspace, so speech enhancement can be accomplished by decomposing the vector space of the noisy signal into a signal subspace and a noise subspace.
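For illustration only (not part of the claimed invention), a minimal spectral-subtraction sketch in Python, assuming a precomputed noise-magnitude estimate and the hypothetical helper name spectral_subtraction:

```python
# Illustrative spectral-subtraction sketch (assumes a noise magnitude estimate,
# e.g. averaged from noise-only frames).
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, noise_mag, fs=16000, nperseg=256, noverlap=192):
    _, _, spec = stft(noisy, fs=fs, window="hamming",
                      nperseg=nperseg, noverlap=noverlap)
    mag, phase = np.abs(spec), np.angle(spec)
    # Subtract the noise magnitude estimate and floor negative values at zero.
    clean_mag = np.maximum(mag - noise_mag[:, None], 0.0)
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs, window="hamming",
                        nperseg=nperseg, noverlap=noverlap)
    return enhanced
```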
Conventional speech enhancement algorithms mostly assume that the speech signal is stationary, an assumption that rarely holds in real life. Deep-learning-based speech enhancement algorithms can address this problem effectively through their strong nonlinear fitting capability. Depending on the training target, they fall into two categories: mask-based enhancement networks and mapping-based enhancement networks. Mask-based networks use an ideal ratio mask, a phase mask, or a similar target to train the neural network; mapping-based networks use the fitting capability of the neural network to map the log spectrum or power spectrum of the noisy speech directly to that of the clean speech. According to the network model, deep-learning-based enhancement networks can further be classified into DNN, CNN, RNN, and GAN enhancement networks.
Feature processing of the spectrogram is the key to deep-learning-based speech enhancement, so CNN models are better suited to the speech enhancement task than other network models.
In the process of implementing the invention, the inventor of the application found that the prior-art methods have at least the following technical problems:
A speech enhancement network usually generates the magnitude spectrum of the estimated speech from the noisy-speech magnitude spectrum and then reconstructs the waveform with the noisy-speech phase spectrum, which causes a mismatch between the magnitude spectrum and the phase spectrum. Speech enhancement is usually deployed as the front-end module of a speech recognition system, and its quality strongly affects recognition performance. The magnitude-phase mismatch can corrupt feature information of the speech signal, such as Mel-frequency cepstral coefficients (MFCC), and thereby reduce recognition accuracy.
Therefore, the prior-art methods are deficient as the speech enhancement front end of a speech recognition system, and solving the phase-mismatch problem in speech enhancement is of important practical significance.
Disclosure of Invention
Speech enhancement models based on deep neural networks generally transform the noisy speech signal to the frequency domain with the short-time Fourier transform (STFT) and learn the mapping between magnitude spectra using the mean square error (MSE) between the frequency-domain magnitude spectra of the clean speech and of the estimate as the loss function. Because the conventional method reconstructs the estimated signal with the phase spectrum of the noisy speech, the magnitude spectrum and the phase spectrum become mismatched, which degrades the enhancement effect. The aim of the invention is to reduce the magnitude-phase mismatch of the enhanced speech by using a joint time-domain and frequency-domain loss function.
In view of the defects of the conventional method, the invention provides a speech enhancement model based on a time-frequency domain joint loss function. The single frequency-domain loss of the conventional scheme is replaced by a joint time-frequency loss: introducing a time-domain loss that measures the waveform error between the estimated signal and the clean speech signal reduces the influence of the magnitude-phase mismatch and, by reducing the loss and corruption of information in the enhanced speech, improves the performance of the enhancement algorithm as a speech recognition front-end module. The purpose of the invention is achieved by the following technical scheme:
The invention provides a speech enhancement method based on a time-frequency domain joint loss function, comprising the following steps:
Step 1: mix the clean speech data set and the noise data of an open-source data set into a noisy speech data set; frame the clean speech in the clean speech data set with overlap and apply the short-time Fourier transform to obtain the frequency-domain magnitude spectrum of each clean speech, forming a clean-speech frequency-domain magnitude spectrum data set; sample and frame the clean speech and apply a Hamming window to obtain its waveform data, forming a clean-speech time-domain waveform data set; frame the noisy speech in the noisy speech data set with overlap and apply the short-time Fourier transform to obtain the frequency-domain magnitude spectrum and the frequency-domain phase spectrum of each noisy speech, forming a noisy-speech frequency-domain magnitude spectrum data set and a noisy-speech frequency-domain phase spectrum data set; and construct the network training set data from the clean-speech frequency-domain magnitude spectrum data set, the clean-speech time-domain waveform data set, the noisy-speech frequency-domain magnitude spectrum data set and the noisy-speech frequency-domain phase spectrum data set;
Step 2: construct a CNN network model, with the noisy-speech frequency-domain magnitude spectra in the network training set data as the model input set and the clean-speech frequency-domain magnitude spectra as the training target set; each time the network receives a noisy-speech frequency-domain magnitude spectrum, the corresponding clean-speech frequency-domain magnitude spectrum serves as the label, and the CNN network model predicts the clean-speech frequency-domain magnitude spectrum from the noisy-speech frequency-domain magnitude spectrum to obtain a frequency-domain magnitude-spectrum estimate; combine the estimate with the frequency-domain phase spectrum of the noisy speech and reconstruct the waveform by the inverse short-time Fourier transform to obtain the enhanced speech; sample and frame the enhanced speech with overlap and apply a Hamming window to obtain the time-domain waveform data of the estimated speech;
compute the frequency-domain loss from the clean-speech frequency-domain magnitude spectrum and the frequency-domain magnitude-spectrum estimate; compute the time-domain loss from the clean-speech time-domain waveform data and the estimated-speech time-domain waveform data; and construct the time-frequency joint loss from the frequency-domain loss and the time-domain loss;
Step 3: according to the time-frequency joint loss, update the weight matrices of the convolutional layers with the Adam optimizer and proceed to the next iteration until training is finished, obtaining the optimized network weight parameters and thereby the optimized CNN network model.
Preferably, the clean speech data set in step 1 is:
{C_i, i ∈ [1, K]}
where C_i is the i-th clean speech in the clean speech data set and K is the number of clean speeches;
The noise data set in step 1 is:
{N_i, i ∈ [1, K']}
where K' is the number of noise clips in the noise data set;
The synthesized noisy speech data set in step 1 is:
{R_i, i ∈ [1, K]}
where R_i is the i-th synthesized noisy speech and K is the number of noisy speeches in the synthesized noisy speech data set;
The clean-speech frequency-domain magnitude spectrum data set in step 1 is:
{Y_i, i ∈ [1, K]}, Y_i = {Y_{i,j}, j ∈ [1, N_i]}
where Y_i is the frequency-domain magnitude spectrum of the i-th clean speech, Y_{i,j} is the j-th frame of frequency-domain magnitude data of the i-th clean speech, and N_i is the total number of frames of the i-th clean speech;
The clean-speech time-domain waveform data set in step 1 is:
{y_i, i ∈ [1, K]}, y_i = {y_{i,j}, j ∈ [1, T_i]}
where y_i is the waveform data of the i-th clean speech, y_{i,j} is the value of the j-th sampling point of the i-th clean speech, and T_i is the total number of sampling points of the i-th clean speech;
The noisy-speech frequency-domain magnitude spectrum data set in step 1 is:
{M_i, i ∈ [1, K]}, M_i = {M_{i,j}, j ∈ [1, N_i]}
where M_i is the frequency-domain magnitude spectrum of the i-th noisy speech, M_{i,j} is the j-th frame of frequency-domain magnitude data of the i-th noisy speech, and N_i is the total number of frames of the i-th noisy speech;
The noisy-speech frequency-domain phase spectrum data set in step 1 is:
{P_i, i ∈ [1, K]}
where P_i is the frequency-domain phase spectrum of the i-th noisy speech and K is the number of noisy-speech frequency-domain phase spectra in the data set;
The network training set data constructed in step 1 is:
{M_i, P_i, Y_i, y_i, i ∈ [1, K]}
where K is the total number of speeches in the noisy speech data set;
preferably, the pre-trained CNN network model in step 2 is formed by cascading an encoder and a decoder;
the encoder is formed by sequentially cascading a coding convolution layer, a coding batch normalization layer, a maximum pooling layer and a Leaky ReLu activation layer, wherein the coding convolution layer relates to weight updating;
the decoder is formed by sequentially cascading a decoding convolution layer, a decoding batch normalization layer and an upper sampling layer, wherein the decoding convolution layer relates to weight updating;
The frequency-domain magnitude spectra of the noisy speech in the training set data of step 2 are:
{M_i, i ∈ [1, K]}, M_i = {M_{i,j}, j ∈ [1, N_i]}
where M_i is the frequency-domain magnitude spectrum of the i-th noisy speech, M_{i,j} is the j-th frame of frequency-domain magnitude data of the i-th noisy speech, and N_i is the total number of frames of the i-th noisy speech;
The frequency-domain magnitude spectra of the clean speech in the training set data of step 2 are:
{Y_i, i ∈ [1, K]}, Y_i = {Y_{i,j}, j ∈ [1, N_i]}
where Y_i is the frequency-domain magnitude spectrum of the i-th clean speech, Y_{i,j} is the j-th frame of frequency-domain magnitude data of the i-th clean speech, and N_i is the total number of frames of the i-th clean speech;
The frequency-domain magnitude-spectrum estimate output by the network for the i-th noisy speech input is denoted Ŷ_i = {Ŷ_{i,j}, j ∈ [1, N_i]}, where Ŷ_{i,j} is the j-th frame of the i-th magnitude-spectrum estimate and N_i is the total number of frames of the i-th magnitude-spectrum estimate;
The time-domain waveform data of the i-th estimated speech is denoted ŷ_i = {ŷ_{i,j}, j ∈ [1, T_i]}, where ŷ_{i,j} is the amplitude of the j-th sampling point of the i-th estimated speech and T_i is the total number of sampling points of the i-th estimated speech;
The frequency-domain loss in step 2 is the mean square error of the frequency-domain magnitude spectra:
L_{i,F} = (1/N_i) Σ_{j=1}^{N_i} ||Y_{i,j} − Ŷ_{i,j}||²
where L_{i,F} is the frequency-domain loss when the i-th noisy speech is the input;
The time-domain loss in step 2 is the mean square error of the time-domain waveform amplitudes:
L_{i,W} = (1/T_i) Σ_{j=1}^{T_i} (y_{i,j} − ŷ_{i,j})²
where L_{i,W} is the time-domain loss when the i-th noisy speech is the input;
The time-frequency joint loss in step 2 is:
L_{i,total} = L_{i,F} + α·L_{i,W}
where L_{i,total} is the time-frequency joint loss when the i-th noisy speech is the input and α is the hyperparameter weight coefficient.
Compared with the prior art, the invention has the following advantages and beneficial effects: the mismatch between the estimated magnitude spectrum and the phase spectrum is reduced, and the speech enhancement effect is improved to a certain extent.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a structural schematic diagram of the CNN speech enhancement network of the invention.
Detailed Description
This embodiment implements training and testing based on the AISHELL speech corpus and the MUSAN noise corpus.
As shown in FIG. 1, this embodiment performs speech enhancement and training based on a convolutional neural network (CNN) model, and compares the enhancement results with existing algorithms by replacing the loss function with the time-frequency joint loss function.
The first embodiment of the present invention is a speech enhancement method based on a time-frequency domain joint loss function, and specifically includes the following steps:
Step 1: mix the clean speech data set and the noise data of an open-source data set into a noisy speech data set; frame the clean speech in the clean speech data set with overlap and apply the short-time Fourier transform to obtain the frequency-domain magnitude spectrum of each clean speech, forming a clean-speech frequency-domain magnitude spectrum data set; sample and frame the clean speech and apply a Hamming window to obtain its waveform data, forming a clean-speech time-domain waveform data set; frame the noisy speech in the noisy speech data set with overlap and apply the short-time Fourier transform to obtain the frequency-domain magnitude spectrum and the frequency-domain phase spectrum of each noisy speech, forming a noisy-speech frequency-domain magnitude spectrum data set and a noisy-speech frequency-domain phase spectrum data set; and construct the network training set data from the clean-speech frequency-domain magnitude spectrum data set, the clean-speech time-domain waveform data set, the noisy-speech frequency-domain magnitude spectrum data set and the noisy-speech frequency-domain phase spectrum data set;
The clean speech data set in step 1 is:
{C_i, i ∈ [1, K]}
where C_i is the i-th clean speech in the clean speech data set and K = 1024 is the number of clean speeches;
The noise data set in step 1 is:
{N_i, i ∈ [1, K']}
where K' is the number of noise clips in the noise data set;
The synthesized noisy speech data set in step 1 is:
{R_i, i ∈ [1, K]}
where R_i is the i-th synthesized noisy speech in the synthesized noisy speech data set and K = 1024 is the number of noisy speeches;
In step 1, the short-time Fourier transform uses 16 kHz sampling, a window length of 256 and a window overlap of 75% (hop size 64). The number of sampling points is computed as:
T = t × 16000
where t is the duration of the speech in seconds, T is the total number of sampling points of the speech, and N is the number of frames of the frequency-domain magnitude spectrum of the speech, determined by the window length and the window overlap.
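For illustration, a small Python helper (the helper name and the no-padding framing assumption are not from the patent) that derives the sampling-point count T and the frame count N from these STFT parameters:

```python
# Illustrative helper (assumed): derive sample count and STFT frame count
# from the parameters used in the embodiment (16 kHz, window 256, 75% overlap).
def stft_sizes(t_seconds: float, fs: int = 16000, win: int = 256, overlap: float = 0.75):
    T = int(t_seconds * fs)           # total number of sampling points
    hop = int(win * (1 - overlap))    # hop size: 64 samples
    N = 1 + max(0, T - win) // hop    # number of full frames (no padding assumed)
    return T, hop, N

T, hop, N = stft_sizes(3.0)           # e.g. a 3-second utterance
print(T, hop, N)                      # 48000 64 747
```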
The clean-speech frequency-domain magnitude spectrum data set in step 1 is:
{Y_i, i ∈ [1, K]}, Y_i = {Y_{i,j}, j ∈ [1, N_i]}
where Y_i is the frequency-domain magnitude spectrum of the i-th clean speech, Y_{i,j} is the j-th frame of frequency-domain magnitude data of the i-th clean speech, and N_i is the total number of frames of the i-th clean speech;
The clean-speech time-domain waveform data set in step 1 is:
{y_i, i ∈ [1, K]}, y_i = {y_{i,j}, j ∈ [1, T_i]}
where y_i is the waveform data of the i-th clean speech, y_{i,j} is the value of the j-th sampling point of the i-th clean speech, and T_i is the total number of sampling points of the i-th clean speech obtained with 16 kHz sampling;
The noisy-speech frequency-domain magnitude spectrum data set in step 1 is:
{M_i, i ∈ [1, K]}, M_i = {M_{i,j}, j ∈ [1, N_i]}
where M_i is the frequency-domain magnitude spectrum of the i-th noisy speech, M_{i,j} is the j-th frame of frequency-domain magnitude data of the i-th noisy speech, and N_i is the total number of frames of the i-th noisy speech;
The noisy-speech frequency-domain phase spectrum data set in step 1 is:
{P_i, i ∈ [1, K]}
where P_i is the frequency-domain phase spectrum of the i-th noisy speech and K is the number of noisy-speech frequency-domain phase spectra in the data set;
The network training set data constructed in step 1 is:
{M_i, P_i, Y_i, y_i, i ∈ [1, K]}
where K is the total number of speeches in the noisy speech data set;
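A minimal Python sketch of this step-1 preprocessing; the mixing SNR, helper names and random stand-in signals are assumptions rather than details specified by the patent:

```python
# Minimal sketch of the step-1 preprocessing (assumed helpers and parameters):
# mix clean speech with noise, then extract magnitude/phase spectra and waveforms.
import numpy as np
from scipy.signal import stft

FS, WIN, HOP = 16000, 256, 64       # 16 kHz sampling, window 256, 75% overlap

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float = 5.0) -> np.ndarray:
    noise = np.resize(noise, clean.shape)              # crop/tile noise to clean length
    scale = np.sqrt(np.sum(clean**2) / (np.sum(noise**2) * 10 ** (snr_db / 10) + 1e-12))
    return clean + scale * noise                       # noisy speech R_i

def spectra(wave: np.ndarray):
    _, _, spec = stft(wave, fs=FS, window="hamming", nperseg=WIN, noverlap=WIN - HOP)
    return np.abs(spec), np.angle(spec)                # magnitude, phase

clean = np.random.randn(FS * 3)                        # stand-in for a clean utterance C_i
noise = np.random.randn(FS * 3)                        # stand-in for a noise clip N_i
noisy = mix_at_snr(clean, noise)                       # synthesized noisy speech R_i

Y, _ = spectra(clean)                                  # clean magnitude spectrum Y_i
M, P = spectra(noisy)                                  # noisy magnitude M_i and phase P_i
y = clean                                              # clean time-domain waveform y_i
training_example = (M, P, Y, y)                        # one element of {M_i, P_i, Y_i, y_i}
```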
Step 2: construct a CNN network model, with the noisy-speech frequency-domain magnitude spectra in the network training set data as the model input set and the clean-speech frequency-domain magnitude spectra as the training target set; each time the network receives a noisy-speech frequency-domain magnitude spectrum, the corresponding clean-speech frequency-domain magnitude spectrum serves as the label, and the CNN network model predicts the clean-speech frequency-domain magnitude spectrum from the noisy-speech frequency-domain magnitude spectrum to obtain a frequency-domain magnitude-spectrum estimate; combine the estimate with the frequency-domain phase spectrum of the noisy speech and reconstruct the waveform by the inverse short-time Fourier transform to obtain the enhanced speech; sample and frame the enhanced speech with overlap and apply a Hamming window to obtain the time-domain waveform data of the estimated speech;
compute the frequency-domain loss from the clean-speech frequency-domain magnitude spectrum and the frequency-domain magnitude-spectrum estimate; compute the time-domain loss from the clean-speech time-domain waveform data and the estimated-speech time-domain waveform data; and construct the time-frequency joint loss from the frequency-domain loss and the time-domain loss;
The CNN network model is formed by cascading an encoder and a decoder;
the encoder is formed by sequentially cascading an encoding convolutional layer, an encoding batch-normalization layer, a max-pooling layer and a Leaky ReLU activation layer, where the encoding convolutional layer involves weight updating;
the decoder is formed by sequentially cascading a decoding convolutional layer, a decoding batch-normalization layer and an up-sampling layer, where the decoding convolutional layer involves weight updating;
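A minimal PyTorch sketch of such an encoder-decoder CNN; the channel counts, depth and input size are assumptions, since the patent does not specify them:

```python
# Minimal encoder-decoder CNN sketch in PyTorch (channel counts and depth assumed).
import torch
import torch.nn as nn

class EncDecCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: convolution -> batch normalization -> max pooling -> Leaky ReLU
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.MaxPool2d(2),
            nn.LeakyReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.MaxPool2d(2),
            nn.LeakyReLU(),
        )
        # Decoder: convolution -> batch normalization -> upsampling
        self.decoder = nn.Sequential(
            nn.Conv2d(32, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
            nn.BatchNorm2d(1),
            nn.Upsample(scale_factor=2),
        )

    def forward(self, x):                      # x: (batch, 1, freq, frames)
        return self.decoder(self.encoder(x))   # magnitude-spectrum estimate

model = EncDecCNN()
# 128x128 input chosen so that pooling and upsampling round-trip exactly.
est = model(torch.rand(4, 1, 128, 128))        # output: (4, 1, 128, 128)
```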
The frequency-domain magnitude spectra of the noisy speech in the training set data of step 2 are:
{M_i, i ∈ [1, K]}, M_i = {M_{i,j}, j ∈ [1, N_i]}
where M_i is the frequency-domain magnitude spectrum of the i-th noisy speech, M_{i,j} is the j-th frame of frequency-domain magnitude data of the i-th noisy speech, and N_i is the total number of frames of the i-th noisy speech;
The frequency-domain magnitude spectra of the clean speech in the training set data of step 2 are:
{Y_i, i ∈ [1, K]}, Y_i = {Y_{i,j}, j ∈ [1, N_i]}
where Y_i is the frequency-domain magnitude spectrum of the i-th clean speech, Y_{i,j} is the j-th frame of frequency-domain magnitude data of the i-th clean speech, and N_i is the total number of frames of the i-th clean speech;
The frequency-domain magnitude-spectrum estimate output by the network for the i-th noisy speech input is denoted Ŷ_i = {Ŷ_{i,j}, j ∈ [1, N_i]}, where Ŷ_{i,j} is the j-th frame of the i-th magnitude-spectrum estimate and N_i is the total number of frames of the i-th magnitude-spectrum estimate;
The time-domain waveform data of the i-th estimated speech is denoted ŷ_i = {ŷ_{i,j}, j ∈ [1, T_i]}, where ŷ_{i,j} is the amplitude of the j-th sampling point of the i-th estimated speech and T_i is the total number of sampling points of the i-th estimated speech;
The frequency-domain loss in step 2 is the mean square error of the frequency-domain magnitude spectra:
L_{i,F} = (1/N_i) Σ_{j=1}^{N_i} ||Y_{i,j} − Ŷ_{i,j}||²
where L_{i,F} is the frequency-domain loss when the i-th noisy speech is the input;
The time-domain loss in step 2 is the mean square error of the time-domain waveform amplitudes:
L_{i,W} = (1/T_i) Σ_{j=1}^{T_i} (y_{i,j} − ŷ_{i,j})²
where L_{i,W} is the time-domain loss when the i-th noisy speech is the input;
The time-frequency joint loss in step 2 is:
L_{i,total} = L_{i,F} + α·L_{i,W}
where L_{i,total} is the time-frequency joint loss when the i-th noisy speech is the input and α = 0.15 is the hyperparameter weight coefficient.
Step 3: according to the time-frequency joint loss, update the weight matrices of the convolutional layers with the Adam optimizer and proceed to the next iteration until training is finished, obtaining the optimized network weight parameters and thereby the optimized CNN network model.
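A minimal PyTorch sketch of one training iteration with the time-frequency joint loss and the Adam optimizer; the stand-in model, tensor shapes and learning rate are assumptions, not values specified by the patent:

```python
# Minimal sketch of one training iteration with the time-frequency joint loss
# and Adam (stand-in model, random tensors and learning rate are assumptions).
import torch
import torch.nn as nn

model = nn.Sequential(                      # stand-in for the encoder-decoder CNN
    nn.Conv2d(1, 16, 3, padding=1), nn.LeakyReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
window, alpha = torch.hamming_window(256), 0.15

def reconstruct(mag, phase):
    # Waveform reconstruction from a magnitude spectrum and the noisy phase (ISTFT).
    return torch.istft(torch.polar(mag, phase), n_fft=256, hop_length=64,
                       win_length=256, window=window)

# Hypothetical tensors for one noisy/clean pair: (freq bins, frames) = (129, 200).
noisy_mag, noisy_phase = torch.rand(129, 200), torch.rand(129, 200)
clean_mag = torch.rand(129, 200)
clean_wav = reconstruct(clean_mag, noisy_phase)               # stand-in clean waveform y_i

est_mag = model(noisy_mag[None, None]).squeeze().clamp_min(0.0)  # non-negative magnitude estimate
loss_f = torch.mean((clean_mag - est_mag) ** 2)               # frequency-domain loss L_{i,F}
est_wav = reconstruct(est_mag, noisy_phase)                   # estimated-speech waveform
loss_w = torch.mean((clean_wav - est_wav) ** 2)               # time-domain loss L_{i,W}
loss = loss_f + alpha * loss_w                                # joint loss L_{i,total}

optimizer.zero_grad()
loss.backward()
optimizer.step()
```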
A second embodiment of the present invention is a specific test procedure, including the steps of:
Step 1: collect the noisy speech and convert it into a magnitude spectrum and a phase spectrum.
Step 2: feed the magnitude spectrum into the enhancement model to obtain the enhanced speech magnitude spectrum.
Step 3: perform the inverse short-time Fourier transform (ISTFT) with the noisy-speech phase spectrum and the estimated magnitude spectrum to obtain the enhanced speech waveform, and write it to a wav file.
Step 4: perform speech recognition on the generated speech files and compare the recognition accuracy obtained with different enhancement models.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited thereto; any change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included within the scope of protection of the present invention.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
Claims (1)
1. A speech enhancement method based on a time-frequency domain joint loss function is characterized by comprising the following steps:
Step 1: mix the clean speech data set and the noise data of an open-source data set into a noisy speech data set; frame the clean speech in the clean speech data set with overlap and apply the short-time Fourier transform to obtain the frequency-domain magnitude spectrum of each clean speech, forming a clean-speech frequency-domain magnitude spectrum data set; sample and frame the clean speech and apply a Hamming window to obtain its waveform data, forming a clean-speech time-domain waveform data set; frame the noisy speech in the noisy speech data set with overlap and apply the short-time Fourier transform to obtain the frequency-domain magnitude spectrum and the frequency-domain phase spectrum of each noisy speech, forming a noisy-speech frequency-domain magnitude spectrum data set and a noisy-speech frequency-domain phase spectrum data set; and construct the network training set data from the clean-speech frequency-domain magnitude spectrum data set, the clean-speech time-domain waveform data set, the noisy-speech frequency-domain magnitude spectrum data set and the noisy-speech frequency-domain phase spectrum data set;
Step 2: construct a CNN network model, with the noisy-speech frequency-domain magnitude spectra in the network training set data as the model input set and the clean-speech frequency-domain magnitude spectra as the training target set; each time the network receives a noisy-speech frequency-domain magnitude spectrum, the corresponding clean-speech frequency-domain magnitude spectrum serves as the label, and the CNN network model predicts the clean-speech frequency-domain magnitude spectrum from the noisy-speech frequency-domain magnitude spectrum to obtain a frequency-domain magnitude-spectrum estimate; combine the estimate with the frequency-domain phase spectrum of the noisy speech and reconstruct the waveform by the inverse short-time Fourier transform to obtain the enhanced speech; sample and frame the enhanced speech with overlap and apply a Hamming window to obtain the time-domain waveform data of the estimated speech;
compute the frequency-domain loss from the clean-speech frequency-domain magnitude spectrum and the frequency-domain magnitude-spectrum estimate; compute the time-domain loss from the clean-speech time-domain waveform data and the estimated-speech time-domain waveform data; and construct the time-frequency joint loss from the frequency-domain loss and the time-domain loss;
Step 3: according to the time-frequency joint loss, update the weight matrices of the convolutional layers with the Adam optimizer and proceed to the next iteration until training is finished, obtaining the optimized network weight parameters and thereby the optimized CNN network model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110155444.2A CN112927709B (en) | 2021-02-04 | 2021-02-04 | Voice enhancement method based on time-frequency domain joint loss function |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110155444.2A CN112927709B (en) | 2021-02-04 | 2021-02-04 | Voice enhancement method based on time-frequency domain joint loss function |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112927709A (en) | 2021-06-08 |
CN112927709B (en) | 2022-06-14 |
Family
ID=76170408
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110155444.2A Expired - Fee Related CN112927709B (en) | 2021-02-04 | 2021-02-04 | Voice enhancement method based on time-frequency domain joint loss function |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112927709B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113436640B (en) * | 2021-06-28 | 2022-11-25 | 歌尔科技有限公司 | Audio noise reduction method, device and system and computer readable storage medium |
CN113707164A (en) * | 2021-09-02 | 2021-11-26 | 哈尔滨理工大学 | Voice enhancement method for improving multi-resolution residual error U-shaped network |
CN114400018A (en) * | 2021-12-31 | 2022-04-26 | 深圳市优必选科技股份有限公司 | Voice noise reduction method and device, electronic equipment and computer readable storage medium |
CN114974281A (en) * | 2022-05-24 | 2022-08-30 | 云知声智能科技股份有限公司 | Training method and device of voice noise reduction model, storage medium and electronic device |
CN114783457B (en) * | 2022-06-01 | 2024-10-29 | 中国科学院半导体研究所 | Sound signal enhancement method and device based on waveform and frequency domain information fusion network |
CN115240648B (en) * | 2022-07-18 | 2023-04-07 | 四川大学 | Controller voice enhancement method and device facing voice recognition |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013127364A1 (en) * | 2012-03-01 | 2013-09-06 | 华为技术有限公司 | Voice frequency signal processing method and device |
CN109360581A (en) * | 2018-10-12 | 2019-02-19 | 平安科技(深圳)有限公司 | Sound enhancement method, readable storage medium storing program for executing and terminal device neural network based |
CN110503967A (en) * | 2018-05-17 | 2019-11-26 | 中国移动通信有限公司研究院 | A kind of sound enhancement method, device, medium and equipment |
CN111081268A (en) * | 2019-12-18 | 2020-04-28 | 浙江大学 | Phase-correlated shared deep convolutional neural network speech enhancement method |
CN111696568A (en) * | 2020-06-16 | 2020-09-22 | 中国科学技术大学 | Semi-supervised transient noise suppression method |
CN112185405A (en) * | 2020-09-10 | 2021-01-05 | 中国科学技术大学 | Bone conduction speech enhancement method based on differential operation and joint dictionary learning |
- 2021-02-04 CN CN202110155444.2A patent/CN112927709B/en not_active Expired - Fee Related
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013127364A1 (en) * | 2012-03-01 | 2013-09-06 | 华为技术有限公司 | Voice frequency signal processing method and device |
CN110503967A (en) * | 2018-05-17 | 2019-11-26 | 中国移动通信有限公司研究院 | A kind of sound enhancement method, device, medium and equipment |
CN109360581A (en) * | 2018-10-12 | 2019-02-19 | 平安科技(深圳)有限公司 | Sound enhancement method, readable storage medium storing program for executing and terminal device neural network based |
CN111081268A (en) * | 2019-12-18 | 2020-04-28 | 浙江大学 | Phase-correlated shared deep convolutional neural network speech enhancement method |
CN111696568A (en) * | 2020-06-16 | 2020-09-22 | 中国科学技术大学 | Semi-supervised transient noise suppression method |
CN112185405A (en) * | 2020-09-10 | 2021-01-05 | 中国科学技术大学 | Bone conduction speech enhancement method based on differential operation and joint dictionary learning |
Non-Patent Citations (4)
Title |
---|
孟祥彩 et al., "An improved monaural speech enhancement algorithm based on phase influence" (基于相位影响的单声道语音增强改进算法), 《电声技术》, 2016, No. 11 * |
曾亮, "Matlab simulation of a speech enhancement algorithm based on wavelet transform" (基于小波变换语音增强算法的Matlab仿真), 《软件导刊》, 2013, No. 10 * |
秦焕丁 et al., "Research on an improved speech enhancement algorithm based on the minimum mean square error magnitude spectrum" (基于最小均方误差幅度谱的改进语音增强算法研究), 《电子技术》, 2016, No. 7 * |
胡科开 et al., "A new speech enhancement algorithm based on improved spectral subtraction" (一种基于改进型谱减法的语音增强新算法), 《大众科技》, 2008, No. 9 * |
Also Published As
Publication number | Publication date |
---|---|
CN112927709A (en) | 2021-06-08 |
Similar Documents
Publication | Title |
---|---|
CN112927709B (en) | Voice enhancement method based on time-frequency domain joint loss function | |
CN108172238B (en) | Speech enhancement algorithm based on multiple convolutional neural networks in speech recognition system | |
Tu et al. | Speech enhancement based on teacher–student deep learning using improved speech presence probability for noise-robust speech recognition | |
CN112735456B (en) | Speech enhancement method based on DNN-CLSTM network | |
CN110767244B (en) | Speech enhancement method | |
Shi et al. | Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation. | |
CN112802491B (en) | Voice enhancement method for generating confrontation network based on time-frequency domain | |
CN111899750B (en) | Speech enhancement algorithm combining cochlear speech features and hopping deep neural network | |
Le et al. | Inference skipping for more efficient real-time speech enhancement with parallel RNNs | |
Yoneyama et al. | Unified source-filter GAN: Unified source-filter network based on factorization of quasi-periodic parallel WaveGAN | |
CN114495969A (en) | Voice recognition method integrating voice enhancement | |
CN113035217B (en) | Voice enhancement method based on voiceprint embedding under low signal-to-noise ratio condition | |
Tu et al. | DNN training based on classic gain function for single-channel speech enhancement and recognition | |
Saleem et al. | Time domain speech enhancement with CNN and time-attention transformer | |
Girirajan et al. | Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network. | |
Haton | Automatic speech recognition: A Review | |
CN109741733B (en) | Voice phoneme recognition method based on consistency routing network | |
Xu et al. | Joint training ResCNN-based voice activity detection with speech enhancement | |
CN111508475A (en) | Robot awakening voice keyword recognition method and device and storage medium | |
Yang et al. | RS-CAE-based AR-Wiener filtering and harmonic recovery for speech enhancement | |
Kalamani et al. | Continuous Tamil Speech Recognition technique under non stationary noisy environments | |
CN114360571A (en) | Reference-based speech enhancement method | |
Li et al. | A Convolutional Neural Network with Non-Local Module for Speech Enhancement. | |
CN108573698B (en) | Voice noise reduction method based on gender fusion information | |
CN115273874A (en) | Voice enhancement model calculated quantity compression method based on recurrent neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20220614 |