CN114242099A - Speech enhancement algorithm based on improved phase spectrum compensation and full convolution neural network - Google Patents
- Publication number
- CN114242099A (application number CN202111534489.7A)
- Authority
- CN
- China
- Prior art keywords
- spectrum
- voice
- speech
- phase
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L21/0208 — Noise filtering
- G10L21/0216 — Noise filtering characterised by the method used for estimating noise
- G10L21/0232 — Processing in the frequency domain
- G10L25/30 — Speech or voice analysis techniques characterised by the use of neural networks
Abstract
The invention provides a voice enhancement algorithm based on improved phase spectrum compensation and a full convolution neural network. Training-set voice data are preprocessed to obtain characteristic data of the noisy speech and the clean speech. Combining these characteristic data, the frame signal-to-noise ratio is introduced to optimize the compensation-factor formula of the phase compensation algorithm, and the phase compensation factor is calculated with the improved formula. A full convolution neural network model is built, the phase compensation factor and the logarithmic power spectrum of the clean speech are used together as the training target of the network, and the network model is trained. Test speech is input into the trained model to obtain estimates of the logarithmic power spectrum and the phase compensation function, from which the amplitude spectrum and the phase spectrum of the speech signal are respectively reconstructed to obtain the final enhanced speech. The invention improves the noise-elimination capability of the algorithm while better preserving speech intelligibility, thereby improving the overall effect of speech enhancement.
Description
Technical Field
The invention relates to a voice enhancement method, in particular to a voice enhancement algorithm based on improved phase spectrum compensation and a full convolution neural network, and belongs to the technical field of voice signal processing.
Background
It is known that voice is an important means of communicating information between people, but speech communication is always disturbed by various noises. Noisy speech not only increases listener fatigue and degrades speech communication quality, but also degrades the performance of speech processing systems based on feature parameter extraction. Therefore, to reduce the effect of background noise on speech quality, speech enhancement is required to suppress the background noise.
Phase spectrum compensation is an enhancement algorithm that enhances a speech signal using the phase spectrum information of the speech. Its basic idea is as follows: calculate the short-time amplitude spectrum of the noisy speech signal and the estimated short-time amplitude spectrum of the noise signal, compute a compensation term from the phase spectrum compensation function, and superpose this term on the spectrum of the noisy speech to obtain the compensated speech spectrum. When restoring the enhanced speech signal, the compensated spectrum is used to calculate the compensated phase, which is then combined with the amplitude spectrum of the noisy speech before the inverse discrete Fourier transform. The general form of the phase compensation function is
Λ(l,k) = λ · Ψ(k) · |D̂(l,k)|
where |D̂(l,k)| is the estimate of the noise amplitude spectrum, Ψ(k) is an antisymmetric function, and λ is the compensation factor; the conventional phase compensation factor λ is an empirical value, typically taken to be 3.74.
Phase compensation has the advantages of a small computational load, easy implementation, and a good enhancement effect. However, when current speech enhancement algorithms process non-stationary noise, the noise energy is uncertain, and a fixed compensation factor cannot be dynamically adjusted to a suitable value as the noise changes. Fixed parameters cannot fully exploit the phase compensation algorithm, nor can they be combined with a DNN to correct the speech phase spectrum. Therefore, the key points are to optimize the compensation function with fixed parameters, introduce supervised parameter learning, balance enhanced-speech distortion against the denoising effect, and improve the phase compensation algorithm so as to exploit its advantages fully.
Compared with traditional methods, speech enhancement based on deep neural networks markedly improves performance under non-stationary noise conditions and has been a research hotspot in the speech enhancement field in recent years. Existing work has mainly addressed the design of training features and training targets and improvements to the network architecture. According to how the training features and targets are designed, deep-neural-network speech enhancement methods can be divided into two types: time domain and frequency domain. Frequency-domain speech enhancement generally adopts the magnitude spectrum or the logarithmic power spectrum of the noisy speech as the training feature and, besides the magnitude spectrum and the logarithmic power spectrum, can use a magnitude-spectrum mask as the training target. Time-domain speech enhancement generally employs the time-domain waveforms of the noisy speech and the clean speech as the training feature and training target, respectively. However, the existing solutions still have many problems. Recent research shows that enhancing only the phase spectrum of the speech, while keeping the amplitude spectrum of the noisy speech unchanged, can effectively improve speech quality. Although time-domain speech enhancement directly uses the waveforms of noisy and clean speech as training features and targets, its performance depends heavily on the design of the loss function, and a complex loss function greatly increases the training difficulty. Conversely, adopting a simple time-domain minimum mean square error function as the loss function consumes a large amount of time in parameter tuning, easily causes speech distortion, affects the intelligibility of the speech signal, damages the speech signal, and may even reduce the signal-to-noise ratio.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a speech enhancement algorithm based on improved phase spectrum compensation and a full convolution neural network, which overcomes the defects of the prior art, adopts the logarithmic power spectrum of the noisy speech as the training characteristic, utilizes the improved phase spectrum compensation algorithm to carry out phase spectrum estimation on the noisy signal, obtains a phase spectrum compensation factor as one of the training targets of the network, and uses the logarithmic power spectrum of the pure speech signal as a common training target by matching with the unique loss function design. Considering that the training features have correlation in both time and frequency, the present invention uses convolutional neural networks to obtain better training effect, so as to enhance the speech signal.
The invention provides a voice enhancement algorithm based on improved phase spectrum compensation and a full convolution neural network, which comprises the following steps:
step 1, preprocessing the voice data of a training set to obtain characteristic data of noise-containing voice and pure voice;
step 2, combining the calculated characteristic data of the voice with noise and the pure voice, introducing a compensation factor calculation formula of a frame signal-to-noise ratio optimization phase compensation algorithm, and calculating a phase compensation factor by using the improved formula;
step 3, building a full convolution neural network model, using the solved phase compensation factor and the log power spectrum of the pure voice as a training target of a Full Convolution Neural Network (FCNN), and training the full convolution neural network model;
step 4, inputting the test voice into the trained model to obtain an estimated value of a logarithmic power spectrum and a phase compensation function;
and 5, respectively reconstructing the amplitude spectrum and the phase spectrum of the voice signal by using the estimated value of the logarithmic power spectrum and the phase compensation function obtained in the previous step to obtain the final enhanced voice.
The further optimized technical scheme of the invention is as follows:
in the step 1, the voice with noise and the pure voice are preprocessed, and the characteristic parameters of the voice with noise and the pure voice are extracted, specifically, the operations are as follows:
step 1-1, let y(n) denote the noisy speech signal, with y(n) = d(n) + x(n), where d(n) is the noise signal and x(n) is the clean signal; x(n) and d(n) are assumed statistically independent with zero mean. Transform the noisy speech into the frequency domain by windowing M samples of y(n) with window function w(n) and performing an M-point FFT (fast Fourier transform), obtaining the spectrum Y(l,k) of the noisy speech, where l is the frame index and k is the frequency index, k = 0, 1, 2, …, M−1. The spectrum X(l,k) of the clean speech x(n) and the spectrum D(l,k) of the noise signal d(n) are obtained in the same way;
step 1-2, representing the spectrum Y(l,k) of the noisy speech in polar coordinates, it can be divided into a magnitude spectrum and a phase spectrum, namely
Y(l,k) = |Y(l,k)| · e^{j∠Y(l,k)}
where |Y(l,k)| is the short-time amplitude spectrum of the noisy speech y(n) and ∠Y(l,k) is its phase spectrum; similarly, the short-time amplitude spectrum |X(l,k)| of the clean speech x(n) and the short-time amplitude spectrum |D(l,k)| of the noise signal d(n) can be obtained;
step 1-3, calculating the log power spectrum S(n) of the noisy speech and the log power spectrum T(n) of the clean speech using the following formulas,
S(n) = [log_e(|Y(n,1)|²), log_e(|Y(n,2)|²), …, log_e(|Y(n,k)|²), …, log_e(|Y(n,M−1)|²)]
T(n) = [log_e(|X(n,1)|²), log_e(|X(n,2)|²), …, log_e(|X(n,k)|²), …, log_e(|X(n,M−1)|²)]
where |Y(n,k)|² is the power of the k-th frequency band of the n-th frame of the noisy speech obtained by the short-time Fourier transform, and |X(n,k)|² is the corresponding power for the clean speech, for k = 1, 2, …, M−1.
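As an illustration of steps 1-1 through 1-3, the log-power-spectrum feature can be sketched as follows; the frame length M = 256, the hop size, the Hamming window, and the synthetic test signal are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def log_power_spectrum(signal, M=256, hop=128):
    """Frame the signal, apply a window w(n), take an M-point FFT,
    and return the log power spectrum of each frame (step 1-3)."""
    w = np.hamming(M)                      # window function w(n); an assumption
    n_frames = 1 + (len(signal) - M) // hop
    frames = np.stack([signal[i*hop:i*hop+M] * w for i in range(n_frames)])
    Y = np.fft.fft(frames, n=M, axis=1)    # spectrum Y(l, k), k = 0..M-1
    # S(n) = [log_e |Y(n,k)|^2 for each band k]; a small floor avoids log(0)
    return np.log(np.abs(Y) ** 2 + 1e-12)

# noisy speech y(n) = x(n) + d(n): a clean tone plus white noise (synthetic)
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(4000) / 16000)
y = x + 0.1 * rng.standard_normal(4000)
S = log_power_spectrum(y)
print(S.shape)   # one log-power vector of length M per frame
```

The same routine applied to x(n) yields T(n), the clean-speech target of step 1-3.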
In step 2, the phase compensation function is optimized, specifically, the operations are as follows:
step 2-1, the conventional phase compensation function is expressed as
Λ(l,k) = λ · Ψ(k) · |D̂(l,k)|
where Λ(l,k) is the phase compensation function and |D̂(l,k)| is the noise estimate, for which the amplitude spectrum |Y(l,k)| of the noisy speech can generally be substituted; λ is an empirical value, and Ψ(k) is an antisymmetric function used for phase correction, given by
Ψ(k) = +1 for 0 < k < M/2, and Ψ(k) = −1 for M/2 < k < M.
The action of Λ(l,k) is shown by the following formula,
Y_Λ(l,k) = Y(l,k) + Λ(l,k)
where Y_Λ(l,k) is the compensated spectrum. The phase spectrum ∠Y_Λ(l,k) is obtained by extracting the phase of the compensated spectrum:
∠Y_Λ(l,k) = arg(Y_Λ(l,k))
∠Y_Λ(l,k) is then combined with the magnitude spectrum |Y(l,k)| of the noisy speech to obtain the spectral expression of the enhanced speech:
S_Λ(l,k) = |Y(l,k)| · e^{j∠Y_Λ(l,k)}
where the exponential form of a complex number follows the Euler formula r·e^{jα}, with r the magnitude, α the phase, and j the imaginary unit;
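The traditional compensation of step 2-1 can be sketched numerically as follows; the frame length, the noise estimate (here simply |Y(l,k)|, as the text permits), and λ = 3.74 follow the description above, while the test data are arbitrary:

```python
import numpy as np

M = 8                                   # FFT length (illustrative)
lam = 3.74                              # conventional empirical compensation factor
# antisymmetric correction function Psi(k): +1 below M/2, -1 above
Psi = np.zeros(M)
Psi[1:M//2] = 1.0
Psi[M//2+1:] = -1.0

rng = np.random.default_rng(1)
Y = np.fft.fft(rng.standard_normal(M))  # spectrum of one noisy frame, Y(l, k)
D_hat = np.abs(Y)                       # noise magnitude estimate; the text notes
                                        # |Y(l,k)| may be substituted here

Lam = lam * Psi * D_hat                 # compensation term Lambda(l, k)
Y_comp = Y + Lam                        # compensated spectrum Y_Lambda(l, k)
phase = np.angle(Y_comp)                # compensated phase, arg(Y_Lambda(l, k))
# enhanced spectrum: noisy magnitude combined with compensated phase
S_enh = np.abs(Y) * np.exp(1j * phase)
print(np.round(phase, 3))
```

Note that the magnitude of S_enh is exactly that of the noisy spectrum; only the phase is altered, which is the defining property of phase spectrum compensation.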
step 2-2, introducing the frame signal-to-noise ratio to optimize the phase compensation function, in which the fixed factor λ is replaced by a function of the frame signal-to-noise ratio SNR_l, where SNR_l is the signal-to-noise ratio of the l-th frame and C is an empirical value, typically set to 2.7. Λ(l,k) then decreases as SNR_l increases: when the current frame is a speech frame, the influence of the phase compensation term Λ(l,k) on the noisy speech is reduced and more speech detail is preserved.
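One hedged reading of the frame signal-to-noise ratio SNR_l used in step 2-2 is the per-frame ratio of clean-speech power to noise power in dB (both signals are available at training time); the exact definition in the patent's formula images may differ, so this is an illustrative assumption only:

```python
import numpy as np

def frame_snr_db(X_mag, D_mag):
    """Per-frame SNR: ratio of clean-speech power to noise power
    in each frame l, expressed in dB (assumed definition)."""
    sig = np.sum(X_mag ** 2, axis=1)     # clean power per frame, sum_k |X(l,k)|^2
    noise = np.sum(D_mag ** 2, axis=1)   # noise power per frame, sum_k |D(l,k)|^2
    return 10.0 * np.log10(sig / noise)

rng = np.random.default_rng(2)
X = np.abs(rng.standard_normal((4, 16))) * 3.0   # |X(l,k)|, 4 frames (synthetic)
D = np.abs(rng.standard_normal((4, 16)))         # |D(l,k)|
snr = frame_snr_db(X, D)
print(snr.shape)  # one SNR value per frame l
```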
The step 2 further comprises: step 2-3, in order to simplify the training target of the subsequent neural network, the optimized phase compensation function is abbreviated as the following formula,
Λ(l,k)=Ψ(k)×Q(l,k)
the above equation is a product of a new compensation factor and a noise magnitude spectrum estimation value, and may also be referred to as a new phase compensation factor for short, and is also one of training targets of a subsequent neural network. Q (l, k) is a new phase compensation factor, an antisymmetric function is removed, and the Q (l, k) is used as one of training targets of a following neural network. The formula for Q (l, k) is expressed as,
. Q (l, k) is a total of N frames, Q (l, k) is M long two-dimensional data per frame in N frames (the symbols of all bands (l, k) are of this dimension). In actual operation, N frames of data with length of M in each frame are brought into operation, namely Q (l, k), with which Λ (l, k) can be solved, and spectrum S of enhanced voice can be obtained according to step 2-1Λ(l, k), and then performing ISTFT to obtain the time domain waveform of the enhanced voice.
In the step 3, a design of a training target and a loss function for integrating phase spectrum compensation is provided, and the specific method is as follows:
step 3-1, for frequency-domain speech enhancement, the logarithmic power spectrum of the noisy speech and the logarithmic power spectrum of the clean speech are adopted as the training feature and training target of the neural network, respectively. The network maps the noisy log power spectrum S_n to T̂_n, an approximation of the clean-speech log power spectrum T_n; T̂_n is then combined with the phase α_n of the n-th frame of the noisy speech to obtain the time-domain waveform of the n-th frame of the enhanced speech;
step 3-2, on the basis of frequency-domain speech enhancement, the phase compensation algorithm is fused in as a new training target, which comprises two parts: the logarithmic power spectrum T_n of the clean speech and the phase compensation factor Q_n, where Q_n denotes the n-th frame of Q(l,k), i.e. Q_n = [Q(n,0), Q(n,1), …, Q(n,M−1)]. From the estimated phase compensation factor Q̂_n the phase compensation function is obtained, and from it the enhanced phase of the n-th frame, which can replace the noisy phase α_n to achieve a better enhancement effect;
step 3-3, for the training target fusing the phase spectrum compensation and provided in the last step, defining the loss function of the neural network as
where MSE is the combined mean square error of the log-power-spectrum and phase-spectrum-compensation terms, which jointly drive the parameter updates; M is the frame length, T(n,k) is the log power spectrum of the n-th frame of clean speech, T̂(n,k) is the approximation of the clean-speech log power spectrum obtained by training the network on the log power spectrum of the n-th frame of noisy speech, and Q̂(n,k) is the approximation of the phase compensation factor obtained by the network; β is a constant modulation parameter that balances the influence of the log power spectrum and the phase compensation on the network;
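Assuming the combined mean square error of step 3-3 is a β-weighted sum of the squared errors on the two targets (the exact formula appears only as an image in the source, so this weighting is an assumption), a minimal sketch is:

```python
import numpy as np

def joint_mse(T, T_hat, Q, Q_hat, beta=0.5, M=None):
    """Combined per-frame MSE: log-power-spectrum error plus a
    beta-weighted phase-compensation-factor error (assumed form)."""
    M = M or T.shape[-1]                 # frame length
    return (np.sum((T - T_hat) ** 2) + beta * np.sum((Q - Q_hat) ** 2)) / M

T = np.array([1.0, 2.0, 3.0])            # clean log power spectrum (toy frame)
Q = np.array([0.5, -0.5, 0.0])           # phase compensation factor (toy frame)
loss = joint_mse(T, T + 0.1, Q, Q - 0.2, beta=0.5)
print(round(loss, 4))
```

With a perfect prediction the loss is zero; β tunes how strongly phase-compensation errors penalize the network relative to log-power-spectrum errors, matching the balancing role described above.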
3-4, building a full convolution neural network model which is mainly divided into an input layer, an encoder layer, a decoder layer and an output layer;
and 3-5, taking the logarithmic power spectrum of the phase compensation factor combined with the pure voice as a training target of the full convolution neural network, and training the full convolution neural network model to obtain the trained full convolution neural network model.
The neural network is trained with the conventional back-propagation algorithm, using the combined mean square error of step 3-3 as the loss function and the Adam algorithm for optimization; the initial learning rate is set to 0.05, the learning-rate drop period to 10 with a drop factor of 0.2, the mini-batch size (batch size) to 128, and the number of iterations (epochs) to 100. The trained model is obtained after this training.
In the step 3-4, the input layer is a matrix of 5 × 257, the length of the context representing the input features is 5 frames, each frame has 257 data points, and the input data is composed of the log power spectrum of the noisy speech of the adjacent 5 frames, so that the correlation of the input data in the two dimensions of time and frequency is increased.
The encoder layer consists of two convolutional layers and two dropout layers; each convolutional layer has three hyper-parameters: convolution kernel size, stride, and padding mode. The parameters of convolutional layer 1 are set as follows: kernel size 5; stride 1; 'same' padding; 32 convolution kernels; ReLU (Rectified Linear Unit) activation function. Adopting the ReLU activation function greatly reduces the computational load. After convolutional layer 1, the dropout ratio is set to 0.2. The parameters of convolutional layer 2 are set as follows: kernel size 7; 'same' padding; stride 1; 16 convolution kernels; ReLU activation; the dropout-layer parameter is again 0.2. The encoder first extracts shallow features of the noisy speech with convolutional layer 1, then extracts more abstract deep speech features with convolutional layer 2 for encoding.
The decoder layer is also composed of two convolutional layers and two dropout layers, and the network structure of the decoder layer is symmetrical to that of the encoder layer. The parameter settings of the convolution layer 3 are as follows: the convolution kernel size is 7; the filling mode is set to be a same mode; the moving step length is 1; the number of convolution kernels is 16; the activating function is a ReLU function; the parameters of the dropout layer are still set to 0.2. The convolutional layer 4 parameters are set as follows: the convolution kernel size is 5; the filling mode is a same mode; the moving step length is 1; the number of convolution kernels is 32; the activating function is a ReLU function; the parameters of the dropout layer are still set to 0.2. The decoder acts in reverse to the encoder to restore the features compressed by the encoder to produce clean speech features.
The output layer is a full connection layer, is directly connected with a decoder and is provided with 257 neurons, the activation function selects a linear function, and the output is 1 frame of voice characteristic data and 1 frame of phase spectrum compensation factors.
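The hyper-parameters of steps 3-4 can be collected into a framework-agnostic summary; the layer table below transcribes the stated values, and the check confirms the decoder mirrors the encoder as described:

```python
# Layer-by-layer summary of the FCNN described in steps 3-4
# (hyper-parameters transcribed from the text; this is a sketch, not a network).
fcnn = [
    {"layer": "input",   "shape": (5, 257)},  # 5 context frames x 257 points
    {"layer": "conv1",   "kernel": 5, "stride": 1, "pad": "same", "filters": 32, "act": "relu"},
    {"layer": "dropout", "rate": 0.2},
    {"layer": "conv2",   "kernel": 7, "stride": 1, "pad": "same", "filters": 16, "act": "relu"},
    {"layer": "dropout", "rate": 0.2},
    {"layer": "conv3",   "kernel": 7, "stride": 1, "pad": "same", "filters": 16, "act": "relu"},
    {"layer": "dropout", "rate": 0.2},
    {"layer": "conv4",   "kernel": 5, "stride": 1, "pad": "same", "filters": 32, "act": "relu"},
    {"layer": "dropout", "rate": 0.2},
    {"layer": "dense",   "units": 257, "act": "linear"},  # 1 frame of features + factors
]
# symmetry check: decoder (conv3, conv4) mirrors encoder (conv1, conv2)
enc = [(l["kernel"], l["filters"]) for l in fcnn if l["layer"] in ("conv1", "conv2")]
dec = [(l["kernel"], l["filters"]) for l in fcnn if l["layer"] in ("conv3", "conv4")]
print(enc, dec[::-1])
```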
In the step 4, the characteristic extraction is carried out on the voice with noise to be tested according to the method in the step 1, and the extracted characteristic parameters are input into the trained full convolution neural network model, so that the logarithmic power spectrum estimation value and the phase compensation factor of the voice signal can be obtained.
In the invention, the logarithmic power spectrum S of the noisy speech is input into the trained neural network model and propagated forward once, yielding the log-power-spectrum estimate and the phase compensation factor of the speech signal. Neural-network speech enhancement can be divided into a training part and an enhancement part. During training (which can be considered the laboratory environment), all the information of the noisy and clean speech is available and is used to train the model, i.e. the mapping S → (T, Q), which is the function the model implements. When enhancing speech (which can be considered the practical application), only the noisy speech information is needed, and the clean-speech information is estimated with the trained model obtained in the training stage; this step yields approximations (estimates) of the clean-speech information and of the compensation factor, from which the clean (enhanced) speech is restored. The estimates are substituted directly into the calculation of step 2-2.
In the step 5, the amplitude spectrum and the phase spectrum of the voice signal are reconstructed using the logarithmic power spectrum and the phase compensation factor (finally obtaining S_Λ(l,k) according to the method of step 2-1), and the final enhanced speech is obtained by the inverse short-time Fourier transform (ISTFT).
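The reconstruction of step 5 can be sketched as follows; the frame/hop lengths, the plain overlap-add synthesis, and the stand-in network outputs are illustrative assumptions:

```python
import numpy as np

def reconstruct(T_hat, phase, M=256, hop=128):
    """Rebuild a waveform from an estimated log power spectrum and a
    (compensated) phase spectrum via inverse FFT with overlap-add."""
    mag = np.exp(T_hat / 2.0)                 # |X| = exp(log|X|^2 / 2)
    spec = mag * np.exp(1j * phase)           # magnitude + phase -> complex spectrum
    frames = np.fft.ifft(spec, n=M, axis=1).real
    out = np.zeros(hop * (len(frames) - 1) + M)
    for i, f in enumerate(frames):            # overlap-add synthesis
        out[i * hop:i * hop + M] += f
    return out

rng = np.random.default_rng(3)
T_hat = rng.standard_normal((10, 256)) * 0.1  # stand-in for the network's estimate
phase = rng.uniform(-np.pi, np.pi, (10, 256)) # stand-in for the compensated phase
wav = reconstruct(T_hat, phase)
print(wav.shape)
```

In the algorithm itself, T_hat would be the network's log-power-spectrum estimate and the phase would come from the compensated spectrum S_Λ(l,k) of step 2-1.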
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
(1) aiming at the problem that the parameters of the phase spectrum compensation algorithm are fixed and cannot be dynamically adjusted, the invention introduces the frame signal-to-noise ratio and improves the phase compensation function, overcoming the defect that traditional phase compensation cannot follow changes in the noise; the improved function responds quickly to changes in the noise spectrum, is better suited to neural-network training and estimation, exploits the performance of the phase compensation algorithm to the maximum extent, and achieves a better enhancement effect;
(2) an improved phase compensation factor is introduced as one of training targets of the neural network, the defect that the phase is neglected in a traditional voice enhancement algorithm based on a frequency domain is overcome, and a better enhancement effect is achieved through simple and effective loss function design under the condition of the same data sample number.
In a word, the invention improves the noise elimination capability of the algorithm and simultaneously better ensures the speech intelligibility, thereby improving the overall effect of speech enhancement.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a block diagram of a full convolution neural network in accordance with the present invention.
Detailed Description
The technical scheme of the invention is explained in further detail below with reference to the drawings. The present embodiment is implemented on the premise of the technical solution of the invention; a detailed implementation and a specific operation process are given, but the scope of protection of the invention is not limited to the following embodiments.
The embodiment proposes a speech enhancement algorithm based on improved phase spectrum compensation and a full convolution neural network, as shown in fig. 1, which includes the following steps:
step 1, preprocessing the voice data of the training set to obtain the characteristic data of the voice with noise and the pure voice.
The speech used for training in the experiment was from the training set of TIMIT. 15 kinds of noise of NOISEX-92 were selected as the noise set. Under the conditions that signal-to-noise ratios are-5 dB, 0dB and 5dB, 100 pure voices and noise sets in the TIMIT training set are randomly mixed, 4500 voices containing noise are generated under each signal-to-noise ratio, and the voices are divided into a training set and a testing set according to the ratio of 9:1 to train a network model.
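The random mixing at −5, 0, and 5 dB can be sketched with the standard target-SNR scaling rule (file I/O and the TIMIT/NOISEX-92 corpora are omitted; the scaling rule is a common convention, not quoted from the patent):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the clean/noise power ratio equals snr_db,
    then add it to the clean speech."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(4)
clean = np.sin(2 * np.pi * 300 * np.arange(8000) / 16000)  # stand-in utterance
noise = rng.standard_normal(8000)                          # stand-in noise
for snr in (-5, 0, 5):                 # the three SNR conditions used in the text
    noisy = mix_at_snr(clean, noise, snr)
    d = noisy - clean
    got = 10 * np.log10(np.mean(clean ** 2) / np.mean(d ** 2))
    print(snr, round(got, 1))          # achieved SNR matches the target
```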
Preprocessing the voice with noise and the pure voice and extracting characteristic parameters of the voice with noise and the pure voice, wherein the method specifically comprises the following steps:
step 1-1, let y(n) denote the noisy speech signal, with y(n) = d(n) + x(n), where d(n) is the noise signal and x(n) is the clean signal; x(n) and d(n) are assumed statistically independent with zero mean. The noisy speech is transformed into the frequency domain by windowing M samples of y(n) with w(n) and performing an M-point FFT (fast Fourier transform), yielding the spectrum Y(l,k) of the noisy speech, where l is the frame index and k is the frequency index, k = 0, 1, 2, …, M−1.
The spectrum X(l, k) of the clean speech x(n) and the spectrum D(l, k) of the noise signal d(n) are obtained in the same way, namely:
window M samples of x(n) with w(n) and perform an M-point FFT to transform the clean speech into the frequency domain, giving the spectrum X(l, k) of the clean speech;
window M samples of d(n) with w(n) and perform an M-point FFT to transform the noise signal into the frequency domain, giving the spectrum D(l, k) of the noise signal.
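Step 1-1 amounts to a framed, windowed FFT; a minimal sketch follows (the Hamming window, frame length M = 512 and hop 256 are assumptions — the patent fixes only that each frame holds M samples):

```python
import numpy as np

def stft_frames(signal, M=512, hop=256):
    """Split a signal into frames, window each with w(n) (Hamming assumed),
    and take an M-point FFT, giving the spectrum Y(l, k) of step 1-1."""
    w = np.hamming(M)
    n_frames = 1 + (len(signal) - M) // hop
    Y = np.empty((n_frames, M), dtype=complex)
    for l in range(n_frames):
        frame = signal[l * hop : l * hop + M]
        Y[l] = np.fft.fft(w * frame, n=M)
    return Y
```

The same function applied to x(n) and d(n) yields X(l, k) and D(l, k).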
Step 1-2, representing the frequency spectrum Y (l, k) of the voice with noise on a polar coordinate, and dividing the frequency spectrum into a magnitude spectrum and a phase spectrum, namely
Y(l,k)=|Y(l,k)|ej∠Y(l,k)
where |Y(l, k)| is the short-time magnitude spectrum of the noisy speech y(n) and ∠Y(l, k) is its phase spectrum. Similarly, the short-time magnitude spectrum |X(l, k)| of the clean speech x(n) and the short-time magnitude spectrum |D(l, k)| of the noise signal d(n) can be obtained, namely:
the spectrum X(l, k) of the clean speech is represented in polar coordinates and divided into a magnitude spectrum and a phase spectrum, i.e.
X(l,k)=|X(l,k)|ej∠X(l,k)
where |X(l, k)| is the short-time magnitude spectrum of the clean speech x(n) and ∠X(l, k) is its phase spectrum; representing the spectrum D(l, k) of the noise signal in polar coordinates, it likewise divides into a magnitude spectrum and a phase spectrum, i.e.
D(l,k)=|D(l,k)|ej∠D(l,k)
where |D(l, k)| is the short-time magnitude spectrum of the noise signal d(n) and ∠D(l, k) is its phase spectrum.
Step 1-3, calculating the log power spectrum S (n) of the voice with noise and the log power spectrum T (n) of the pure voice by using the following formula
S(n)=[loge(|Y(n,1)|2),loge(|Y(n,2)|2),...,loge(|Y(n,k)|2),...,loge(|Y(n,M-1)|2)]
T(n)=[loge(|X(n,1)|2),loge(|X(n,2)|2),...,loge(|X(n,k)|2),...,loge(|X(n,M-1)|2)]
where |Y(n, 1)|² is the power of the 1st frequency band of the nth frame of the noisy speech obtained by the short-time Fourier transform, |Y(n, 2)|² is the power of its 2nd frequency band, |Y(n, k)|² the power of its kth frequency band, and |Y(n, M-1)|² the power of its (M-1)th frequency band; |X(n, 1)|², |X(n, 2)|², |X(n, k)|² and |X(n, M-1)|² are likewise the powers of the 1st, 2nd, kth and (M-1)th frequency bands of the nth frame of the clean speech obtained by the short-time Fourier transform.
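The log power spectra S(n) and T(n) of step 1-3 are elementwise natural logs of the squared magnitudes; a sketch (the eps floor is an added numerical safeguard, not part of the patent's formula):

```python
import numpy as np

def log_power_spectrum(Y, eps=1e-12):
    """Log power spectrum of step 1-3: log_e |Y(n, k)|^2 for each frame n
    and frequency bin k. eps guards against log(0)."""
    return np.log(np.abs(Y) ** 2 + eps)
```

Applied to the noisy and clean spectra, this gives the training feature S(n) and training target T(n).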
Step 2: combining the computed feature data of the noisy speech and the clean speech, the frame signal-to-noise ratio is introduced to optimize the compensation-factor formula of the phase compensation algorithm, and the phase compensation factor is calculated with the improved formula.
Introducing the frame signal-to-noise ratio improves the phase compensation function; the specific operations are as follows:
Step 2-1: the conventional phase compensation function can be expressed as
Λ(l, k) = λ·Ψ(k)·|D̂(l, k)|
where Λ(l, k) is the phase compensation function, λ is the phase compensation factor, and |D̂(l, k)| is the noise estimate, for which the magnitude spectrum |Y(l, k)| of the noisy speech can be substituted. λ is an empirical value that was originally a constant and is later made adaptive; when the model is trained, the noise estimate is replaced directly by the actual noise magnitude used, yielding the new compensation factor. Ψ(k) is an antisymmetric function used for phase correction, equal to +1 on the lower half of the frequency band and -1 on the upper half.
The action of Λ (l, k) is shown by the following formula,
YΛ(l,k)=Y(l,k)+Λ(l,k)
wherein YΛ(l, k) is the compensated spectrum; the phase spectrum ∠YΛ(l, k) shown in the following formula is obtained by extracting the phase of the compensated spectrum,
∠YΛ(l,k)=arg(YΛ(l,k))
The phase spectrum ∠YΛ(l, k) is then combined with the magnitude spectrum |Y(l, k)| of the noisy speech, giving the spectrum of the enhanced speech as
X̂(l, k) = |Y(l, k)|e^(j∠YΛ(l, k))
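The conventional PSC pipeline of step 2-1 can be sketched end to end; the closed form Λ(l, k) = λΨ(k)|D̂(l, k)| and the constant λ = 3.74 come from the PSC literature and are assumptions here, since the patent's equation images are not reproduced:

```python
import numpy as np

def psc_enhance_phase(Y, D_mag, lam=3.74):
    """Conventional phase spectrum compensation: add lam * Psi(k) * |D_hat|
    to the noisy spectrum, re-extract the phase, and recombine it with the
    original noisy magnitude. Psi(k) is the antisymmetric function: +1 on
    the lower half of the band, -1 on the upper half (DC and Nyquist bins
    are left uncompensated)."""
    M = Y.shape[1]
    psi = np.zeros(M)
    psi[1 : M // 2] = 1.0
    psi[M // 2 + 1 :] = -1.0
    Lam = lam * psi[None, :] * D_mag           # Λ(l, k)
    Y_comp = Y + Lam                           # Y_Λ(l, k) = Y(l, k) + Λ(l, k)
    phase = np.angle(Y_comp)                   # ∠Y_Λ(l, k)
    return np.abs(Y) * np.exp(1j * phase)      # |Y| with compensated phase
```

Note that only the phase changes: the magnitude of the output equals the noisy magnitude spectrum, as the text above describes.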
Step 2-2: the frame signal-to-noise ratio is introduced to optimize the phase compensation function, giving the optimized phase compensation function.
In the optimized function, c is an empirical value, typically set to 2.7, and Λ(l, k) decreases as the frame signal-to-noise ratio SNRl grows: when the current frame is a speech frame, the influence of the phase compensation function Λ(l, k) on the noisy speech is reduced, and more speech detail is preserved. SNRl is the signal-to-noise ratio of the lth frame.
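The frame signal-to-noise ratio SNRl itself can be computed from per-frame powers; how it enters the optimized compensation function is given by the patent's (unreproduced) formula, so only the SNR computation is sketched here:

```python
import numpy as np

def frame_snr_db(X_mag, D_mag, eps=1e-12):
    """Per-frame SNR used to scale the compensation in step 2-2:
    SNR_l = 10*log10( sum_k |X(l,k)|^2 / sum_k |D(l,k)|^2 ).
    X_mag, D_mag: magnitude spectra of shape (frames, bins)."""
    num = np.sum(X_mag ** 2, axis=1)
    den = np.sum(D_mag ** 2, axis=1) + eps
    return 10 * np.log10(num / den + eps)
```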
Step 2-3, in order to simplify the training target of the subsequent neural network, the optimized phase compensation function is abbreviated as the following formula,
Λ(l,k)=Ψ(k)×Q(l,k)
where Q(l, k) is the new compensation factor; the antisymmetric function is factored out, and Q(l, k) is taken as one of the training targets of the subsequent neural network. The calculation formula for Q(l, k) is expressed as,
and 3, building a full convolution neural network model, using the solved phase compensation factor and the log power spectrum of the pure voice as a training target of a Full Convolution Neural Network (FCNN), and training the full convolution neural network model.
The design of a training target and a loss function for integrating phase spectrum compensation is provided, and the specific method comprises the following steps:
Step 3-1: for frequency-domain speech enhancement, the log power spectrum of the noisy speech and the log power spectrum of the clean speech are used as the training feature and the training target of the neural network, respectively.
Here T̂n is the approximation of the clean-speech log power spectrum Tn obtained by training the network on the noisy-speech log power spectrum, and Sn is the log power spectrum of the noisy speech. T̂n is then combined with the phase αn of the nth frame of the noisy speech to obtain the time-domain waveform of the nth frame of the enhanced speech.
Step 3-2: on the basis of frequency-domain speech enhancement, the phase compensation algorithm is fused in as a new training target, which comprises two parts: the log power spectrum Tn of the clean speech and the phase compensation factor Qn.
After the estimated phase compensation factor Q̂n is adjusted, the phase compensation function is obtained, and the enhanced phase of the nth frame is derived from it; this enhanced phase replaces the noisy-speech phase αn, achieving an improved speech enhancement effect.
Step 3-3: for the training target fusing phase spectrum compensation proposed in the previous step, the loss function of the neural network is defined as
MSE = (1/M) Σ_{k=0}^{M-1} [ (T(n, k) - T̂(n, k))² + β(Q(n, k) - Q̂(n, k))² ]
where MSE is the joint mean square error of the log power spectrum and the phase spectrum compensation, so that both jointly influence the network parameters; M is the frame length; T(n, k) is the log power spectrum of the nth frame of the clean speech; T̂(n, k) is the approximation of the clean-speech log power spectrum obtained by training the network on the log power spectrum of the nth frame of the noisy speech; Q̂(n, k) is the approximation of the phase compensation factor obtained by neural network training; and β is a modulation parameter, a constant balancing the influence of the log power spectrum and the phase compensation on the network.
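The joint loss of step 3-3 can be sketched as follows; β = 0.1 is an illustrative value, and the per-frame averaging over M bins is an assumption about the unreproduced formula:

```python
import numpy as np

def joint_mse_loss(T, T_hat, Q, Q_hat, beta=0.1):
    """Joint MSE of the log power spectrum and the phase compensation factor
    for one frame, with beta balancing the two terms (a plausible reading
    of the patent's loss; the exact formula image is not reproduced)."""
    M = T.shape[-1]
    return float(np.sum((T - T_hat) ** 2 + beta * (Q - Q_hat) ** 2) / M)
```

With β = 0 this reduces to the plain log-power-spectrum MSE of step 3-1.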
Step 3-4, constructing a full convolution neural network model for training
The invention adopts a fully convolutional neural network (FCNN). Unlike a conventional convolutional neural network, the FCNN replaces the fully connected layers with additional convolutional layers and uses them to extract deeper, more abstract speech features. The FCNN model is broadly divided into an input layer, encoder layers, decoder layers, and an output layer.
The input layer of the model is a 5 × 257 matrix: the context length of the input features is 5 frames with 257 data points per frame, and the input is composed of the log power spectra of 5 adjacent frames of noisy speech, which increases the correlation of the input data in both the time and frequency dimensions.
The encoder layers of the model consist of two convolutional layers and two dropout layers. Each convolutional layer has three hyperparameters: kernel size, stride, and padding mode. Convolutional layer 1 is set as follows: kernel size 5; stride 1; 'same' padding; 32 kernels; ReLU (Rectified Linear Unit) activation, which greatly reduces the amount of computation. After convolutional layer 1, the dropout rate is set to 0.2. Convolutional layer 2 is set as follows: kernel size 7; 'same' padding; stride 1; 16 kernels; ReLU activation; the dropout rate is again 0.2. The encoder first extracts shallow features of the noisy speech through convolutional layer 1 and then extracts more abstract deep speech features through convolutional layer 2 for encoding.
The decoder layers of the model likewise consist of two convolutional layers and two dropout layers, with a network structure symmetric to that of the encoder. Convolutional layer 3 is set as follows: kernel size 7; 'same' padding; stride 1; 16 kernels; ReLU activation; dropout rate 0.2. Convolutional layer 4 is set as follows: kernel size 5; 'same' padding; stride 1; 32 kernels; ReLU activation; dropout rate 0.2. The decoder acts inversely to the encoder, restoring the features compressed by the encoder to produce clean speech features.
The output layer of the model is a fully connected layer directly connected to the decoder, containing 257 neurons with a linear activation function; the output is 1 frame of speech feature data and 1 frame of phase spectrum compensation factors.
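The arithmetic of one encoder convolution (kernel size 5, stride 1, 'same' padding, ReLU) can be sketched in plain numpy; this illustrates the layer configuration only and is not the patent's training code:

```python
import numpy as np

def conv1d_same_relu(x, kernels):
    """One FCNN convolutional layer as configured in step 3-4: stride 1,
    'same' padding, ReLU activation. x has shape (length, in_ch); kernels
    has shape (k, in_ch, out_ch) with odd k (5 or 7 above). A numpy sketch
    of the layer arithmetic, not a full training framework."""
    k, in_ch, out_ch = kernels.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))            # 'same' padding
    out = np.empty((x.shape[0], out_ch))
    for i in range(x.shape[0]):
        patch = xp[i:i + k]                          # (k, in_ch) receptive field
        out[i] = np.tensordot(patch, kernels, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)                      # ReLU
```

With 'same' padding and stride 1, the output length equals the input length, so the encoder/decoder stack preserves the 5 × 257 time-frequency layout.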
And 3-5, taking the logarithmic power spectrum of the phase compensation factor combined with the pure voice as a training target of the full convolution neural network, and training the full convolution neural network model to obtain the trained full convolution neural network model.
And 4, inputting the test voice into the trained model to obtain an estimated value of the logarithmic power spectrum and a phase compensation function.
Features of the noisy speech to be tested are extracted according to the method of step 1, and the extracted feature parameters are input into the trained full convolution neural network model to obtain the log power spectrum estimate and the phase compensation factor of the speech signal.
And 5, respectively reconstructing the amplitude spectrum and the phase spectrum of the voice signal by using the estimated value of the logarithmic power spectrum and the phase compensation function obtained in the previous step to obtain the final enhanced voice.
The magnitude spectrum and the phase spectrum of the speech signal are reconstructed using the log power spectrum and the phase compensation factor, and the final enhanced speech is obtained through the inverse short-time Fourier transform (ISTFT).
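The final resynthesis can be sketched as an inverse FFT per frame followed by overlap-add; the frame/hop sizes are assumptions, and window compensation (dividing by the summed squared window) is omitted for brevity:

```python
import numpy as np

def istft_overlap_add(spec, hop=256):
    """Resynthesize the enhanced waveform from the reconstructed spectrum
    (step 5) by inverse FFT and overlap-add. spec: (frames, M) complex."""
    n_frames, M = spec.shape
    out = np.zeros((n_frames - 1) * hop + M)
    for l in range(n_frames):
        frame = np.fft.ifft(spec[l]).real      # back to the time domain
        out[l * hop : l * hop + M] += frame    # overlap-add
    return out
```

The spectrum fed in here is the enhanced one, i.e. the magnitude recovered from the estimated log power spectrum combined with the compensated phase.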
Examples
To verify the effectiveness of the method, the traditional PSC speech enhancement algorithm, the traditional FCNN algorithm, and the proposed method were compared, testing algorithm performance under different types of interfering noise and different input signal-to-noise ratios. Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) were selected as the speech evaluation indices. PESQ is obtained by comparing the enhanced speech signal with the clean speech signal and takes values in [-0.5, 4.5]; a larger value indicates higher speech quality. STOI is likewise obtained by comparing the enhanced speech signal with the clean speech signal and takes values in [0, 1]; a larger value indicates higher speech intelligibility. The compared methods are as follows:
the method comprises the following steps: PSC
The second method comprises the following steps: FCNN algorithm
The third method comprises the following steps: the method of the invention
The three methods are respectively used for enhancing the noise-carrying voice with signal-to-noise ratios of-5 dB, 0dB and 5dB, and the noise types are factory2(fac), babble (bab) and buccaneer1 (buc). The results are shown in tables 1 to 3.
Table 1: -5 dB noise
Table 2: 0 dB noise
Table 3: 5 dB noise
As can be seen from the above results, compared with the conventional phase compensation algorithm (PSC), the method of the present invention can effectively suppress various types of noise in the speech signal, and better ensure the speech intelligibility. Compared with the FCNN algorithm, the method adds estimation of the phase compensation factor, and improves the effect under fac and buc noise conditions, so that the overall effect of speech enhancement is improved.
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can understand that the modifications or substitutions within the technical scope of the present invention are included in the scope of the present invention, and therefore, the scope of the present invention should be subject to the protection scope of the claims.
Claims (8)
1. A speech enhancement algorithm based on improved phase spectrum compensation and a full convolution neural network, comprising the steps of:
step 1, preprocessing the voice data of a training set to obtain characteristic data of noise-containing voice and pure voice;
step 2, combining the characteristic data of the voice with noise and the pure voice, introducing a compensation factor calculation formula of a frame signal-to-noise ratio optimization phase compensation algorithm, and calculating a phase compensation factor by using the improved formula;
step 3, building a full convolution neural network model, taking the phase compensation factor and the logarithmic power spectrum of the pure voice as a training target of the full convolution neural network, and training the full convolution neural network model;
step 4, inputting the test voice into the trained model to obtain an estimated value of a logarithmic power spectrum and a phase compensation function;
and 5, respectively reconstructing the amplitude spectrum and the phase spectrum of the voice signal by utilizing the estimated value of the logarithmic power spectrum and the phase compensation function to obtain the final enhanced voice.
2. The speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network as claimed in claim 1, wherein in step 1, the noisy speech and the clean speech are preprocessed and the feature parameters of the noisy speech and the clean speech are extracted, and the operations are as follows:
step 1-1, let y(n) denote the noisy speech signal, with y(n) = d(n) + x(n), where d(n) is the noise signal and x(n) is the clean signal; the noisy speech is transformed into the frequency domain by windowing M samples of y(n) with w(n) and performing an M-point FFT, giving the spectrum Y(l, k) of the noisy speech, where l is the frame index and k = 0, 1, 2, ..., M-1 indexes the frequency components; the spectrum X(l, k) of the clean speech x(n) and the spectrum D(l, k) of the noise signal d(n) are obtained in the same way;
step 1-2, representing the frequency spectrum Y (l, k) of the voice with noise on a polar coordinate, and dividing the frequency spectrum into a magnitude spectrum and a phase spectrum, namely
Y(l,k)=|Y(l,k)|ej∠Y(l,k)
wherein |Y(l, k)| is the short-time magnitude spectrum of the noisy speech y(n) and ∠Y(l, k) is its phase spectrum; similarly, the short-time magnitude spectrum |X(l, k)| of the clean speech x(n) and the short-time magnitude spectrum |D(l, k)| of the noise signal d(n) can be obtained;
step 1-3, calculating the log power spectrum S (n) of the noisy speech and the log power spectrum T (n) of the clean speech by using the following formula,
S(n)=[loge(|Y(n,1)|2),loge(|Y(n,2)|2),...,loge(|Y(n,k)|2),...,loge(|Y(n,M-1)|2)]
T(n)=[loge(|X(n,1)|2),loge(|X(n,2)|2),...,loge(|X(n,k)|2),...,loge(|X(n,M-1)|2)]
wherein |Y(n, 1)|² is the power of the 1st frequency band of the nth frame of the noisy speech obtained by the short-time Fourier transform, |Y(n, 2)|² is the power of its 2nd frequency band, |Y(n, k)|² the power of its kth frequency band, and |Y(n, M-1)|² the power of its (M-1)th frequency band; |X(n, 1)|², |X(n, 2)|², |X(n, k)|² and |X(n, M-1)|² are likewise the powers of the 1st, 2nd, kth and (M-1)th frequency bands of the nth frame of the clean speech obtained by the short-time Fourier transform.
3. The speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network according to claim 2, wherein in the step 2, the phase compensation function is optimized as follows:
step 2-1, the conventional phase compensation function is expressed as
wherein Λ(l, k) is the phase compensation function, λ is the phase compensation factor, and |D̂(l, k)| is the noise estimate, for which the magnitude spectrum |Y(l, k)| of the noisy speech can generally be substituted; λ is an empirical value, and Ψ(k) is an antisymmetric function used for phase correction, equal to +1 on the lower half of the frequency band and -1 on the upper half;
The action of Λ (l, k) is shown by the following formula,
YΛ(l,k)=Y(l,k)+Λ(l,k)
wherein YΛ(l, k) is the compensated spectrum; the phase spectrum ∠YΛ(l, k) shown in the following formula is obtained by extracting the phase of the compensated spectrum,
∠YΛ(l,k)=arg(YΛ(l,k))
The phase spectrum ∠YΛ(l, k) is combined with the magnitude spectrum |Y(l, k)| of the noisy speech to obtain the spectrum of the enhanced speech as X̂(l, k) = |Y(l, k)|e^(j∠YΛ(l, k));
Step 2-2, introducing frame signal-to-noise ratio to optimize a phase compensation function to obtain an optimized phase compensation function, namely
where c is an empirical value and SNRl is the signal-to-noise ratio of the lth frame.
4. The speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network as claimed in claim 3, wherein the step 2 further comprises a step 2-3 of simplifying the training goal of the subsequent neural network by simplifying the optimized phase compensation function as the following formula,
Λ(l,k)=Ψ(k)×Q(l,k)
wherein the calculation formula of Q (l, k) is expressed as,
5. the speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network as claimed in claim 4, wherein in the step 3, a training target and a loss function design for fusing the phase spectrum compensation are proposed, and the specific method is as follows:
step 3-1, for the voice enhancement of the frequency domain, adopting the logarithmic power spectrum of the voice with noise and the logarithmic power spectrum of the pure voice as the training characteristic and the training target of the neural network respectively, namely
wherein T̂n is the approximation of the clean-speech log power spectrum Tn obtained by training the network on the noisy-speech log power spectrum, and Sn is the log power spectrum of the noisy speech; the phase αn of the nth frame of the noisy speech is then combined to obtain the time-domain waveform of the nth frame of the enhanced speech, namely
Step 3-2, on the basis of voice enhancement of the frequency domain, a phase compensation algorithm is fused to serve as a new training target, and the new training target comprises a logarithmic power spectrum T of pure voicenAnd a phase compensation factor QnTwo parts, i.e.
Wherein the phase compensation factor is adjustedThen the phase compensation function is obtained, and the enhanced phase of the nth frame is obtained by the phase compensation function, so that the phase alpha of the voice with noise can be replacednAchieving the well-modified voice enhancement effect;
step 3-3, for the training target fusing the phase spectrum compensation and provided in the last step, defining the loss function of the neural network as
where MSE is the joint mean square error of the log power spectrum and the phase spectrum compensation, T(n, k) is the log power spectrum of the nth frame of the clean speech, T̂(n, k) is the approximation of the clean-speech log power spectrum obtained by training the network on the log power spectrum of the nth frame of the noisy speech, Q̂(n, k) is the approximation of the phase compensation factor obtained by neural network training, and β is a modulation parameter;
3-4, building a full convolution neural network model which is mainly divided into an input layer, an encoder layer, a decoder layer and an output layer;
and 3-5, taking the logarithmic power spectrum of the phase compensation factor combined with the pure voice as a training target of the full convolution neural network, and training the full convolution neural network model to obtain the trained full convolution neural network model.
6. The speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network as claimed in claim 5, wherein in the step 3-4, the input layer is a 5 x 257 matrix, the length of the context representing the input features is 5 frames, each frame has 257 data points, the input data is composed of the logarithmic power spectrum of the noisy speech of the adjacent 5 frames, and the correlation of the input data in the two dimensions of time and frequency is increased;
the encoder layer consists of two convolution layers and two dropout layers, wherein the convolution layers have three hyper-parameters which are respectively the size of a convolution kernel, a moving step length and a filling mode;
the decoder layer also comprises two convolution layers and two dropout layers, and the network structure of the decoder layer is symmetrical to that of the encoder layer;
the output layer is a full connection layer, is directly connected with a decoder and is provided with 257 neurons, the activation function selects a linear function, and the output is 1 frame of voice characteristic data and 1 frame of phase spectrum compensation factors.
7. The speech enhancement algorithm based on the improved phase spectrum compensation and the full convolution neural network as claimed in claim 6, wherein in step 4, the characteristic extraction is performed on the noisy speech to be tested according to the method in step 1, and the extracted characteristic parameters are input into the trained full convolution neural network model, so as to obtain the logarithm power spectrum estimation value and the phase compensation factor of the speech signal.
8. The speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network as claimed in claim 7, wherein in the step 5, the magnitude spectrum and the phase spectrum of the speech signal are reconstructed by using the logarithmic power spectrum and the phase compensation factor, and the final enhanced speech is obtained by performing the inverse short-time Fourier transform (ISTFT).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111534489.7A CN114242099A (en) | 2021-12-15 | 2021-12-15 | Speech enhancement algorithm based on improved phase spectrum compensation and full convolution neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114242099A true CN114242099A (en) | 2022-03-25 |