CN114242099A - Speech enhancement algorithm based on improved phase spectrum compensation and full convolution neural network - Google Patents


Info

Publication number
CN114242099A
CN114242099A
Authority
CN
China
Prior art keywords: spectrum, voice, speech, phase, neural network
Prior art date
Legal status: Pending
Application number
CN202111534489.7A
Other languages
Chinese (zh)
Inventor
邓立新
徐琦
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202111534489.7A
Publication of CN114242099A


Classifications

    • G: Physics
    • G10: Musical instruments; Acoustics
    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0232: Processing in the frequency domain
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks


Abstract

The invention provides a speech enhancement algorithm based on improved phase spectrum compensation and a fully convolutional neural network. Training-set speech data are preprocessed to obtain feature data of the noisy speech and the clean speech; combining these features, the frame signal-to-noise ratio is introduced to optimize the compensation-factor formula of the phase compensation algorithm, and the phase compensation factor is calculated with the improved formula; a fully convolutional neural network model is built, with the phase compensation factor and the log power spectrum of the clean speech as its training targets, and the network model is trained; test speech is input into the trained model to obtain estimates of the log power spectrum and of the phase compensation function; and the amplitude spectrum and the phase spectrum of the speech signal are reconstructed from these estimates respectively to obtain the final enhanced speech. The invention improves the noise-elimination capability of the algorithm while better preserving speech intelligibility, thereby improving the overall effect of speech enhancement.

Description

Speech enhancement algorithm based on improved phase spectrum compensation and full convolution neural network
Technical Field
The invention relates to speech enhancement, in particular to a speech enhancement algorithm based on improved phase spectrum compensation and a fully convolutional neural network, and belongs to the technical field of speech signal processing.
Background
It is known that speech is an important medium for communicating information between people, but communication by speech is constantly disturbed by various noises. Noisy speech not only increases listener fatigue and degrades speech communication quality, but also degrades the performance of speech processing systems based on feature-parameter extraction. Therefore, to reduce the effect of background noise on speech quality, speech enhancement is required to suppress the background noise.
Phase spectrum compensation is an enhancement algorithm that enhances the speech signal using its phase-spectrum information. Its basic idea is as follows: compute the short-time amplitude spectrum of the noisy speech and the estimated short-time amplitude spectrum of the noise, calculate a compensation term with the phase spectrum compensation function, and superpose that term on the spectrum of the noisy speech to obtain the compensated speech spectrum. When the enhanced speech signal is restored, the compensated phase is computed from the compensated spectrum, recombined with the amplitude spectrum of the noisy speech, and an inverse discrete Fourier transform is performed. The general form of the phase compensation function is

Λ(l,k) = λ · Ψ(k) · |D̂(l,k)|

where |D̂(l,k)| is the estimate of the noise amplitude spectrum, Ψ(k) is an antisymmetric function, and λ is the compensation factor; the conventional phase compensation factor λ is an empirical value, typically taken to be 3.74.
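As an illustration of this general form, the sketch below (a numpy illustration under our own assumptions, not code from the patent) builds the compensation term for one frame, adds it to the noisy spectrum, and recombines the compensated phase with the noisy magnitude:

```python
import numpy as np

def psi(M):
    # Antisymmetric correction function: +1 below the Nyquist bin, -1 above.
    p = np.zeros(M)
    p[1:M // 2] = 1.0
    p[M // 2 + 1:] = -1.0
    return p

def compensate_frame(Y, D_hat_mag, lam=3.74):
    """Phase-spectrum-compensate one frame.

    Y         : complex spectrum of the noisy frame (length M)
    D_hat_mag : estimated noise magnitude spectrum (length M)
    lam       : empirical compensation factor (3.74 per the text)
    """
    M = len(Y)
    Lam = lam * psi(M) * D_hat_mag          # compensation term
    Y_comp = Y + Lam                        # compensated spectrum
    phase = np.angle(Y_comp)                # compensated phase
    return np.abs(Y) * np.exp(1j * phase)   # noisy magnitude, new phase

# Toy usage: a random noisy frame and a flat noise estimate.
rng = np.random.default_rng(0)
M = 16
Y = rng.normal(size=M) + 1j * rng.normal(size=M)
S = compensate_frame(Y, D_hat_mag=np.ones(M))
print(np.allclose(np.abs(S), np.abs(Y)))    # True
```

Note that only the phase changes: the output magnitude equals the noisy magnitude, which is the point of phase spectrum compensation.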
Phase compensation has the advantages of a small amount of computation, easy implementation, and a good enhancement effect. However, when current speech enhancement algorithms face non-stationary noise, the noise energy is uncertain, and a fixed compensation factor cannot be dynamically adjusted to a suitable value as the noise changes. Fixed parameters cannot fully exploit the phase compensation algorithm, nor can they be combined with a DNN to correct the speech phase spectrum. Therefore, optimizing the fixed-parameter compensation function, introducing supervised parameter learning, and balancing enhanced-speech distortion against the denoising effect, so that the advantages of the phase compensation algorithm are fully exerted, are the key points.
Compared with traditional methods, speech enhancement based on deep neural networks markedly improves enhancement performance under non-stationary noise and has been a research hotspot in the field in recent years. Existing work has mainly addressed the design of training features and training targets and improvements of the network architecture. By the design of training features and targets, DNN-based speech enhancement divides into two classes, time-domain and frequency-domain. Frequency-domain enhancement generally adopts the magnitude spectrum or log power spectrum of the noisy speech as the training feature and, besides the magnitude spectrum and log power spectrum, can also use a magnitude-spectrum mask as the training target. Time-domain enhancement generally employs the time-domain waveforms of the noisy speech and the clean speech as training feature and training target respectively. However, existing solutions still have many problems. Recent research shows that enhancing only the phase spectrum of the speech while keeping the noisy magnitude spectrum unchanged can effectively improve speech quality. Although time-domain algorithms use the waveforms of noisy and clean speech directly, their performance depends strongly on the design of the loss function, and a complex loss function greatly increases training difficulty.
Conversely, adopting a simple time-domain minimum mean square error as the loss function consumes a large amount of time in parameter tuning and easily causes speech distortion, which harms the intelligibility of the speech signal, damages the signal, and can even reduce the signal-to-noise ratio.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a speech enhancement algorithm based on improved phase spectrum compensation and a fully convolutional neural network that overcomes the defects of the prior art: the log power spectrum of the noisy speech is adopted as the training feature; the improved phase spectrum compensation algorithm performs phase-spectrum estimation on the noisy signal, yielding a phase compensation factor as one training target of the network; and, together with a dedicated loss-function design, the log power spectrum of the clean speech serves as the joint training target. Considering that the training features are correlated in both time and frequency, the invention uses a convolutional neural network to obtain a better training effect and thereby enhance the speech signal.
The invention provides a voice enhancement algorithm based on improved phase spectrum compensation and a full convolution neural network, which comprises the following steps:
step 1, preprocessing the voice data of a training set to obtain characteristic data of noise-containing voice and pure voice;
step 2, combining the calculated characteristic data of the voice with noise and the pure voice, introducing a compensation factor calculation formula of a frame signal-to-noise ratio optimization phase compensation algorithm, and calculating a phase compensation factor by using the improved formula;
step 3, building a full convolution neural network model, using the solved phase compensation factor and the log power spectrum of the pure voice as a training target of a Full Convolution Neural Network (FCNN), and training the full convolution neural network model;
step 4, inputting the test voice into the trained model to obtain an estimated value of a logarithmic power spectrum and a phase compensation function;
and 5, respectively reconstructing the amplitude spectrum and the phase spectrum of the speech signal by using the estimates of the log power spectrum and the phase compensation function obtained in the previous step to obtain the final enhanced speech.
The further optimized technical scheme of the invention is as follows:
in the step 1, the voice with noise and the pure voice are preprocessed, and the characteristic parameters of the voice with noise and the pure voice are extracted, specifically, the operations are as follows:
step 1-1, let y (n) represent a noisy speech signal and y (n) ═ d (n) + x (n), where d (n) is a noise signal and x (n) is a clean signal. Let x (n) and d (n) be statistically independent and have a zero mean. Transforming the noisy speech into the frequency domain by windowing M samples of Y (n), wherein the window function is w (n), and performing FFT (fast fourier transform) of M points, to obtain a frequency spectrum Y (l, k) of the noisy speech, wherein l is a frame number label, k represents a frequency component, and k is 0, 1, 2. The same spectrum X (l, k) of the clean speech X (n) and the spectrum D (l, k) of the noise signal D (n) can be obtained;
step 1-2, representing the frequency spectrum Y (l, k) of the voice with noise on a polar coordinate, and dividing the frequency spectrum into a magnitude spectrum and a phase spectrum, namely
Y(l,k)=|Y(l,k)|ej∠Y(l,k)
Wherein | Y (l, k) | is a short-time amplitude spectrum of the voice Y (n) with noise, and | Y (l, k) is a phase spectrum; similarly, a short-time amplitude spectrum | X (l, k) | of pure voice X (n) and a short-time amplitude spectrum | D (l, k) | of a noise signal D (n) can be obtained;
step 1-3, the log power spectrum S(n) of the noisy speech and the log power spectrum T(n) of the clean speech are calculated by the following formulas,

S(n) = [log_e(|Y(n,1)|²), log_e(|Y(n,2)|²), …, log_e(|Y(n,k)|²), …, log_e(|Y(n,M−1)|²)]
T(n) = [log_e(|X(n,1)|²), log_e(|X(n,2)|²), …, log_e(|X(n,k)|²), …, log_e(|X(n,M−1)|²)]

where |Y(n,k)|² is the power of the k-th frequency band of the n-th frame obtained by the short-time Fourier transform of the noisy speech, and |X(n,k)|² is the power of the k-th frequency band of the n-th frame obtained by the short-time Fourier transform of the clean speech.
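The feature extraction of steps 1-1 to 1-3 can be sketched as follows (a minimal numpy illustration; the frame length, hop size, and Hamming window are assumptions, since the text does not fix them here):

```python
import numpy as np

def stft(x, M=256, hop=128):
    # Window M samples with w(n) (Hamming assumed) and take an M-point FFT per frame.
    w = np.hamming(M)
    n_frames = 1 + (len(x) - M) // hop
    frames = np.stack([x[l * hop : l * hop + M] * w for l in range(n_frames)])
    return np.fft.fft(frames, n=M, axis=1)      # shape (n_frames, M): Y(l, k)

def log_power_spectrum(X, eps=1e-12):
    # S(n) = [log_e |Y(n, k)|^2] per frame; eps guards against log(0).
    return np.log(np.abs(X) ** 2 + eps)

rng = np.random.default_rng(1)
clean = np.sin(2 * np.pi * 0.05 * np.arange(2048))
noisy = clean + 0.3 * rng.normal(size=2048)     # y(n) = x(n) + d(n)

Y = stft(noisy)                 # spectrum of the noisy speech
S = log_power_spectrum(Y)       # training feature S(n)
print(Y.shape, S.shape)         # (15, 256) (15, 256)
```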
In step 2, the phase compensation function is optimized; the specific operations are as follows:
step 2-1, the conventional phase compensation function is expressed as

Λ(l,k) = λ · Ψ(k) · |D̂(l,k)|

where Λ(l,k) is the phase compensation function and |D̂(l,k)| is the noise-amplitude-spectrum estimate, for which the amplitude spectrum |Y(l,k)| of the noisy speech can generally be substituted; λ is an empirical value, and Ψ(k) is an antisymmetric function used for phase correction, denoted as

Ψ(k) = +1 for 0 < k < M/2;  Ψ(k) = −1 for M/2 < k < M

The action of Λ(l,k) is shown by the following formula,

Y_Λ(l,k) = Y(l,k) + Λ(l,k)

where Y_Λ(l,k) is the compensated spectrum; the phase spectrum ∠Y_Λ(l,k) shown below is obtained by extracting the phase of the compensated spectrum,

∠Y_Λ(l,k) = arg(Y_Λ(l,k))

∠Y_Λ(l,k) is then combined with the magnitude spectrum |Y(l,k)| of the noisy speech to obtain the spectral expression of the enhanced speech as

S_Λ(l,k) = |Y(l,k)| · e^{j∠Y_Λ(l,k)}

where the exponential form of a complex number follows the Euler formula r·e^{jα}, r being the amplitude, α the phase, and j the imaginary unit;
step 2-2, the frame signal-to-noise ratio is introduced to optimize the phase compensation function, giving the optimized phase compensation function, namely

[equation image not reproduced]

where c is an empirical value, typically set to 2.7, and Λ(l,k) decreases as SNR_l, the signal-to-noise ratio of the l-th frame, increases: when the current frame is a speech frame, the influence of the phase compensation Λ(l,k) on the speech is reduced, and more speech detail is retained.
The step 2 further comprises: step 2-3, in order to simplify the training target of the subsequent neural network, the optimized phase compensation function is abbreviated as the following formula,

Λ(l,k) = Ψ(k) × Q(l,k)

i.e. the product of the antisymmetric function and a new term Q(l,k), which is the product of the new compensation factor and the noise-magnitude-spectrum estimate and may be referred to for short as the new phase compensation factor; with the antisymmetric function removed, Q(l,k) serves as one of the training targets of the subsequent neural network. The formula for Q(l,k) is expressed as

[equation image not reproduced]

Q(l,k) spans N frames in total; within each of the N frames, Q(l,k) is a length-M vector (all per-band quantities indexed by (l,k) have this dimension). In actual operation the N frames of length-M data are substituted into the computation, i.e. Q(l,k), from which Λ(l,k) can be solved; the spectrum S_Λ(l,k) of the enhanced speech is then obtained according to step 2-1, and an ISTFT is performed to obtain the time-domain waveform of the enhanced speech.
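With the antisymmetric function factored out, the enhancement of step 2-1 can be driven entirely by the network-friendly target Q(l,k), as this sketch shows (numpy assumed; Q is filled with dummy values here, since its exact formula is given only as an image in the source):

```python
import numpy as np

def psi(M):
    # Antisymmetric function: +1 below the Nyquist bin, -1 above.
    p = np.zeros(M)
    p[1:M // 2] = 1.0
    p[M // 2 + 1:] = -1.0
    return p

def enhance_from_Q(Y, Q):
    """Y: (N, M) noisy spectra; Q: (N, M) new phase compensation factors.
    Rebuilds the compensation term as psi(k) * Q(l, k), compensates, and
    recombines with the noisy magnitude as in step 2-1."""
    N, M = Y.shape
    Lam = psi(M)[None, :] * Q                # compensation term per frame
    phase = np.angle(Y + Lam)                # compensated phase
    return np.abs(Y) * np.exp(1j * phase)    # enhanced spectrum

rng = np.random.default_rng(2)
Y = rng.normal(size=(4, 8)) + 1j * rng.normal(size=(4, 8))
Q = np.full((4, 8), 0.5)                     # dummy stand-in for the trained target
S_enh = enhance_from_Q(Y, Q)
print(S_enh.shape)                           # (4, 8)
```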
In the step 3, a design of a training target and a loss function for integrating phase spectrum compensation is provided, and the specific method is as follows:
step 3-1, for frequency-domain speech enhancement, the log power spectrum of the noisy speech and the log power spectrum of the clean speech are adopted as the training feature and the training target of the neural network respectively, i.e.

S_n → T̂_n

where T̂_n is the approximation of the clean-speech log power spectrum T_n obtained by passing the noisy-speech log power spectrum S_n through the trained neural network; the phase α_n of the n-th frame of the noisy speech is then combined with it to obtain the time-domain waveform x̂_n of the n-th frame of the enhanced speech, i.e.

x̂_n = ISTFT( e^{T̂_n/2} · e^{jα_n} )
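The recombination in step 3-1 (magnitude recovered from the estimated log power spectrum, phase taken from the noisy speech) can be sketched per frame as follows (numpy; for the demonstration, the frame's own log power spectrum and phase stand in for the network estimate):

```python
import numpy as np

def frame_from_lps(T_hat, alpha):
    """One enhanced frame from an estimated log power spectrum and a noisy phase.

    T_hat : estimated clean log power spectrum of the frame, log_e |X|^2
    alpha : phase spectrum of the corresponding noisy frame
    """
    mag = np.exp(T_hat / 2.0)         # |X| = exp(T/2), since T = log_e |X|^2
    spec = mag * np.exp(1j * alpha)   # recombine magnitude with the phase
    return np.fft.ifft(spec).real     # back to the time domain

# Round trip on a known frame: recover x from its own LPS and phase.
x = np.array([1.0, 2.0, -1.0, 0.5])
X = np.fft.fft(x)
T = np.log(np.abs(X) ** 2)
x_rec = frame_from_lps(T, np.angle(X))
print(np.allclose(x_rec, x))          # True
```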
Step 3-2, on the basis of voice enhancement of the frequency domain, a phase compensation algorithm is fused to serve as a new training target, and the new training target comprises a logarithmic power spectrum T of pure voicenAnd a phase compensation factor QnTwo parts, i.e.
Figure BDA0003412113330000072
Wherein Q isnRepresents data of the n-th frame of Q (l, k) and Qn=Q(l,k)l=nQ (n, 0), Q (n, 1), Q (n, M-1)), by a phase compensation factor
Figure BDA0003412113330000073
Then the phase compensation function is obtained, and the enhanced phase of the nth frame is obtained by the phase compensation function, so that the phase alpha of the voice with noise can be replacednTo achieve better voice enhancement effect;
step 3-3, for the training target fusing phase spectrum compensation proposed in the previous step, the loss function of the neural network is defined as

MSE = (1/M) · Σ_k [ (T(n,k) − T̂(n,k))² + β · (Q(n,k) − Q̂(n,k))² ]

where MSE is the combined mean square error of the log power spectrum and the phase spectrum compensation, so both jointly influence the network parameters; M is the frame length; T(n,k) is the log power spectrum of the n-th frame of the clean speech; T̂(n,k) is its approximation obtained by passing the log power spectrum of the n-th frame of the noisy speech through the neural network; Q̂(n,k) is the approximation of the phase compensation factor obtained through neural network training; and β is a constant modulation parameter used to balance the influence of the log power spectrum and the phase compensation on the network;
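A direct numpy transcription of the combined loss described above (the β-weighted sum-of-squares form is our reconstruction from the surrounding description; β and the array values below are dummies):

```python
import numpy as np

def combined_mse(T, T_hat, Q, Q_hat, beta=0.5):
    """Joint MSE over one frame of M bands: log-power-spectrum error plus the
    beta-weighted phase-compensation-factor error."""
    M = len(T)
    return np.sum((T - T_hat) ** 2 + beta * (Q - Q_hat) ** 2) / M

T = np.array([1.0, 2.0, 3.0, 4.0])
T_hat = np.array([1.0, 2.0, 3.0, 4.0])     # perfect LPS estimate
Q = np.array([0.5, 0.5, 0.5, 0.5])
Q_hat = np.array([0.5, 0.5, 0.5, 1.5])     # one band off by 1.0
print(combined_mse(T, T_hat, Q, Q_hat))    # 0.125  (= 0.5 * 1.0 / 4)
```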
3-4, building a full convolution neural network model which is mainly divided into an input layer, an encoder layer, a decoder layer and an output layer;
and 3-5, taking the logarithmic power spectrum of the phase compensation factor combined with the pure voice as a training target of the full convolution neural network, and training the full convolution neural network model to obtain the trained full convolution neural network model.
The neural network is trained with the conventional back-propagation algorithm, using the combined mean square error of step 3-3 as the loss function and optimizing with the Adam algorithm; the initial learning rate is set to 0.05, the learning-rate drop period to 10 epochs with a drop factor of 0.2, the mini-batch size (batch size) to 128, and the number of iterations (epochs) to 100. The trained model is obtained after this training.
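The step-decay learning-rate schedule described above (initial rate 0.05, multiplied by a factor of 0.2 every 10 epochs, reading the "discarding parameter" as the drop factor) can be written as:

```python
def learning_rate(epoch, lr0=0.05, drop=0.2, period=10):
    # Step-decay schedule: multiply by `drop` once per `period` epochs.
    return lr0 * drop ** (epoch // period)

print(learning_rate(0))    # 0.05
print(learning_rate(10))   # ~0.01
print(learning_rate(25))   # ~0.002
```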
In the step 3-4, the input layer is a matrix of 5 × 257, the length of the context representing the input features is 5 frames, each frame has 257 data points, and the input data is composed of the log power spectrum of the noisy speech of the adjacent 5 frames, so that the correlation of the input data in the two dimensions of time and frequency is increased.
The encoder layer consists of two convolutional layers and two dropout layers; each convolutional layer has three hyper-parameters: kernel size, stride, and padding mode. Convolutional layer 1 is set as follows: kernel size 5; stride 1; "same" padding; 32 kernels; ReLU (Rectified Linear Unit) activation, whose adoption greatly reduces the amount of computation. After convolutional layer 1, the dropout rate is set to 0.2. Convolutional layer 2 is set as follows: kernel size 7; "same" padding; stride 1; 16 kernels; ReLU activation; the dropout parameter is again set to 0.2. The encoder first extracts shallow features of the noisy speech through convolutional layer 1 and then extracts more abstract deep speech features through convolutional layer 2 for encoding.
The decoder layer is also composed of two convolutional layers and two dropout layers, and the network structure of the decoder layer is symmetrical to that of the encoder layer. The parameter settings of the convolution layer 3 are as follows: the convolution kernel size is 7; the filling mode is set to be a same mode; the moving step length is 1; the number of convolution kernels is 16; the activating function is a ReLU function; the parameters of the dropout layer are still set to 0.2. The convolutional layer 4 parameters are set as follows: the convolution kernel size is 5; the filling mode is a same mode; the moving step length is 1; the number of convolution kernels is 32; the activating function is a ReLU function; the parameters of the dropout layer are still set to 0.2. The decoder acts in reverse to the encoder to restore the features compressed by the encoder to produce clean speech features.
The output layer is a full connection layer, is directly connected with a decoder and is provided with 257 neurons, the activation function selects a linear function, and the output is 1 frame of voice characteristic data and 1 frame of phase spectrum compensation factors.
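The "same" padding mode and ReLU activation described for the convolutional layers can be illustrated with a minimal one-dimensional sketch (numpy; the averaging kernels here are dummies, not trained weights, and a real layer would use 2D kernels across the 5×257 input):

```python
import numpy as np

def conv_same_relu(x, kernel):
    """Stride-1 'same'-mode cross-correlation (as in NN conv layers) + ReLU.
    Output length equals input length, matching the 'same' filling mode."""
    pad = len(kernel) // 2
    xp = np.pad(x, pad)
    y = np.array([xp[i:i + len(kernel)] @ kernel for i in range(len(x))])
    return np.maximum(y, 0.0)               # ReLU is cheap, as the text notes

x = np.linspace(-1.0, 1.0, 257)             # one 257-point feature row
h1 = conv_same_relu(x, np.ones(5) / 5)      # encoder conv1: kernel size 5
h2 = conv_same_relu(h1, np.ones(7) / 7)     # encoder conv2: kernel size 7
print(len(x), len(h1), len(h2))             # 257 257 257 ('same' keeps the size)
```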
In the step 4, the characteristic extraction is carried out on the voice with noise to be tested according to the method in the step 1, and the extracted characteristic parameters are input into the trained full convolution neural network model, so that the logarithmic power spectrum estimation value and the phase compensation factor of the voice signal can be obtained.
In the invention, the log power spectrum S of the noisy speech is input into the trained neural network model and one forward propagation is performed, yielding the log-power-spectrum estimate and the phase compensation factor of the speech signal. The neural-network-based speech enhancement process can be divided into two parts, training and enhancement. When training the model (which can be regarded as the laboratory environment), all information of both the noisy speech and the clean speech is available and is used to train a model, i.e. the mapping S → (T, Q) (the arrow can be regarded as a function f). When enhancing speech (which can be regarded as the practical application), only the noisy-speech information is needed, and the clean-speech information is estimated by the trained model obtained in the training stage, i.e.

S → (T̂, Q̂)

This step yields the approximation (estimate) T̂ of the clean-speech information and the compensation factor Q̂; the clean (enhanced) speech is restored from these approximations by substituting the estimates directly into the calculation of step 2-2.
In the step 5, the amplitude spectrum and the phase spectrum of the speech signal are reconstructed from the log power spectrum and the phase compensation factor (finally obtaining S_Λ(l,k) according to the method of step 2-1), and the final enhanced speech is obtained by the inverse short-time Fourier transform (ISTFT).
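The reconstruction of step 5 (magnitude from the log-power-spectrum estimate, phase from the compensated spectrum, then an inverse STFT) can be sketched as follows (numpy; the hop size and the dummy inputs are assumptions):

```python
import numpy as np

def istft_overlap_add(frames_spec, hop):
    # Inverse FFT per frame, then overlap-add back into one time signal.
    frames = np.fft.ifft(frames_spec, axis=1).real
    N, M = frames.shape
    x = np.zeros((N - 1) * hop + M)
    for l in range(N):
        x[l * hop : l * hop + M] += frames[l]
    return x

def reconstruct(T_hat, phase_comp, hop=128):
    """T_hat: (N, M) estimated clean log power spectra;
    phase_comp: (N, M) compensated phase spectra (from step 2)."""
    mag = np.exp(T_hat / 2.0)                  # |X| from T = log_e |X|^2
    return istft_overlap_add(mag * np.exp(1j * phase_comp), hop)

N, M = 4, 256
T_hat = np.zeros((N, M))                       # dummy: unit-magnitude spectra
phase = np.zeros((N, M))                       # dummy: zero phase
y = reconstruct(T_hat, phase)
print(y.shape)                                 # (640,)
```

A real implementation would also divide by the summed squared analysis window (COLA correction); the sketch assumes a rectangular window for brevity.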
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
(1) aiming at the problems that the phase spectrum compensation algorithm parameters are fixed and cannot be dynamically adjusted, the invention introduces the frame signal-to-noise ratio, improves the phase compensation function, overcomes the defect that the traditional phase compensation can not change along with the change of noise, can quickly correspond to the change of the noise spectrum, is more suitable for the training and estimation of a neural network, exerts the performance of the phase compensation algorithm to the maximum extent and achieves better enhancement effect;
(2) an improved phase compensation factor is introduced as one of training targets of the neural network, the defect that the phase is neglected in a traditional voice enhancement algorithm based on a frequency domain is overcome, and a better enhancement effect is achieved through simple and effective loss function design under the condition of the same data sample number.
In a word, the invention improves the noise elimination capability of the algorithm and simultaneously better ensures the speech intelligibility, thereby improving the overall effect of speech enhancement.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a block diagram of a full convolution neural network in accordance with the present invention.
Detailed Description
The technical scheme of the invention is further explained in detail with reference to the accompanying drawings. The present embodiment is implemented on the premise of the technical solution of the invention, and a detailed implementation manner and a specific operation process are given, but the scope of protection of the invention is not limited to the following embodiments.
The embodiment proposes a speech enhancement algorithm based on improved phase spectrum compensation and a full convolution neural network, as shown in fig. 1, which includes the following steps:
step 1, preprocessing the voice data of the training set to obtain the characteristic data of the voice with noise and the pure voice.
The speech used for training in the experiment was from the training set of TIMIT. 15 kinds of noise of NOISEX-92 were selected as the noise set. Under the conditions that signal-to-noise ratios are-5 dB, 0dB and 5dB, 100 pure voices and noise sets in the TIMIT training set are randomly mixed, 4500 voices containing noise are generated under each signal-to-noise ratio, and the voices are divided into a training set and a testing set according to the ratio of 9:1 to train a network model.
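Mixing clean utterances with noise at a prescribed SNR, as in the experiment above, amounts to scaling the noise before addition (a generic sketch; the TIMIT and NOISEX-92 corpora are replaced by synthetic arrays):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Scale the noise so that the clean-to-noise power ratio equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(3)
clean = np.sin(2 * np.pi * 0.01 * np.arange(8000))
noise = rng.normal(size=8000)

achieved = {}
for snr in (-5, 0, 5):                    # the three conditions used in the text
    noisy = mix_at_snr(clean, noise, snr)
    achieved[snr] = 10 * np.log10(np.mean(clean ** 2) /
                                  np.mean((noisy - clean) ** 2))
print(achieved)                           # each value matches its target SNR
```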
Preprocessing the voice with noise and the pure voice and extracting characteristic parameters of the voice with noise and the pure voice, wherein the method specifically comprises the following steps:
step 1-1, let y(n) denote the noisy speech signal, with y(n) = d(n) + x(n), where d(n) is the noise signal and x(n) is the clean signal; x(n) and d(n) are assumed statistically independent with zero mean. The noisy speech is transformed into the frequency domain by windowing M samples of y(n) with w(n) and performing an M-point FFT (fast Fourier transform), resulting in the spectrum Y(l,k) of the noisy speech, where l is the frame index and k = 0, 1, 2, …, M−1 indexes the frequency components.
In the same way, the spectrum X(l,k) of the clean speech x(n) and the spectrum D(l,k) of the noise signal d(n) are obtained. Namely:
window w (n) on M samples of X (n), and perform FFT of M points to transform the pure voice to frequency domain, so as to obtain the frequency spectrum X (l, k) of the pure voice;
and (3) windowing w (n) to the M samples of D (n), performing FFT of M points, and transforming the noise signal to a frequency domain to obtain a frequency spectrum D (l, k) of the noise signal.
Step 1-2, representing the frequency spectrum Y (l, k) of the voice with noise on a polar coordinate, and dividing the frequency spectrum into a magnitude spectrum and a phase spectrum, namely
Y(l,k)=|Y(l,k)|ej∠Y(l,k)
Wherein, | Y (l, k) | is the short-time amplitude spectrum of the noisy speech Y (n), and the phase spectrum thereof is ≈ Y (l, k). Similarly, a short-time amplitude spectrum | X (l, k) | of the clean speech X (n) and a short-time amplitude spectrum | D (l, k) | of the noise signal D (n) can be obtained. Namely:
the frequency spectrum X (l, k) of clean speech is represented on polar coordinates and can be divided into a magnitude spectrum and a phase spectrum, i.e. a spectrum with a high frequency
X(l,k)=|X(l,k)|ej∠X(l,k)
I X (l, k) I is a short-time amplitude spectrum of pure speech X (n), and a phase spectrum thereof is X (l, k); representing the frequency spectrum D (l, k) of the noise signal in polar coordinates, it can be divided into a magnitude spectrum and a phase spectrum, i.e.
D(l,k) = |D(l,k)| · e^{j∠D(l,k)}
where |D(l,k)| is the short-time magnitude spectrum of the noise signal d(n) and ∠D(l,k) is its phase spectrum.
Step 1-3, calculate the log power spectrum S(n) of the noisy speech and the log power spectrum T(n) of the clean speech using the following formulas:
S(n) = [log_e|Y(n,1)|², log_e|Y(n,2)|², ..., log_e|Y(n,k)|², ..., log_e|Y(n,M-1)|²]
T(n) = [log_e|X(n,1)|², log_e|X(n,2)|², ..., log_e|X(n,k)|², ..., log_e|X(n,M-1)|²]
where |Y(n,k)|² is the power of the k-th frequency band of the n-th frame of the noisy speech obtained by the short-time Fourier transform, and |X(n,k)|² is the power of the k-th frequency band of the n-th frame of the clean speech obtained by the short-time Fourier transform, for k = 1, 2, ..., M-1.
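Given the spectra above, each entry of S(n) and T(n) is the natural log of a squared magnitude. A minimal per-frame sketch follows; the small floor `eps` added to avoid log(0) is an implementation assumption, not part of the stated formulas.

```python
import numpy as np

def log_power_spectrum(spec_frame, eps=1e-12):
    """Compute [log_e|Y(n,1)|^2, ..., log_e|Y(n,M-1)|^2] for one frame.

    `spec_frame` is one row of the M-point FFT output; bins 1..M-1 are
    used, matching the indexing in the formulas above. `eps` guards
    against log(0).
    """
    power = np.abs(spec_frame[1:]) ** 2
    return np.log(power + eps)
```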
Step 2, combining the computed feature data of the noisy speech and the clean speech, introduce the frame signal-to-noise ratio into the compensation-factor formula of the phase spectrum compensation algorithm, and compute the phase compensation factor with the improved formula.
The frame signal-to-noise ratio is introduced to improve the phase compensation function; the specific operations are as follows:
step 2-1, the conventional phase compensation function can be expressed as
Λ(l,k) = λ · Ψ(k) · |D̂(l,k)|
where Λ(l,k) is the phase compensation function, λ is the phase compensation factor, and |D̂(l,k)| is the noise magnitude estimate, for which the magnitude spectrum |Y(l,k)| of the noisy speech can be substituted. λ is an empirical value, originally a constant, and is later modified into the frame-dependent form introduced in step 2-2.
During model training, the noise estimate |D̂(l,k)| is replaced directly by the actual noise magnitude Z(l,k), which is available in training, giving
Λ(l,k) = λ · Ψ(k) · Z(l,k)
so that the new compensation factor is
Q(l,k) = λ · Z(l,k)
Ψ(k) is an antisymmetric function for phase correction, denoted as
Ψ(k) = 1 for 0 < k < M/2; Ψ(k) = 0 for k = 0 and k = M/2; Ψ(k) = -1 for M/2 < k < M
The action of Λ(l,k) is shown by the following formula:
Y_Λ(l,k) = Y(l,k) + Λ(l,k)
where Y_Λ(l,k) is the compensated spectrum; extracting its phase yields the phase spectrum ∠Y_Λ(l,k):
∠Y_Λ(l,k) = arg(Y_Λ(l,k))
∠Y_Λ(l,k) is then combined with the magnitude spectrum |Y(l,k)| of the noisy speech to obtain the spectral expression of the enhanced speech:
X̃(l,k) = |Y(l,k)| · e^{j∠Y_Λ(l,k)}
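The compensate-then-recombine chain of step 2-1 can be sketched compactly in numpy. Assumptions: Ψ(k) takes the standard antisymmetric ±1 form, |Y(l,k)| stands in for the noise estimate as the text allows, and λ = 3.74 is only an illustrative constant, not a value fixed by the method.

```python
import numpy as np

def phase_spectrum_compensation(Y, lam=3.74):
    """Apply Lambda(l,k) = lam * Psi(k) * |D_hat(l,k)| frame by frame.

    Y is a (frames, M) complex STFT. Here |Y(l,k)| stands in for the
    noise estimate |D_hat(l,k)|; lam is an illustrative constant.
    Returns the enhanced spectrum |Y| * exp(j * angle(Y + Lambda)).
    """
    M = Y.shape[1]
    psi = np.zeros(M)
    psi[1:M // 2] = 1.0          # antisymmetric correction function Psi(k)
    psi[M // 2 + 1:] = -1.0
    Lam = lam * psi * np.abs(Y)  # Lambda(l,k)
    Y_comp = Y + Lam             # compensated spectrum Y_Lambda(l,k)
    phase = np.angle(Y_comp)     # angle of Y_Lambda(l,k)
    return np.abs(Y) * np.exp(1j * phase)
```

Note that only the phase changes: the enhanced spectrum keeps the noisy magnitude |Y(l,k)| exactly, which is the defining property of phase spectrum compensation.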
Step 2-2, the frame signal-to-noise ratio is introduced to optimize the phase compensation function, giving the optimized phase compensation function
Λ(l,k) = (c / (1 + SNR_l)) · Ψ(k) · |D̂(l,k)|
where c is an empirical value, typically set to 2.7, and SNR_l is the signal-to-noise ratio of the l-th frame. Λ(l,k) decreases as SNR_l increases, so when the current frame is a speech frame the influence of the phase compensation function Λ(l,k) on the noisy speech is reduced and more speech detail is preserved.
Step 2-3, to simplify the training target of the subsequent neural network, the optimized phase compensation function is rewritten in the following form:
Λ(l,k) = Ψ(k) × Q(l,k)
where Q(l,k) is the new compensation factor. With the antisymmetric function factored out, Q(l,k) is taken as one of the training targets of the neural network below. The formula for Q(l,k) is expressed as
Q(l,k) = (c / (1 + SNR_l)) · Z(l,k)
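The original gives the compensation-factor formula only as an equation image. One plausible reading, consistent with the stated behavior (compensation shrinking as the frame SNR grows, empirical constant c ≈ 2.7, actual noise magnitude Z available in training), is Q(l,k) = c/(1 + SNR_l) · Z(l,k). The sketch below implements that reading; the function name and all parameters are hypothetical.

```python
import numpy as np

def compensation_factor(noise_mag, snr_frame, c=2.7):
    """Hypothetical reconstruction: Q(l,k) = c / (1 + SNR_l) * Z(l,k).

    noise_mag: (frames, bins) noise magnitudes Z(l,k) (known in training).
    snr_frame: per-frame linear SNR values SNR_l. As SNR_l grows, the
    factor shrinks, matching the stated behavior for speech frames.
    """
    scale = c / (1.0 + np.asarray(snr_frame))   # one scalar per frame
    return scale[:, None] * noise_mag
```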
and 3, building a full convolution neural network model, using the solved phase compensation factor and the log power spectrum of the pure voice as a training target of a Full Convolution Neural Network (FCNN), and training the full convolution neural network model.
The design of a training target and a loss function for integrating phase spectrum compensation is provided, and the specific method comprises the following steps:
step 3-1, for frequency-domain speech enhancement, the log power spectrum of the noisy speech and the log power spectrum of the clean speech are adopted as the training feature and the training target of the neural network, respectively:
T̂_n = f(S_n)
where T̂_n is the approximation of the clean speech log power spectrum T_n obtained from the noisy speech log power spectrum S_n through neural network training, with f denoting the network mapping. The phase α_n of the n-th frame of the noisy speech is then combined with it to obtain the time-domain waveform x̂_n of the n-th frame of the enhanced speech: the magnitude is recovered as exp(T̂_n/2), and
x̂_n = IDFT(exp(T̂_n / 2) · e^{jα_n})
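Inverting one frame from an estimated log power spectrum and the noisy phase amounts to exponentiating half the log power to recover the magnitude, reattaching the phase, and taking an inverse FFT. This is a sketch of that relationship, not the patent's exact resynthesis code; it assumes full M-point vectors so the inverse FFT applies directly.

```python
import numpy as np

def frame_from_log_power(T_hat, alpha):
    """Recover one time-domain frame from a log power spectrum and a phase.

    Magnitude is exp(T_hat / 2), since T = log|X|^2, and the frame is
    the real part of the inverse FFT of magnitude * e^{j * alpha}.
    """
    magnitude = np.exp(np.asarray(T_hat) / 2.0)
    spectrum = magnitude * np.exp(1j * np.asarray(alpha))
    return np.real(np.fft.ifft(spectrum))
```

A round trip through the FFT confirms the relationship: taking T = log|X|² and α = ∠X of any frame reproduces that frame.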
Step 3-2, on the basis of frequency-domain speech enhancement, the phase compensation algorithm is fused in as a new training target, which comprises two parts: the log power spectrum T_n of the clean speech and the phase compensation factor Q_n, i.e.
(T̂_n, Q̂_n) = f(S_n)
The estimated phase compensation factor Q̂_n then yields the phase compensation function, from which the enhanced phase of the n-th frame is obtained; this replaces the noisy speech phase α_n and achieves a better speech enhancement effect.
Step 3-3, for the training target fusing phase spectrum compensation proposed in the previous step, the loss function of the neural network is defined as
MSE = (1/M) · Σ_k [ (T(n,k) - T̂(n,k))² + β · (Q(n,k) - Q̂(n,k))² ]
where MSE is the joint mean square error of the log power spectrum and the phase spectrum compensation, so that both jointly influence the parameters. M is the frame length, T(n,k) is the log power spectrum of the n-th frame of the clean speech, T̂(n,k) is the approximation of the clean speech log power spectrum obtained by training the network on the log power spectrum of the n-th frame of the noisy speech, and Q̂(n,k) is the approximation of the phase compensation factor obtained through neural network training. β is a modulation parameter, a constant that balances the influence of the log power spectrum and the phase compensation on the network.
Step 3-4, construct the fully convolutional neural network model for training.
The invention adopts a fully convolutional neural network (FCNN). Unlike an ordinary convolutional neural network, the FCNN replaces the fully connected layers with additional convolutional layers, which are used to extract deeper, more abstract speech features. The FCNN model is divided into an input layer, an encoder layer, a decoder layer, and an output layer.
The input layer of the model is a 5 × 257 matrix: the context length of the input features is 5 frames, each frame containing 257 data points. The input is composed of the log power spectra of 5 adjacent frames of noisy speech, which increases the correlation of the input data in both the time and frequency dimensions.
The encoder layer of the model consists of two convolutional layers and two dropout layers. Each convolutional layer has three hyperparameters: the convolution kernel size, the stride, and the padding mode. Convolutional layer 1 is set as follows: kernel size 5; stride 1; 'same' padding; 32 convolution kernels; ReLU (Rectified Linear Unit) activation, which greatly reduces the amount of computation. After convolutional layer 1, the dropout rate is set to 0.2. Convolutional layer 2 is set as follows: kernel size 7; 'same' padding; stride 1; 16 convolution kernels; ReLU activation; the dropout rate is again 0.2. The encoder first extracts shallow features of the noisy speech through convolutional layer 1, then extracts more abstract deep speech features through convolutional layer 2 for encoding.
The decoder layer of the model likewise consists of two convolutional layers and two dropout layers; its network structure is symmetric to that of the encoder layer. Convolutional layer 3 is set as follows: kernel size 7; 'same' padding; stride 1; 16 convolution kernels; ReLU activation; dropout rate 0.2. Convolutional layer 4 is set as follows: kernel size 5; 'same' padding; stride 1; 32 convolution kernels; ReLU activation; dropout rate 0.2. The decoder acts inversely to the encoder, restoring the features compressed by the encoder to produce clean speech features.
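The 'same' padding mode used throughout the encoder and decoder keeps each feature map the same length as its input regardless of kernel size. A minimal 1-D numpy illustration follows (this is not the actual network; like most CNN frameworks it computes cross-correlation rather than flipped convolution):

```python
import numpy as np

def conv1d_same(x, kernel):
    """1-D CNN-style convolution with stride 1 and 'same' zero padding.

    Output length equals input length for any odd kernel size, which is
    why the layers can stack kernel sizes 5 and 7 without shrinking
    the 257-point feature frames.
    """
    pad = len(kernel) // 2
    xp = np.pad(x, pad)                      # zero-pad both ends
    return np.array([np.dot(xp[i:i + len(kernel)], kernel)
                     for i in range(len(x))])
```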
The output layer of the model is a fully connected layer directly connected to the decoder, containing 257 neurons with a linear activation function; the output is 1 frame of speech feature data and 1 frame of phase spectrum compensation factors.
Step 3-5, taking the phase compensation factor combined with the log power spectrum of the clean speech as the training target of the fully convolutional neural network, the model is trained to obtain the trained FCNN model.
Step 4, input the test speech into the trained model to obtain the estimated log power spectrum and the phase compensation function.
Features are extracted from the noisy speech to be tested by the method of step 1, and the extracted feature parameters are input into the trained fully convolutional neural network model to obtain the log power spectrum estimate and the phase compensation factor of the speech signal.
Step 5, reconstruct the magnitude spectrum and the phase spectrum of the speech signal from the log power spectrum estimate and the phase compensation function obtained in the previous step to obtain the final enhanced speech.
The magnitude spectrum and the phase spectrum of the speech signal are reconstructed using the log power spectrum and the phase compensation factor, and the final enhanced speech is obtained through the inverse short-time Fourier transform (ISTFT).
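The final resynthesis, an inverse FFT per frame followed by overlap-add, can be sketched as below. A periodic Hann analysis window at 50% overlap (which sums to a constant) is the assumed configuration; no synthesis window is applied in this sketch.

```python
import numpy as np

def istft_overlap_add(spectrum, hop):
    """Inverse STFT by per-frame IFFT and overlap-add.

    spectrum: (num_frames, M) complex STFT of the enhanced speech.
    Each frame is inverse-transformed and added into the output at
    offset l * hop.
    """
    num_frames, M = spectrum.shape
    out = np.zeros(hop * (num_frames - 1) + M)
    for l in range(num_frames):
        frame = np.real(np.fft.ifft(spectrum[l]))
        out[l * hop : l * hop + M] += frame
    return out
```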
Examples
To verify the effectiveness of the method, the conventional PSC speech enhancement algorithm, the conventional FCNN algorithm, and the present method were compared, testing algorithm performance under different types of interference noise and different input signal-to-noise ratios. Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) were selected as the speech evaluation indexes. PESQ is obtained by comparing the enhanced speech signal with the clean speech signal and takes values in [-0.5, 4.5]; a larger value indicates higher speech quality. STOI is likewise obtained by comparing the enhanced speech signal with the clean speech signal and takes values in [0, 1]; a larger value indicates higher speech intelligibility. The compared methods are as follows:
the method comprises the following steps: PSC
The second method comprises the following steps: FCNN algorithm
The third method comprises the following steps: the method of the invention
The three methods are each used to enhance noisy speech at signal-to-noise ratios of -5 dB, 0 dB, and 5 dB; the noise types are factory2 (fac), babble (bab), and buccaneer1 (buc). The results are shown in Tables 1 to 3.
Table 1: -5 dB noise (table image not reproduced)
Table 2: 0 dB noise (table image not reproduced)
Table 3: 5 dB noise (table image not reproduced)
As can be seen from the above results, compared with the conventional phase spectrum compensation algorithm (PSC), the method of the invention effectively suppresses various types of noise in the speech signal and better preserves speech intelligibility. Compared with the FCNN algorithm, the method adds estimation of the phase compensation factor and improves the results under the fac and buc noise conditions, so the overall speech enhancement effect is improved.
The above description is only an embodiment of the present invention, and the scope of the present invention is not limited thereto. Any modification or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall fall within the scope of the present invention; therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A speech enhancement algorithm based on improved phase spectrum compensation and a full convolution neural network, comprising the steps of:
step 1, preprocessing the voice data of a training set to obtain characteristic data of noise-containing voice and pure voice;
step 2, combining the characteristic data of the voice with noise and the pure voice, introducing a compensation factor calculation formula of a frame signal-to-noise ratio optimization phase compensation algorithm, and calculating a phase compensation factor by using the improved formula;
step 3, building a full convolution neural network model, taking the phase compensation factor and the logarithmic power spectrum of the pure voice as a training target of the full convolution neural network, and training the full convolution neural network model;
step 4, inputting the test voice into the trained model to obtain an estimated value of a logarithmic power spectrum and a phase compensation function;
and 5, respectively reconstructing the amplitude spectrum and the phase spectrum of the voice signal by utilizing the estimated value of the logarithmic power spectrum and the phase compensation function to obtain the final enhanced voice.
2. The speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network as claimed in claim 1, wherein in step 1, the noisy speech and the clean speech are preprocessed and the feature parameters of the noisy speech and the clean speech are extracted, and the operations are as follows:
step 1-1, let y(n) denote the noisy speech signal, y(n) = d(n) + x(n), where d(n) is the noise signal and x(n) is the clean signal; the noisy speech is transformed to the frequency domain by windowing M samples of y(n) with w(n) and performing an M-point FFT, yielding the spectrum Y(l,k) of the noisy speech, where l is the frame index and k is the frequency bin, k = 0, 1, 2, ..., M-1; in the same way the spectrum X(l,k) of the clean speech x(n) and the spectrum D(l,k) of the noise signal d(n) can be obtained;
step 1-2, representing the frequency spectrum Y (l, k) of the voice with noise on a polar coordinate, and dividing the frequency spectrum into a magnitude spectrum and a phase spectrum, namely
Y(l,k) = |Y(l,k)| · e^{j∠Y(l,k)}
wherein |Y(l,k)| is the short-time magnitude spectrum of the noisy speech y(n) and ∠Y(l,k) is its phase spectrum; similarly, the short-time magnitude spectrum |X(l,k)| of the clean speech x(n) and the short-time magnitude spectrum |D(l,k)| of the noise signal d(n) can be obtained;
step 1-3, calculating the log power spectrum S (n) of the noisy speech and the log power spectrum T (n) of the clean speech by using the following formula,
S(n) = [log_e|Y(n,1)|², log_e|Y(n,2)|², ..., log_e|Y(n,k)|², ..., log_e|Y(n,M-1)|²]
T(n) = [log_e|X(n,1)|², log_e|X(n,2)|², ..., log_e|X(n,k)|², ..., log_e|X(n,M-1)|²]
wherein |Y(n,k)|² is the power of the k-th frequency band of the n-th frame of the noisy speech obtained by the short-time Fourier transform, and |X(n,k)|² is the power of the k-th frequency band of the n-th frame of the clean speech obtained by the short-time Fourier transform, for k = 1, 2, ..., M-1.
3. The speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network according to claim 2, wherein in the step 2, the phase compensation function is optimized as follows:
step 2-1, the conventional phase compensation function is expressed as
Λ(l,k) = λ · Ψ(k) · |D̂(l,k)|
wherein Λ(l,k) is the phase compensation function, λ is the phase compensation factor, and |D̂(l,k)| is the noise estimate, for which the magnitude spectrum |Y(l,k)| of the noisy speech can generally be substituted; λ is an empirical value; Ψ(k) is an antisymmetric function for phase correction, denoted as
Ψ(k) = 1 for 0 < k < M/2; Ψ(k) = 0 for k = 0 and k = M/2; Ψ(k) = -1 for M/2 < k < M
The action of Λ(l,k) is shown by the following formula:
Y_Λ(l,k) = Y(l,k) + Λ(l,k)
wherein Y_Λ(l,k) is the compensated spectrum; extracting its phase yields the phase spectrum ∠Y_Λ(l,k):
∠Y_Λ(l,k) = arg(Y_Λ(l,k))
∠Y_Λ(l,k) is combined with the magnitude spectrum |Y(l,k)| of the noisy speech to obtain the spectral expression of the enhanced speech:
X̃(l,k) = |Y(l,k)| · e^{j∠Y_Λ(l,k)}
Step 2-2, introducing frame signal-to-noise ratio to optimize a phase compensation function to obtain an optimized phase compensation function, namely
Λ(l,k) = (c / (1 + SNR_l)) · Ψ(k) · |D̂(l,k)|
wherein c is an empirical value and SNR_l is the signal-to-noise ratio of the l-th frame.
4. The speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network as claimed in claim 3, wherein the step 2 further comprises a step 2-3: to simplify the training target of the subsequent neural network, the optimized phase compensation function is rewritten in the following form,
Λ(l,k)=Ψ(k)×Q(l,k)
wherein Q(l,k) is a new compensation factor, whose calculation formula is expressed as
Q(l,k) = (c / (1 + SNR_l)) · |D̂(l,k)|
5. the speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network as claimed in claim 4, wherein in the step 3, a training target and a loss function design for fusing the phase spectrum compensation are proposed, and the specific method is as follows:
step 3-1, for the voice enhancement of the frequency domain, adopting the logarithmic power spectrum of the voice with noise and the logarithmic power spectrum of the pure voice as the training characteristic and the training target of the neural network respectively, namely
T̂_n = f(S_n)
wherein T̂_n is the approximation of the clean speech log power spectrum T_n obtained from the noisy speech log power spectrum S_n through neural network training, f denoting the network mapping; the phase α_n of the n-th frame of the noisy speech is then combined with it to obtain the time-domain waveform x̂_n of the n-th frame of the enhanced speech, namely
x̂_n = IDFT(exp(T̂_n / 2) · e^{jα_n})
step 3-2, on the basis of the frequency-domain speech enhancement, the phase compensation algorithm is fused in as a new training target, which comprises two parts, the log power spectrum T_n of the clean speech and the phase compensation factor Q_n, i.e.
(T̂_n, Q̂_n) = f(S_n)
wherein the estimated phase compensation factor Q̂_n then yields the phase compensation function, from which the enhanced phase of the n-th frame is obtained, so that the phase α_n of the noisy speech can be replaced, achieving a better speech enhancement effect;
step 3-3, for the training target fusing phase spectrum compensation proposed in the previous step, the loss function of the neural network is defined as
MSE = (1/M) · Σ_k [ (T(n,k) - T̂(n,k))² + β · (Q(n,k) - Q̂(n,k))² ]
wherein MSE is the joint mean square error of the log power spectrum and the phase spectrum compensation, M is the frame length, T(n,k) is the log power spectrum of the n-th frame of the clean speech, T̂(n,k) is the approximation of the clean speech log power spectrum obtained by training the network on the log power spectrum of the n-th frame of the noisy speech, Q̂(n,k) is the approximation of the phase compensation factor obtained through neural network training, and β is a modulation parameter;
3-4, building a full convolution neural network model which is mainly divided into an input layer, an encoder layer, a decoder layer and an output layer;
and 3-5, taking the logarithmic power spectrum of the phase compensation factor combined with the pure voice as a training target of the full convolution neural network, and training the full convolution neural network model to obtain the trained full convolution neural network model.
6. The speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network as claimed in claim 5, wherein in the step 3-4, the input layer is a 5 x 257 matrix, the length of the context representing the input features is 5 frames, each frame has 257 data points, the input data is composed of the logarithmic power spectrum of the noisy speech of the adjacent 5 frames, and the correlation of the input data in the two dimensions of time and frequency is increased;
the encoder layer consists of two convolution layers and two dropout layers, wherein the convolution layers have three hyper-parameters which are respectively the size of a convolution kernel, a moving step length and a filling mode;
the decoder layer also comprises two convolution layers and two dropout layers, and the network structure of the decoder layer is symmetrical to that of the encoder layer;
the output layer is a full connection layer, is directly connected with a decoder and is provided with 257 neurons, the activation function selects a linear function, and the output is 1 frame of voice characteristic data and 1 frame of phase spectrum compensation factors.
7. The speech enhancement algorithm based on the improved phase spectrum compensation and the full convolution neural network as claimed in claim 6, wherein in step 4, the characteristic extraction is performed on the noisy speech to be tested according to the method in step 1, and the extracted characteristic parameters are input into the trained full convolution neural network model, so as to obtain the logarithm power spectrum estimation value and the phase compensation factor of the speech signal.
8. The speech enhancement algorithm based on the modified phase spectrum compensation and the full convolution neural network as claimed in claim 7, wherein in the step 5, the magnitude spectrum and the phase spectrum of the speech signal are reconstructed by using the logarithmic power spectrum and the phase compensation factor, and the final enhanced speech is obtained by the inverse short-time Fourier transform (ISTFT).
CN202111534489.7A 2021-12-15 2021-12-15 Speech enhancement algorithm based on improved phase spectrum compensation and full convolution neural network Pending CN114242099A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111534489.7A CN114242099A (en) 2021-12-15 2021-12-15 Speech enhancement algorithm based on improved phase spectrum compensation and full convolution neural network


Publications (1)

Publication Number Publication Date
CN114242099A true CN114242099A (en) 2022-03-25

Family

ID=80756355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111534489.7A Pending CN114242099A (en) 2021-12-15 2021-12-15 Speech enhancement algorithm based on improved phase spectrum compensation and full convolution neural network

Country Status (1)

Country Link
CN (1) CN114242099A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115295024A (en) * 2022-04-11 2022-11-04 维沃移动通信有限公司 Signal processing method, signal processing device, electronic apparatus, and medium
GB2620747A (en) * 2022-07-19 2024-01-24 Samsung Electronics Co Ltd Method and apparatus for speech enhancement
CN115295001A (en) * 2022-07-26 2022-11-04 中国科学技术大学 Single-channel speech enhancement method based on progressive fusion correction network
CN115295001B (en) * 2022-07-26 2024-05-10 中国科学技术大学 Single-channel voice enhancement method based on progressive fusion correction network
CN115497492A (en) * 2022-08-24 2022-12-20 珠海全视通信息技术有限公司 Real-time voice enhancement method based on full convolution neural network
CN115881148A (en) * 2022-11-15 2023-03-31 中国科学院声学研究所 Acoustic feedback cancellation method based on deep learning
CN115881148B (en) * 2022-11-15 2024-01-26 中国科学院声学研究所 Acoustic feedback cancellation method based on deep learning
CN116052706A (en) * 2023-03-30 2023-05-02 苏州清听声学科技有限公司 Low-complexity voice enhancement method based on neural network


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination