CN116913303A - Single-channel voice enhancement method based on step-by-step amplitude compensation network - Google Patents
- Publication number
- CN116913303A CN116913303A CN202310969308.6A CN202310969308A CN116913303A CN 116913303 A CN116913303 A CN 116913303A CN 202310969308 A CN202310969308 A CN 202310969308A CN 116913303 A CN116913303 A CN 116913303A
- Authority
- CN
- China
- Prior art keywords
- branch
- amplitude
- spectrum
- complex
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Noise filtering with processing in the time domain
- G10L21/0232—Noise filtering with processing in the frequency domain
- G10L21/0264—Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
- G10L25/03—Speech or voice analysis characterised by the type of extracted parameters
- G10L25/18—Extracted parameters being spectral information of each sub-band
- G10L25/30—Analysis technique using neural networks
- G10L2021/02163—Only one microphone
- G06F18/2131—Feature extraction based on a transform domain processing, e.g. wavelet transform
- G06F18/253—Fusion techniques of extracted features
- G06F2123/02—Data types in the time domain, e.g. time-series data
- G06F2218/10—Feature extraction by analysing the shape of a waveform
- G06N3/0442—Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/048—Activation functions
- Y02D30/70—Reducing energy consumption in wireless communication networks
Abstract
The invention relates to a single-channel speech enhancement method based on a step-by-step amplitude compensation network. The method adopts a three-branch encoder-decoder structure comprising an amplitude spectrum estimation branch, a complex spectrum refinement branch and a time-domain waveform correction branch. The amplitude spectrum estimation branch filters out the dominant noise components, the complex spectrum refinement branch complements the missing spectral details, and the time-domain waveform branch corrects the other two branches while implicitly estimating phase information. To fully exploit the information of the three branches, a cross-domain information fusion module is proposed and embedded into all of them: it progressively extracts and fuses the features of the three branches and performs correction and amplitude compensation on the amplitude spectrum estimation branch and the complex spectrum refinement branch. The invention effectively mitigates the implicit compensation effect between amplitude and phase, improves the quality and intelligibility of the speech signal, and outperforms current state-of-the-art cross-domain speech enhancement methods and previous advanced systems.
Description
Technical Field
The invention relates to the field of voice enhancement, in particular to a single-channel voice enhancement method based on a step-by-step amplitude compensation network.
Background
Single-channel speech enhancement refers to removing background noise from a noisy speech signal captured by a single microphone. Because no speech signals from other microphones are available as references, it is a very challenging task. In recent years, speech enhancement methods based on deep learning have shown outstanding performance in this field; in particular, under non-stationary noise and low signal-to-noise ratios, deep learning methods clearly outperform traditional single-channel speech enhancement algorithms. Convolutional neural networks and recurrent neural networks are the two most commonly used architectures for speech enhancement.
In 2020, a deep complex neural network was proposed that combines a complex-valued convolutional neural network with an LSTM network; it won first place in the Real-Time Track of the 2020 DNS (Deep Noise Suppression) challenge. However, such single-branch speech enhancement systems introduce a compensation problem between amplitude and phase, which may cause the real and imaginary parts to converge to a locally suboptimal solution and degrade performance in challenging scenarios.
To solve this problem, a target decoupling strategy was proposed that decomposes the original optimization target into several interrelated sub-targets. To this end, two effective network architectures were designed in the time-frequency domain: a multi-stage deep neural network and a dual-path deep neural network. In the former, the network jointly optimizes the output of each stage to gradually enhance speech quality; in the latter, the two paths optimize their respective objectives in parallel and cooperatively reconstruct the enhanced speech spectrum. However, these time-frequency-domain methods ignore the fact that time-domain methods can avoid the amplitude-phase compensation problem altogether. Moreover, the dual-path deep neural network merely fuses the information of its branches through simple interaction, omitting any dynamic adjustment of the information exchanged between branches, which ultimately limits the quality and intelligibility of the enhanced speech.
CN202210885817.6 describes a single-channel speech enhancement method based on a progressive fusion correction network, which uses only the amplitude-spectrum and complex-spectrum features of the time-frequency domain; moreover, its causality cannot be guaranteed, its computational complexity is high and its model has many parameters, making it difficult to deploy in practical terminal systems. The present invention guarantees the causality of the model, has few trainable parameters and can be flexibly applied in a large number of practical scenarios.
CN202210885819.5 describes a single-channel speech enhancement method based on an interactive time-frequency attention mechanism, which uses only the complex-spectrum features of the time-frequency domain and cannot effectively solve the compensation problem between amplitude and phase. In contrast, the present method decouples traditional complex spectrum estimation into a step-by-step optimization of amplitude and phase, thereby alleviating the amplitude-phase compensation problem, avoiding mutual interference between amplitude and phase, and improving speech enhancement performance.
In addition, previous patent applications learn harmonic information from the complex domain or waveform information from the time domain, but do not consider the complex domain, the amplitude domain and the time domain simultaneously; the resulting information loss or amplitude-phase compensation problems limit speech enhancement performance. The present method first performs preliminary denoising on the noisy signal through the amplitude spectrum estimation branch and then adds the spectrum of the preliminarily denoised signal to the residual output by the complex spectrum refinement branch, reconstructing the spectrum of the finally output enhanced speech signal; this strategy effectively improves speech enhancement performance. In the cross-domain information fusion module, multi-scale convolution blocks perform multi-scale feature extraction on the three branches in the complex, amplitude and time domains, enabling more effective amplitude compensation and further improving speech enhancement performance.
Disclosure of Invention
The technical problem solved by the invention: the method overcomes the compensation problem between amplitude and phase caused by traditional complex spectrum estimation, as well as the insufficient use of time-domain waveform information, and provides a single-channel speech enhancement method based on a step-by-step amplitude compensation network.
The invention combines the advantages of the time domain and the time-frequency domain by introducing both time-frequency-domain branches and a time-domain branch into the network, thereby effectively exploiting the harmonic information in the time-frequency spectrum as well as the time-domain waveform information. Through the cross-domain information fusion module, step-by-step amplitude compensation and dynamic information adjustment are performed on the amplitude spectrum estimation branch and the complex spectrum refinement branch at each stage, improving the quality and intelligibility of the speech signal; the enhancement effect is clearly superior to that of common speech enhancement neural networks.
The technical solution of the invention is realized as follows:
In a first aspect, the present invention provides a single-channel speech enhancement method based on a progressive amplitude compensation network, comprising the following steps:
step 1: perform a short-time Fourier transform (STFT) on the noisy speech signal to obtain the complex spectrum, magnitude spectrum and phase of each frame of the noisy speech spectrum;
step 2: the complex spectrum is input into the complex spectrum refinement branch of a three-branch network; the magnitude spectrum is input into the amplitude spectrum estimation branch; after framing, the noisy speech signal is input into the time-domain waveform correction branch;
the amplitude information of the amplitude spectrum estimation branch, the amplitude information of the complex spectrum refinement branch and the time-domain information output by each intermediate layer of the three branches are respectively input into a cross-domain information fusion module;
the cross-domain information fusion module performs feature extraction, fusion and projection on the amplitude of the amplitude spectrum estimation branch, the amplitude of the complex spectrum refinement branch and the time-domain information of the time-domain waveform correction branch, obtains two cross-domain enhancement correction masks for the amplitudes of the two spectral branches, and uses the time-domain information to realize the correction, thereby completing the compensation of the amplitude spectrum estimation branch and the complex spectrum refinement branch;
the cross-domain information fusion module comprises three stages, namely a feature extraction stage, a feature fusion stage and a feature projection stage;
in the feature extraction stage, deep feature extraction is performed on the amplitude information of the amplitude spectrum estimation branch, yielding a feature map for that branch; on the amplitude information of the complex spectrum refinement branch, yielding a feature map for that amplitude information; and on the time-domain information of the time-domain waveform correction branch, yielding a feature map for that branch;
In the feature fusion stage, the feature images of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch are fused to obtain a cross-domain fused feature image;
in a characteristic projection stage, projecting the cross-domain fused characteristic diagram onto the amplitudes of an amplitude spectrum estimation branch and a complex spectrum refinement branch respectively to obtain two cross-domain enhancement correction masks for the amplitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch respectively;
the amplitude information of the amplitude spectrum estimation branch and the complex spectrum refinement branch input into the cross-domain information fusion module are multiplied by the cross-domain enhancement correction mask output by the cross-domain information fusion module of the middle layer respectively, so that the amplitude compensation of the two branches is completed;
introducing a plurality of cross-domain information fusion modules into each intermediate layer of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch, and performing step-by-step amplitude compensation on an input voice signal with noise to form a step-by-step amplitude compensation network;
the final output of the amplitude spectrum estimation branch is used as an estimated ideal ratio mask for the amplitude spectrum, and main noise components are filtered; the final output in the complex spectrum refinement branch is used as the residual error between the primarily denoised voice signal and the enhanced voice signal;
Step 3: multiply the ideal ratio mask output by the amplitude spectrum estimation branch in step 2 element-wise with the magnitude spectrum from step 1, couple the result with the phase from step 1 to form the spectrum of a preliminarily denoised speech signal, add this spectrum to the residual output by the complex spectrum refinement branch in step 2 to reconstruct the spectrum of the finally output enhanced speech signal, and apply the inverse short-time Fourier transform (iSTFT) to this spectrum to obtain the enhanced speech signal.
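The signal flow of steps 1 and 3 (analysis, masking, residual addition, synthesis) can be sketched in NumPy. This is an illustrative sketch only: the window and FFT parameters are assumed, and the two network outputs (the ideal ratio mask and the complex residual) are replaced by trivial placeholders, so the pipeline reduces to identity reconstruction.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Step 1: frame the signal with a Hann window and take the FFT of each frame."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)          # (frames, bins) complex spectrum

def istft(S, n_fft=512, hop=256):
    """Inverse STFT by windowed overlap-add, normalized by the summed window energy."""
    win = np.hanning(n_fft)
    frames = np.fft.irfft(S, n=n_fft, axis=-1) * win
    out = np.zeros(hop * (len(S) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n_fft] += f
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

rng = np.random.default_rng(0)
noisy = rng.standard_normal(4096)

S = stft(noisy)                     # complex spectrum of the noisy signal
mag, phase = np.abs(S), np.angle(S)

# Placeholders for the two network outputs: an ideal-ratio mask in [0, 1]
# and a complex residual spectrum (here: identity mask, zero residual).
irm = np.ones_like(mag)
residual = np.zeros_like(S)

# Step 3: mask the magnitude, couple it with the noisy phase, add the residual.
S_enh = irm * mag * np.exp(1j * phase) + residual
enhanced = istft(S_enh)
```

With the identity mask and zero residual, the interior samples of `enhanced` match the input exactly, which verifies that the STFT/iSTFT pair is a faithful analysis-synthesis frame for the method.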
Further, the amplitude spectrum estimation branch comprises a real-valued convolutional encoder, a real-valued long short-term memory network (LSTM) and a real-valued convolutional decoder. The encoder performs deep feature extraction on the magnitude spectrum of the input noisy speech signal to obtain a feature map carrying deep feature information; the feature map is input into the real-valued LSTM to model temporal dependencies; the reconstructed magnitude spectrum of the enhanced speech signal is then coupled with the phase of the original noisy signal to form a preliminarily enhanced speech signal;
the complex spectrum refinement branch comprises a complex convolutional encoder, a complex LSTM and a complex convolutional decoder. The complex convolutional encoder performs deep feature extraction on the complex spectrum of the input noisy speech signal to obtain a feature map carrying deep feature information; the feature map is input into the complex LSTM to model temporal dependencies; the complex convolutional decoder then recovers the details missing from the complex spectrum of the preliminarily enhanced speech;
the time-domain waveform correction branch comprises a real-valued convolutional encoder, a real-valued LSTM and a real-valued convolutional decoder. The encoder performs deep feature extraction on the input framed time-domain noisy speech waveform to obtain a feature map carrying deep feature information; the feature map is input into the real-valued LSTM to model temporal dependencies and is then decoded by the real-valued convolutional decoder.
Further, the complex convolutional encoder is formed by stacking six convolution blocks, each consisting of a complex convolution layer, a complex batch-normalization layer and a complex parametric ReLU activation function. The complex convolution layer simulates the operation rule of complex multiplication with four real convolution layers: given a complex filter matrix $W = W_r + jW_i$ and a complex input vector $X = X_r + jX_i$, where $W_r$ and $W_i$ are real filter tensors and $X_r$ and $X_i$ are real input tensors used to simulate the complex operation, the output of the complex convolution is

$$F_{out} = (X_r * W_r - X_i * W_i) + j(X_r * W_i + X_i * W_r) \qquad (1)$$
where $F_{out}$ is the output of the complex convolution layer. Similarly, the output $F_{LSTM}$ of the complex LSTM layer is defined as

$$F_{LSTM} = (F_{rr} - F_{ii}) + j(F_{ri} + F_{ir})$$
$$F_{rr} = \mathrm{LSTM}_r(X_r), \quad F_{ii} = \mathrm{LSTM}_i(X_i)$$
$$F_{ri} = \mathrm{LSTM}_i(X_r), \quad F_{ir} = \mathrm{LSTM}_r(X_i) \qquad (2)$$
wherein LSTM represents a traditional LSTM neural network, and subscripts r and i respectively represent a real part and an imaginary part of a corresponding network; the complex convolutional decoder is formed by stacking six deconvolution blocks with corresponding sizes, and residual connection is used between the encoder and the decoder.
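The complex-layer rule of formulas (1) and (2), four real-valued operations simulating one complex-valued one, can be checked numerically. The sketch below is illustrative, not the patent's network: matrix multiplication stands in for the convolution or LSTM operator, and all names are assumptions.

```python
import numpy as np

def complex_op(Xr, Xi, Wr, Wi, op=np.matmul):
    """Simulate a complex-valued layer with four real-valued applications of `op`,
    following F_out = (Xr*Wr - Xi*Wi) + j(Xr*Wi + Xi*Wr)."""
    real = op(Xr, Wr) - op(Xi, Wi)
    imag = op(Xr, Wi) + op(Xi, Wr)
    return real, imag

rng = np.random.default_rng(1)
Xr, Xi = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
Wr, Wi = rng.standard_normal((8, 3)), rng.standard_normal((8, 3))

real, imag = complex_op(Xr, Xi, Wr, Wi)

# Reference: the same computation done natively on complex dtypes.
ref = (Xr + 1j * Xi) @ (Wr + 1j * Wi)
```

Because the identity holds for any linear operator, the same four-real-branch wiring applies unchanged when `op` is a convolution or an LSTM cell, as in formulas (1) and (2).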
Further, the feature-extraction paths of the amplitude spectrum estimation branch and of the complex spectrum refinement branch each consist of a time-frequency-domain multi-scale convolution block: the input passes in parallel through three convolution layers with kernel sizes 3×1, 1×3 and 3×3; their outputs are spliced and fed into a convolution block composed of a 1×1 convolution layer, batch normalization and a Sigmoid activation function. The outputs of the two feature-extraction paths are

$$F_m^i = F_m(\mathrm{concat}(W_1 * M_i + b_1,\; W_2 * M_i + b_2,\; W_3 * M_i + b_3)) \qquad (3)$$
$$F_c^i = F_c(\mathrm{concat}(W_1 * |C_i| + b_1,\; W_2 * |C_i| + b_2,\; W_3 * |C_i| + b_3)) \qquad (4)$$

where $F_m^i$ and $F_c^i$ denote the feature maps obtained by the feature-extraction paths of the amplitude spectrum estimation branch and of the complex spectrum refinement branch, $M_i$ and $|C_i|$ denote the amplitude information received by the $i$-th-layer cross-domain information fusion module from the amplitude spectrum estimation branch and from the complex spectrum refinement branch respectively, $\theta_i$ denotes the phase of the complex spectrum refinement branch, $W_1$, $W_2$ and $W_3$ are the weight matrices of the convolution layers with kernel sizes 3×1, 1×3 and 3×3, $b_1$, $b_2$ and $b_3$ are the corresponding biases, concat denotes splicing along the channel dimension, $F_m$ and $F_c$ are the mapping functions of the convolution blocks of the two branches, and $*$ denotes convolution; the $W_1$, $W_2$, $W_3$ of formulas (3) and (4) differ, as do $b_1$, $b_2$, $b_3$;
the feature-extraction path of the time-domain waveform correction branch consists of a time-domain multi-scale convolution block: the input first passes in parallel through convolution layers with kernel sizes 1×3 and 3×3; their outputs are spliced and fed into a convolution block composed of a 1×1 convolution layer, batch normalization and a Sigmoid activation function; the feature maps obtained by average pooling and maximum pooling of this output are then summed to give the output of the feature-extraction path:

$$X_{out} = F_t(\mathrm{concat}(W_1 * w_i + b_1,\; W_2 * w_i + b_2))$$
$$F_t^i = \mathrm{AvgPool}(X_{out}) + \mathrm{MaxPool}(X_{out})$$

where $X_{out}$ denotes the output of the time-domain multi-scale convolution block, $F_t^i$ denotes the feature map obtained by the feature-extraction path of the time-domain waveform correction branch, $w_i$ denotes the input of the $i$-th-layer cross-domain information fusion module from the time-domain waveform correction branch, $W_1$ and $W_2$ are the convolution-kernel weights and $b_1$ and $b_2$ the biases; $F_t$ is the mapping function of the convolution block of the time-domain waveform correction branch, and AvgPool and MaxPool denote average pooling and maximum pooling respectively;
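The summation of average- and max-pooled views of the same feature map described above can be sketched as follows; the tensor shape and pooling window are assumptions for illustration, not values from the patent.

```python
import numpy as np

def pool(x, k, reduce):
    """Non-overlapping 1-D pooling with window k along the last axis."""
    t = x[..., : (x.shape[-1] // k) * k]          # drop any ragged tail
    return reduce(t.reshape(*t.shape[:-1], -1, k), axis=-1)

rng = np.random.default_rng(2)
x_out = rng.standard_normal((1, 16, 64))          # (batch, channels, time) feature map

# Sum of the average-pooled and max-pooled views of the same feature map,
# mirroring F_t^i = AvgPool(X_out) + MaxPool(X_out).
f_t = pool(x_out, 4, np.mean) + pool(x_out, 4, np.max)
```

Summing the two pooled views lets the path keep both the smoothed energy (average) and the salient peaks (maximum) of the waveform features in one map.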
the feature maps extracted in the feature-extraction stage from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch are sent to the feature-fusion stage: the three feature maps are first multiplied element-wise and then fed into a convolution block consisting of a convolution layer, batch normalization and a Sigmoid activation function, which outputs the cross-domain fused feature map:

$$F_f^i = \sigma(\mathrm{BN}(W * (F_m^i \odot F_c^i \odot F_t^i) + b))$$

where $F_f^i$ denotes the fusion tensor of the $i$-th-layer cross-domain information fusion module, BN denotes batch normalization and $\sigma$ the Sigmoid activation function; $F_m^i$, $F_c^i$ and $F_t^i$ denote the feature maps from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch respectively, $W$ is the convolution-kernel weight and $b$ the bias;
the feature projection stage receives the cross-domain fused feature map Z_i from the feature fusion stage and maps it onto the amplitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch, respectively; each of the two mappings consists of one convolution layer, batch normalization and a Sigmoid activation function, as follows:
Mask_m^i = σ(BN(W_m * Z_i + b_m))
Mask_c^i = σ(BN(W_c * Z_i + b_c))
where Mask_m^i and Mask_c^i denote the cross-domain enhancement correction masks mapped onto the amplitudes of the amplitude spectrum estimation branch and of the complex spectrum refinement branch, respectively, Z_i is the cross-domain fused feature map, W_m and W_c are convolution kernel weights, and b_m and b_c are biases;
the amplitude information of the amplitude spectrum estimation branch and of the complex spectrum refinement branch input into the cross-domain information fusion module is multiplied element-wise by the corresponding cross-domain enhancement correction mask output by the intermediate-layer cross-domain information fusion module, completing the amplitude compensation of the two branches, as follows:
M_i' = M_i ⊙ Mask_m^i
|C_i|' = |C_i| ⊙ Mask_c^i
C_i' = |C_i|' · e^(jθ_i)
where M_i' denotes the output of the amplitude spectrum estimation branch after amplitude compensation, |C_i|' denotes the amplitude output of the complex spectrum refinement branch after amplitude compensation, C_i' denotes the final complex output of the complex spectrum refinement branch after amplitude compensation, θ_i is the phase of the complex spectrum refinement branch, and Mask_m^i and Mask_c^i are the two cross-domain enhancement correction masks.
Further, in the step 3, reconstructing the spectrum of the finally output enhanced speech signal includes:
given the prediction output M̂ of the ideal ratio mask from the amplitude spectrum estimation branch and the residual output Ĉ = Ĉ_r + jĈ_i from the complex spectrum refinement branch, where Ĉ_r and Ĉ_i denote the real and imaginary parts of Ĉ, the final spectrum reconstruction is as follows:
Ŝ = M̂ ⊙ |X| · e^(jθ_X) + (Ĉ_r + jĈ_i)
where Ŝ denotes the complex spectrum of the enhanced speech signal, |X| is the magnitude spectrum of the noisy speech, and θ_X denotes the phase spectrum of the noisy speech signal.
Further, in the step 1, the short-time fourier transform STFT includes:
The noisy speech is resampled so that all audio signals have a sampling rate of 16 kHz; framing is performed with a Hanning window of frame length 30 ms and frame shift 10 ms, and a short-time Fourier transform (STFT) then yields the real part and the imaginary part of each frame of the noisy speech signal spectrum, as follows:
Y(t,f)=S(t,f)+N(t,f) (12)
in the formula ,
Y(t,f)=Y r (t,f)+jY i (t,f)
S(t,f)=S r (t,f)+jS i (t,f)
wherein Y (t, f) represents the noisy speech spectrum after short-time Fourier transform, t represents the time dimension, and f represents the frequency dimension; s (t, f) and N (t, f) represent clean speech and background noise, subscripts r and i represent the real and imaginary parts of the spectrum, respectively, the number of short-time Fourier transform points is 320, and 161 dimensions after transformation correspond to a frequency range from 0 to 8000Hz.
Further, in the step 2, the ideal ratio mask IRM is as follows:
the ideal ratio mask IRM is used as the training target for reconstructing the time-frequency representation of the speech to be enhanced; it is defined as IRM = |S| / |X|, where |X| is the magnitude spectrum of the speech to be enhanced and |S| is the magnitude spectrum of the clean speech signal.
In a second aspect, the present invention provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the single-channel speech enhancement method based on a progressive amplitude compensation network.
In a third aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a single channel speech enhancement method based on a step-wise amplitude compensation network as described above.
Compared with the prior art, the invention has the advantages that:
(1) The invention relates to a single-channel speech enhancement method based on a step-by-step amplitude compensation network, which compensates for the implicit trade-off between amplitude and phase that arises when the real and imaginary parts of speech are reconstructed simultaneously in the complex domain. The invention adopts an encoder-decoder based three-branch structure comprising an amplitude spectrum estimation branch, a complex spectrum refinement branch and a time domain waveform correction branch. On the basis of implicitly estimating phase information, the amplitude spectrum estimation branch filters out the main noise components, the complex spectrum refinement branch restores the missing details, and the time domain waveform correction branch corrects the other two branches. To make full use of the information of the three branches, a cross-domain information fusion module is proposed and embedded into the three branches; it extracts and fuses the features of the three branches step by step and applies correction and amplitude compensation to the information of the amplitude spectrum estimation branch and the complex spectrum refinement branch. The invention can effectively compensate for the implicit trade-off between amplitude and phase, improves the quality and intelligibility of the speech signal, and outperforms the current state-of-the-art cross-domain speech enhancement methods and previous advanced systems.
(2) The invention is based on cross-domain, combines the advantages of time domain and time-frequency domain, introduces the branch based on complex domain, the branch based on amplitude domain and the branch based on time domain into the network, effectively utilizes the harmonic information and the time domain waveform information in the frequency spectrum, can better recover the details of the voice signal, and improves the voice quality and the intelligibility.
(3) The invention decouples the traditional complex spectrum estimation into the step-by-step optimization of the amplitude and the phase, thereby alleviating the compensation problem between the amplitude and the phase, namely alleviating the problem that the accuracy of the amplitude is sacrificed for compensating the phase in the traditional complex spectrum estimation. Thus, the invention can lighten the mutual influence between the amplitude and the phase and improve the performance of voice enhancement.
(4) The invention provides a cross-domain information fusion module which performs step-by-step amplitude compensation on an amplitude spectrum estimation branch and a complex spectrum refinement branch at each stage to generate a hierarchical cross-domain enhancement correction mask, thereby promoting dynamic adjustment of information between the amplitude spectrum estimation branch and the complex spectrum refinement branch, and further improving the performance of voice enhancement by utilizing information from a time domain waveform to perform step-by-step enhancement on the amplitudes of the two branches.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an overall network architecture of the present invention;
FIG. 2 is a specific structure of a cross-domain information fusion module according to the present invention;
FIG. 3 is a specific structure of a time-frequency domain multi-scale convolution block according to the present invention;
fig. 4 is a specific structure of a time domain multi-scale convolution block in the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is evident that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
As shown in fig. 1, the single-channel voice enhancement method based on the step-by-step amplitude compensation network provided by the embodiment of the invention includes the following contents:
Step 1: performing short-time Fourier transform (STFT) on the voice signal with noise to obtain a complex spectrum, a magnitude spectrum and a phase of each frame in a voice signal spectrum with noise;
step 2: the complex spectrum is input into a complex spectrum refining branch in a three-branch network; the amplitude spectrum is input into an amplitude spectrum estimation branch in a three-branch network; inputting the voice signal with noise into a time domain waveform correction branch in a three-branch network after framing;
the amplitude of the amplitude spectrum branch, the amplitude of the complex spectrum refining branch and the time domain information output by each middle layer of the amplitude spectrum estimation branch, the complex spectrum refining branch and the time domain waveform correction branch are respectively input into a cross-domain information fusion module;
the cross-domain information fusion module respectively performs feature extraction, fusion and projection on the amplitude of the amplitude spectrum branch, the amplitude of the complex spectrum refinement branch and the time domain information of the time domain waveform correction branch to obtain two cross-domain enhanced correction masks aiming at the amplitude of the amplitude spectrum branch and the complex spectrum refinement branch, and realizes correction by using the time domain information to complete the compensation of the amplitude spectrum branch and the complex spectrum refinement branch;
the cross-domain information fusion module comprises three stages, namely a feature extraction stage, a feature fusion stage and a feature projection stage;
In the characteristic extraction stage, deep characteristic extraction is carried out on amplitude information of an amplitude spectrum estimation branch, and a characteristic diagram aiming at the amplitude spectrum estimation branch is obtained; deep feature extraction is carried out on the amplitude information of the complex spectrum refinement branch, so that a feature map aiming at the amplitude information of the complex spectrum refinement branch is obtained; deep feature extraction is carried out on time domain information of the time domain waveform branch, and a feature map aiming at the time domain waveform correction branch is obtained;
in the feature fusion stage, the feature images of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch are fused to obtain a cross-domain fused feature image;
in a characteristic projection stage, projecting the cross-domain fused characteristic diagram onto the amplitudes of an amplitude spectrum estimation branch and a complex spectrum refinement branch respectively to obtain two cross-domain enhancement correction masks for the amplitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch respectively;
the amplitude information of the amplitude spectrum estimation branch and the complex spectrum refinement branch input into the cross-domain information fusion module are multiplied by the cross-domain enhancement correction mask output by the cross-domain information fusion module of the middle layer respectively, so that the amplitude compensation of the two branches is completed;
Introducing a plurality of cross-domain information fusion modules into each intermediate layer of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch, and performing step-by-step amplitude compensation on an input voice signal with noise to form a step-by-step amplitude compensation network;
the final output of the amplitude spectrum estimation branch is used as an estimated ideal ratio mask for the amplitude spectrum, and main noise components are filtered; the final output in the complex spectrum refinement branch is used as the residual error between the primarily denoised voice signal and the enhanced voice signal;
step 3: and (3) multiplying the ideal ratio mask outputted by the amplitude spectrum estimation branch in the step (2) with the amplitude spectrum point in the step (1), coupling with the phase in the step (1) to form a preliminary denoising voice signal, adding the spectrum of the preliminary denoising voice signal with the residual outputted by the complex spectrum refinement branch in the step (2), reconstructing the spectrum of the finally outputted enhanced voice signal, and performing short-time Fourier inverse transform (iSTFT) on the spectrum of the enhanced voice signal to obtain the enhanced voice signal.
The amplitude spectrum estimation branch includes: a real convolutional encoder, a real long-short time memory network LSTM and a real convolutional decoder; the real convolutional encoder performs deep feature extraction on the amplitude spectrum of the input noisy speech signal to obtain a feature map carrying deep feature information; the feature map is input into the real long-short time memory network LSTM to model the time dependence, and the amplitude spectrum of the enhanced speech signal is then combined with the phase of the original noisy speech signal to reconstruct a preliminarily enhanced speech signal;
The complex spectrum refinement branch circuit comprises: a complex convolutional encoder, a complex long-short time memory network LSTM and a complex convolutional decoder; the complex convolution encoder carries out depth feature extraction on the complex spectrum of the input noise voice signal to obtain a feature map with depth feature information, the feature map is input into a complex long-short time memory network LSTM, the time dependence relationship is modeled, and the details of the loss of the complex spectrum of the voice after preliminary enhancement are recovered through the complex convolution decoder;
the time domain waveform correction branch circuit includes: the system comprises a real number convolution encoder, a real number long-short time memory network LSTM and a real number convolution decoder; the method comprises the steps that a real number convolution encoder carries out depth feature extraction on an input framing time domain noisy speech waveform to obtain a feature map with depth feature information, the feature map is input into a real number long-short time memory network LSTM, modeling is carried out on a time dependence relation, and then decoding is carried out through a real number convolution decoder;
the complex convolution encoder is formed by stacking six convolution blocks, wherein each convolution block consists of a complex convolution layer, a complex batch normalization layer and a complex parametric ReLU activation function; plural number type The convolution layer is obtained by simulating four convolution layers according to the operation rule of complex multiplication, and a complex form filter matrix W=W is set r +jW i Complex form input vector x=x r +jX i, wherein ,Wr and Wi Is a real tensor filter matrix, X r and Xi Is a real input tensor, the real part is used to simulate complex operation, and then the output of complex convolution operation is expressed as:
F_out = (X_r*W_r - X_i*W_i) + j(X_r*W_i + X_i*W_r) (1)
where F_out is the output of the complex convolution layer; similarly, the output F_LSTM of the complex LSTM layer is defined as:
F_LSTM = (F_rr - F_ii) + j(F_ri + F_ir)
F_rr = LSTM_r(X_r), F_ii = LSTM_i(X_i)
F_ri = LSTM_i(X_r), F_ir = LSTM_r(X_i) (2)
wherein LSTM represents a traditional LSTM neural network, and subscripts r and i respectively represent a real part and an imaginary part of a corresponding network; the complex convolutional decoder is formed by stacking six deconvolution blocks with corresponding sizes, and residual connection is used between the encoder and the decoder.
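A minimal sketch (not part of the patent) of the complex-multiplication rule behind formulas (1) and (2): four real-valued products combine into one complex product. Scalars stand in for the real filter and input tensors, so this only checks the arithmetic, not an actual convolution.

```python
# Verify that the real-valued decomposition of formula (1),
# F_out = (X_r*W_r - X_i*W_i) + j(X_r*W_i + X_i*W_r),
# reproduces ordinary complex multiplication.

def complex_conv_step(x_r, x_i, w_r, w_i):
    """One complex 'convolution' product simulated with four real products."""
    out_r = x_r * w_r - x_i * w_i   # real part of (X_r + jX_i)(W_r + jW_i)
    out_i = x_r * w_i + x_i * w_r   # imaginary part
    return out_r, out_i

x_r, x_i = 0.5, -1.25   # illustrative real/imaginary input entries
w_r, w_i = 2.0, 0.75    # illustrative real/imaginary filter entries

out_r, out_i = complex_conv_step(x_r, x_i, w_r, w_i)
reference = complex(x_r, x_i) * complex(w_r, w_i)
print(out_r, out_i)   # matches reference.real and reference.imag
```

The same decomposition is what formula (2) applies per LSTM sub-network: two real LSTMs each process the real and imaginary inputs, and their outputs are combined with the same sign pattern.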
As shown in fig. 2, the above-mentioned cross-domain information fusion module includes: a feature extraction path for the amplitude spectrum estimation branch, a feature extraction path for the complex spectrum refinement branch, and a feature extraction path for the time domain waveform correction branch.
The characteristic extraction paths of the amplitude spectrum estimation branch and the complex spectrum refinement branch are composed of a time-frequency domain multi-scale convolution block, as shown in fig. 3, the time-frequency domain multi-scale convolution block is firstly subjected to three convolution layers with convolution kernel sizes of 3*1, 1*3 and 3*3 respectively, after the output of the three convolution layers are spliced, the three convolution layers are sent into a convolution block, and the convolution block is composed of a convolution layer with the convolution kernel size of 1*1, batch normalization and Sigmoid activation functions; the outputs of the feature extraction path of the magnitude spectrum estimation branch and the feature extraction path of the complex spectrum refinement branch are expressed as:
wherein , and />Feature graphs obtained by respectively representing feature extraction paths of amplitude spectrum estimation branches and feature extraction paths of complex spectrum refinement branches, M i and |Ci I respectively represents the amplitude information from the amplitude spectrum estimation branch and the amplitude from the complex spectrum refinement branch of the ith layer cross-domain information fusion module, and theta i Representing the phase, W, of a complex spectrum refinement branch 1 、W 2 and W3 Weight matrix representing convolution layers of convolution kernel sizes 3*1, 1*3 and 3*3, b, respectively 1 、b 2 and b3 Respectively represent the deviation, the concat function represents the splicing in the channel dimension, F m and Fc For each mapping function of the convolution blocks of the two branches, representing convolution operations; wherein W of formulas (3) and (4) 1 、W 2 、W 3 Different, the same as b 1 、b 2 、b 3 Different;
the feature extraction path of the time domain waveform correction branch consists of a time domain multi-scale convolution block, as shown in fig. 4; the time domain multi-scale convolution block first passes the input through two convolution layers with kernel sizes 1*3 and 3*3, respectively, the outputs of the two convolution layers are concatenated and fed into a convolution block consisting of a convolution layer with kernel size 1*1, batch normalization and a Sigmoid activation function; the feature maps obtained by applying average pooling and maximum pooling to this output are then summed to give the output of the feature extraction path of the time domain waveform correction branch, as follows:
X_out = F_t(concat(W_1*w_i + b_1, W_2*w_i + b_2))
G_t^i = AvgPool(X_out) + MaxPool(X_out)
where X_out denotes the output of the time domain multi-scale convolution block, G_t^i denotes the feature map obtained by the feature extraction path of the time domain waveform correction branch, w_i denotes the input to the i-th layer cross-domain information fusion module from the time domain waveform correction branch, W_1 and W_2 are the weights of the convolution kernels, and b_1 and b_2 are the biases; F_t is the mapping function of the convolution block of the time domain waveform correction branch, and AvgPool and MaxPool denote average pooling and maximum pooling, respectively;
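An illustrative sketch (not the patent's exact layers or weights) of the time domain multi-scale convolution block of fig. 4: parallel convolutions over the waveform, channel concatenation, a pointwise convolution block with Sigmoid, and a summed average/max pooling output. Plain Python lists stand in for tensors, and the kernels and pooling-over-time are simplifying assumptions.

```python
# Toy version of the time domain multi-scale convolution block:
# two parallel convolutions -> concat -> 1*1 conv + Sigmoid -> AvgPool + MaxPool.
import math

def conv1d(x, kernel, bias):
    """'Same'-padded 1-D convolution of a list with a short kernel."""
    pad = len(kernel) // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * xp[n + j] for j, k in enumerate(kernel)) + bias
            for n in range(len(x))]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def time_multiscale_block(w_in):
    # two parallel convolutions (standing in for the 1*3 and 3*3 kernels)
    y1 = conv1d(w_in, [0.25, 0.5, 0.25], 0.0)
    y2 = conv1d(w_in, [0.1, 0.2, 0.4, 0.2, 0.1], 0.0)
    # channel concat collapsed into a pointwise (1*1) combination + Sigmoid
    x_out = [sigmoid(0.5 * a + 0.5 * b) for a, b in zip(y1, y2)]
    # feature output = AvgPool(X_out) + MaxPool(X_out), pooled over time
    return sum(x_out) / len(x_out) + max(x_out)

feat = time_multiscale_block([0.0, 1.0, -1.0, 0.5, 0.25, -0.5])
print(0.0 < feat < 2.0)   # Sigmoid keeps each activation in (0, 1)
```

Because every activation passes through the Sigmoid, the pooled sum is bounded in (0, 2), which is what the final print checks.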
the feature maps extracted in the feature extraction stage from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch are sent to the feature fusion stage: the three feature maps are first multiplied element-wise and then fed into a convolution block consisting of a convolution layer, batch normalization and a Sigmoid activation function, which finally outputs the cross-domain fused feature map, as follows:
Z_i = σ(BN(W*(G_m^i ⊙ G_c^i ⊙ G_t^i) + b))
where Z_i denotes the fusion tensor of the i-th layer cross-domain information fusion module, BN denotes batch normalization, σ denotes the Sigmoid activation function, and ⊙ denotes element-wise multiplication; G_m^i, G_c^i and G_t^i denote the feature maps from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch, respectively; W is the weight of the convolution kernel and b is the bias;
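A sketch of the feature fusion stage with illustrative parameters: element-wise product of the three branch feature maps, then a convolution (reduced here to a pointwise weight `w` and bias `b`), batch-norm-style affine scaling, and a Sigmoid. The parameter values are placeholders, not trained weights.

```python
# Toy fusion: sigmoid(BN(conv(G_m * G_c * G_t))) per element.
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def fuse(g_m, g_c, g_t, w=1.5, b=-0.25, bn_gamma=1.0, bn_beta=0.0):
    prod = [m * c * t for m, c, t in zip(g_m, g_c, g_t)]  # element-wise product
    return [sigmoid(bn_gamma * (w * p + b) + bn_beta) for p in prod]

g_m = [0.9, 0.2, 0.7]   # feature map from the amplitude spectrum estimation branch
g_c = [0.8, 0.4, 0.6]   # feature map from the complex spectrum refinement branch
g_t = [0.5, 0.9, 0.3]   # feature map from the time domain waveform correction branch

z = fuse(g_m, g_c, g_t)
print(all(0.0 < v < 1.0 for v in z))   # Sigmoid bounds the fused map
```

The element-wise product means a branch with a near-zero response suppresses the fused activation for that element, which is the gating behavior the module relies on.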
the feature projection stage receives the cross-domain fused feature map Z_i from the feature fusion stage and maps it onto the amplitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch, respectively; each of the two mappings consists of one convolution layer, batch normalization and a Sigmoid activation function, as follows:
Mask_m^i = σ(BN(W_m * Z_i + b_m))
Mask_c^i = σ(BN(W_c * Z_i + b_c))
where Mask_m^i and Mask_c^i denote the cross-domain enhancement correction masks mapped onto the amplitudes of the amplitude spectrum estimation branch and of the complex spectrum refinement branch, respectively, Z_i is the cross-domain fused feature map, W_m and W_c are convolution kernel weights, and b_m and b_c are biases;
the amplitude information of the amplitude spectrum estimation branch and of the complex spectrum refinement branch input into the cross-domain information fusion module is multiplied element-wise by the corresponding cross-domain enhancement correction mask output by the intermediate-layer cross-domain information fusion module, completing the amplitude compensation of the two branches, as follows:
M_i' = M_i ⊙ Mask_m^i
|C_i|' = |C_i| ⊙ Mask_c^i
C_i' = |C_i|' · e^(jθ_i)
where M_i' denotes the output of the amplitude spectrum estimation branch after amplitude compensation, |C_i|' denotes the amplitude output of the complex spectrum refinement branch after amplitude compensation, C_i' denotes the final complex output of the complex spectrum refinement branch after amplitude compensation, θ_i is the phase of the complex spectrum refinement branch, and Mask_m^i and Mask_c^i are the two cross-domain enhancement correction masks.
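A sketch of applying the two cross-domain enhancement correction masks: the amplitude of each branch is multiplied element-wise by its mask, and the complex branch is recombined with its unchanged phase. All numeric values are illustrative, not outputs of the network.

```python
# Toy amplitude compensation for two time-frequency bins.
import cmath

mask_m = [0.9, 0.6]             # mask for the amplitude spectrum branch
mask_c = [0.8, 0.7]             # mask for the complex spectrum branch
m_amp  = [1.0, 2.0]             # amplitude entering the fusion module (M_i)
c_amp  = [1.5, 0.5]             # amplitude of the complex branch (|C_i|)
c_phase = [0.0, cmath.pi / 2]   # phase of the complex branch (theta_i)

m_out = [a * k for a, k in zip(m_amp, mask_m)]       # compensated magnitude branch
c_out_amp = [a * k for a, k in zip(c_amp, mask_c)]   # compensated |C_i|
c_out = [a * cmath.exp(1j * th)                      # recombine with the phase
         for a, th in zip(c_out_amp, c_phase)]
print(m_out, [abs(c) for c in c_out])
```

Note that only the magnitudes are modified; the phase of the complex branch passes through untouched, which is what lets the module compensate amplitude without disturbing the implicitly estimated phase.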
The reconstructing the spectrum of the finally output enhanced speech signal includes:
given the prediction output M̂ of the ideal ratio mask from the amplitude spectrum estimation branch and the residual output Ĉ = Ĉ_r + jĈ_i from the complex spectrum refinement branch, where Ĉ_r and Ĉ_i denote the real and imaginary parts of Ĉ, the final spectrum reconstruction is as follows:
Ŝ = M̂ ⊙ |X| · e^(jθ_X) + (Ĉ_r + jĈ_i)
where Ŝ denotes the complex spectrum of the enhanced speech signal, |X| is the magnitude spectrum of the noisy speech, and θ_X denotes the phase spectrum of the noisy speech signal.
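A sketch of the final reconstruction in step 3 for a single time-frequency bin: the estimated ratio mask scales the noisy magnitude, the noisy phase is reattached, and the complex residual from the refinement branch is added. The bin values are illustrative.

```python
# S_hat = mask * |X| * e^{j*theta_X} + (res_r + j*res_i), one bin.
import cmath

def reconstruct_bin(mask, mag_noisy, phase_noisy, res_r, res_i):
    """Combine the masked noisy magnitude, noisy phase, and complex residual."""
    return mask * mag_noisy * cmath.exp(1j * phase_noisy) + complex(res_r, res_i)

# illustrative values for a single bin of the noisy spectrum
s_hat = reconstruct_bin(mask=0.75, mag_noisy=2.0, phase_noisy=0.0,
                        res_r=0.1, res_i=-0.05)
print(s_hat)   # (1.6-0.05j)
```

In the full method this is applied to every bin of the spectrogram and followed by the inverse STFT to recover the enhanced waveform.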
In the above step 1, the short-time fourier transform STFT includes:
the noisy speech is resampled so that all audio signals have a sampling rate of 16 kHz; framing is performed with a Hanning window of frame length 30 ms and frame shift 10 ms, and a short-time Fourier transform (STFT) then yields the real part and the imaginary part of each frame of the noisy speech signal spectrum, as follows:
Y(t,f)=S(t,f)+N(t,f) (12)
in the formula ,
Y(t,f)=Y r (t,f)+jY i (t,f)
S(t,f)=S r (t,f)+jS i (t,f)
wherein Y (t, f) represents the noisy speech spectrum after short-time Fourier transform, t represents the time dimension, and f represents the frequency dimension; s (t, f) and N (t, f) represent clean speech and background noise, subscripts r and i represent the real and imaginary parts of the spectrum, respectively, the number of short-time Fourier transform points is 320, and 161 dimensions after transformation correspond to a frequency range from 0 to 8000Hz.
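A sketch of the front end described above: 16 kHz audio, a 10 ms frame shift (160 samples), a Hanning window, and a 320-point transform giving 161 one-sided frequency bins spanning 0 to 8000 Hz. For simplicity the frame here is exactly 320 samples (the text's 30 ms frame would be 480 samples and be truncated or zero-padded to the transform length), and a plain DFT stands in for the FFT used in practice.

```python
# One STFT frame of a 1 kHz tone: expect 161 bins and a peak at bin 20.
import cmath, math

SR, N_FFT, HOP = 16000, 320, 160   # 10 ms shift = 160 samples at 16 kHz

def hann(n):
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def stft_frame(frame):
    """Windowed DFT of one frame; keep bins 0..N_FFT//2 (one-sided spectrum)."""
    win = hann(len(frame))
    x = [s * w for s, w in zip(frame, win)]
    return [sum(x[n] * cmath.exp(-2j * math.pi * f * n / N_FFT)
                for n in range(N_FFT)) for f in range(N_FFT // 2 + 1)]

tone = [math.sin(2 * math.pi * 1000 * n / SR) for n in range(N_FFT)]  # 1 kHz tone
spec = stft_frame(tone)
n_bins = len(spec)
peak_bin = max(range(n_bins), key=lambda f: abs(spec[f]))
print(n_bins, peak_bin * SR / N_FFT)   # 161 bins; peak at 1000.0 Hz
```

Each complex bin of `spec` is exactly the Y_r(t, f) + jY_i(t, f) decomposition used in the formulas above.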
In the above step 2, the ideal ratio mask IRM includes:
the ideal ratio mask IRM is used as the training target for reconstructing the time-frequency representation of the speech to be enhanced; it is defined as IRM = |S| / |X|, where |X| is the magnitude spectrum of the speech to be enhanced and |S| is the magnitude spectrum of the clean speech signal.
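A hedged sketch of a magnitude-ratio mask in the form implied above, IRM = |S| / |X| per time-frequency bin; the clipping to [0, 1] and the small floor `eps` are assumptions for numerical safety, and exact IRM formulations vary between systems.

```python
# Magnitude-ratio mask per bin, clipped to [0, 1].
def ideal_ratio_mask(clean_mag, noisy_mag, eps=1e-8):
    return [min(1.0, s / max(x, eps)) for s, x in zip(clean_mag, noisy_mag)]

clean = [0.0, 1.0, 2.0]   # |S|: magnitude spectrum of the clean speech
noisy = [0.5, 2.0, 2.0]   # |X|: magnitude spectrum of the noisy speech
print(ideal_ratio_mask(clean, noisy))   # [0.0, 0.5, 1.0]
```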
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smart phone, etc.) comprising a memory storing a computer program configured to be executed by the processor, and a processor, the computer program comprising instructions for performing the steps in the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be as defined in the claims.
Claims (9)
1. A single-channel speech enhancement method based on a step-by-step amplitude compensation network, comprising:
step 1: performing short-time Fourier transform (STFT) on the voice signal with noise to obtain a complex spectrum, a magnitude spectrum and a phase of each frame in a voice signal spectrum with noise;
step 2: the complex spectrum is input into a complex spectrum refining branch in a three-branch network; the amplitude spectrum is input into an amplitude spectrum estimation branch in a three-branch network; inputting the voice signal with noise into a time domain waveform correction branch in a three-branch network after framing;
the amplitude of the amplitude spectrum branch, the amplitude of the complex spectrum refining branch and the time domain information output by each middle layer of the amplitude spectrum estimation branch, the complex spectrum refining branch and the time domain waveform correction branch are respectively input into a cross-domain information fusion module;
The cross-domain information fusion module respectively performs feature extraction, fusion and projection on the amplitude of the amplitude spectrum branch, the amplitude of the complex spectrum refinement branch and the time domain information of the time domain waveform correction branch to obtain two cross-domain enhanced correction masks aiming at the amplitude of the amplitude spectrum branch and the complex spectrum refinement branch, and realizes correction by using the time domain information to complete the compensation of the amplitude spectrum branch and the complex spectrum refinement branch;
the cross-domain information fusion module comprises three stages, namely a feature extraction stage, a feature fusion stage and a feature projection stage;
in the characteristic extraction stage, deep characteristic extraction is carried out on amplitude information of an amplitude spectrum estimation branch, and a characteristic diagram aiming at the amplitude spectrum estimation branch is obtained; deep feature extraction is carried out on the amplitude information of the complex spectrum refinement branch, so that a feature map aiming at the amplitude information of the complex spectrum refinement branch is obtained; deep feature extraction is carried out on time domain information of the time domain waveform branch, and a feature map aiming at the time domain waveform correction branch is obtained;
in the feature fusion stage, the feature images of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch are fused to obtain a cross-domain fused feature image;
In a characteristic projection stage, projecting the cross-domain fused characteristic diagram onto the amplitudes of an amplitude spectrum estimation branch and a complex spectrum refinement branch respectively to obtain two cross-domain enhancement correction masks for the amplitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch respectively;
the amplitude information of the amplitude spectrum estimation branch and the complex spectrum refinement branch input into the cross-domain information fusion module are multiplied by the cross-domain enhancement correction mask output by the cross-domain information fusion module of the middle layer respectively, so that the amplitude compensation of the two branches is completed;
introducing a cross-domain information fusion module into each intermediate layer of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch, and performing step-by-step amplitude compensation on an input voice signal with noise to form a step-by-step amplitude compensation network;
the final output of the amplitude spectrum estimation branch is used as an estimated ideal ratio mask for the amplitude spectrum, and main noise components are filtered; the final output in the complex spectrum refinement branch is used as the residual error between the primarily denoised voice signal and the enhanced voice signal;
step 3: and (3) multiplying the ideal ratio mask outputted by the amplitude spectrum estimation branch in the step (2) with the amplitude spectrum point in the step (1), coupling with the phase in the step (1) to form a preliminary denoising voice signal, adding the spectrum of the preliminary denoising voice signal with the residual outputted by the complex spectrum refinement branch in the step (2), reconstructing the spectrum of the finally outputted enhanced voice signal, and performing short-time Fourier inverse transform (iSTFT) on the spectrum of the enhanced voice signal to obtain the enhanced voice signal.
2. The single-channel speech enhancement method based on a progressive amplitude compensation network of claim 1, wherein: the amplitude spectrum estimation branch comprises a real convolutional encoder, a real long-short time memory network LSTM and a real convolutional decoder; the real convolutional encoder performs deep feature extraction on the amplitude spectrum of the input noisy speech signal to obtain a feature map carrying deep feature information; the feature map is input into the real long-short time memory network LSTM to model the time dependence, and the amplitude spectrum of the enhanced speech signal is then combined with the phase of the original noisy speech signal to reconstruct a preliminarily enhanced speech signal;
the complex spectrum refinement branch comprises a complex convolution encoder, a complex long-short time memory network LSTM and a complex convolution decoder; the complex convolution encoder carries out depth feature extraction on the complex spectrum of the input noise voice signal to obtain a feature map with depth feature information, the feature map is input into a complex long-short time memory network LSTM, the time dependence relationship is modeled, and the details of the loss of the complex spectrum of the voice after preliminary enhancement are recovered through the complex convolution decoder;
the time-domain waveform correction branch comprises a real-valued convolutional encoder, a real-valued LSTM network and a real-valued convolutional decoder; the real-valued convolutional encoder performs deep feature extraction on the input framed time-domain noisy speech waveform to obtain a feature map carrying deep feature information, the feature map is fed into the real-valued LSTM network to model temporal dependencies, and the result is then decoded by the real-valued convolutional decoder.
3. The single-channel speech enhancement method based on a progressive amplitude compensation network of claim 2, wherein: the complex convolutional encoder is formed by stacking six convolution blocks, each consisting of a complex convolution layer, a complex batch normalization layer and a complex parametric ReLU activation function; the complex convolution layer simulates complex multiplication with four real convolution layers: given a complex filter matrix W = W_r + jW_i and a complex input vector X = X_r + jX_i, where W_r and W_i are real filter tensors and X_r and X_i are real input tensors, the complex operation is simulated with real parts, and the output of the complex convolution operation is expressed as:
F_out = (X_r * W_r - X_i * W_i) + j(X_r * W_i + X_i * W_r)   (1)
where F_out is the output of the complex convolution layer; similarly, the output F_LSTM of the complex LSTM layer is defined as:
F_LSTM = (F_rr - F_ii) + j(F_ri + F_ir)

F_rr = LSTM_r(X_r), F_ii = LSTM_i(X_i)

F_ri = LSTM_i(X_r), F_ir = LSTM_r(X_i)   (2)
where LSTM denotes an LSTM neural network, and the subscripts r and i denote the real-part and imaginary-part sub-networks, respectively; the complex convolutional decoder is formed by stacking six deconvolution blocks of corresponding sizes, and residual connections are used between the encoder and the decoder.
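As an illustration, the four-real-convolution decomposition of formula (1) can be sketched with 1-D numpy convolutions and checked against a direct complex convolution (a minimal sketch with hypothetical array sizes, not the patent's 2-D network layers):

```python
import numpy as np

def complex_conv1d(x_r, x_i, w_r, w_i):
    # Eq. (1): F_out = (X_r*W_r - X_i*W_i) + j(X_r*W_i + X_i*W_r),
    # built from four real-valued convolutions
    real = np.convolve(x_r, w_r) - np.convolve(x_i, w_i)
    imag = np.convolve(x_r, w_i) + np.convolve(x_i, w_r)
    return real + 1j * imag

rng = np.random.default_rng(0)
x_r, x_i = rng.standard_normal(16), rng.standard_normal(16)  # input real/imag parts
w_r, w_i = rng.standard_normal(3), rng.standard_normal(3)    # filter real/imag parts

out = complex_conv1d(x_r, x_i, w_r, w_i)
ref = np.convolve(x_r + 1j * x_i, w_r + 1j * w_i)  # direct complex convolution
assert np.allclose(out, ref)
```

The same decomposition underlies the complex LSTM of formula (2): each complex layer is realized as two real-valued sub-networks combined by the rules of complex multiplication.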
4. The single-channel speech enhancement method based on a progressive amplitude compensation network of claim 1, wherein: the feature extraction paths of the amplitude spectrum estimation branch and the complex spectrum refinement branch each consist of a time-frequency-domain multi-scale convolution block; the time-frequency-domain multi-scale convolution block passes the input through three convolution layers with kernel sizes 3×1, 1×3 and 3×3, respectively, concatenates their outputs and feeds the result into a convolution block consisting of a convolution layer with kernel size 1×1, batch normalization and a Sigmoid activation function; the outputs of the feature extraction paths of the amplitude spectrum estimation branch and the complex spectrum refinement branch are expressed as:

X_m^i = F_m(concat(W_1 * M_i + b_1, W_2 * M_i + b_2, W_3 * M_i + b_3))   (3)

X_c^i = F_c(concat(W_1 * |C_i| + b_1, W_2 * |C_i| + b_2, W_3 * |C_i| + b_3))   (4)

where X_m^i and X_c^i denote the feature maps obtained by the feature extraction paths of the amplitude spectrum estimation branch and the complex spectrum refinement branch, respectively; M_i and |C_i| denote the amplitude from the amplitude spectrum estimation branch and the amplitude from the complex spectrum refinement branch input to the i-th cross-domain information fusion module, and θ_i denotes the phase of the complex spectrum refinement branch; W_1, W_2 and W_3 denote the weight matrices of the convolution layers with kernel sizes 3×1, 1×3 and 3×3, and b_1, b_2 and b_3 the corresponding biases; the concat function denotes concatenation along the channel dimension; F_m and F_c are the mapping functions of the convolution blocks of the two branches, each denoting a convolution operation; W_1, W_2 and W_3 in formulas (3) and (4) are not shared between the two formulas, and likewise b_1, b_2 and b_3;
the characteristic extraction path of the time domain waveform correction branch consists of a time domain multi-scale convolution block, wherein the time domain multi-scale convolution block firstly passes through three convolution layers with convolution kernel sizes of 1*3 and 3*3 respectively, and after the output of the three convolution layers are spliced, the three convolution layers are fed into a convolution block, and the convolution block consists of a convolution layer with a convolution kernel size of 1*1, batch normalization and Sigmoid activation functions; and outputting a characteristic diagram obtained by respectively carrying out average pooling and maximum pooling on the output, and summing the output to be used as a characteristic extraction path of a time domain waveform correction branch, wherein the characteristic diagram is as follows:
X_out = F_t(concat(W_1 * w_i + b_1, W_2 * w_i + b_2))

X_t^i = AvgPool(X_out) + MaxPool(X_out)

where X_out denotes the output of the time-domain multi-scale convolution block, X_t^i denotes the feature map obtained by the feature extraction path of the time-domain waveform correction branch, w_i denotes the input of the i-th cross-domain information fusion module from the time-domain waveform correction branch, W_1 and W_2 are the weights of the convolution kernels, and b_1 and b_2 are the biases; F_t is the mapping function of the convolution block of the time-domain waveform correction branch, and AvgPool and MaxPool denote average pooling and maximum pooling, respectively;
the feature maps extracted in the feature extraction stage from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch are sent to the feature fusion stage: the three feature maps are first multiplied element-wise and then fed into a convolution block consisting of a convolution layer, batch normalization and a Sigmoid activation function, which finally outputs the cross-domain fused feature map, as follows:

X_f^i = σ(BN(W * (X_m^i ⊙ X_c^i ⊙ X_t^i) + b))

where X_f^i denotes the fusion tensor of the i-th cross-domain information fusion module, BN denotes batch normalization, σ denotes the Sigmoid activation function, and ⊙ denotes element-wise multiplication; X_m^i, X_c^i and X_t^i denote the feature maps from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch, respectively; W is the weight of the convolution kernel and b is the bias;
the feature mapping stage receives the cross-domain fused feature map from the feature fusion stage and maps it to the amplitude of the amplitude spectrum estimation branch and of the complex spectrum refinement branch, respectively; the mapping to the amplitude spectrum estimation branch and the mapping to the complex spectrum refinement branch each consist of one convolution layer, batch normalization and a Sigmoid activation function, as follows:
G_m^i = σ(BN(W_m * X_f^i + b_m))

G_c^i = σ(BN(W_c * X_f^i + b_c))

where G_m^i and G_c^i denote the cross-domain enhancement correction masks mapped to the amplitude spectrum estimation branch and the complex spectrum refinement branch, respectively, and X_f^i is the cross-domain fused feature map;
the amplitude information of the amplitude spectrum estimation branch and of the complex spectrum refinement branch input to the cross-domain information fusion module is multiplied by the cross-domain enhancement correction masks output by the cross-domain information fusion module of the intermediate layer, completing the amplitude compensation of the two branches, as follows:
M̂_i = M_i ⊙ G_m^i

|Ĉ_i| = |C_i| ⊙ G_c^i

Ĉ_i = |Ĉ_i| e^{jθ_i}

where M̂_i denotes the output of the amplitude spectrum estimation branch after amplitude compensation, |Ĉ_i| denotes the amplitude output of the complex spectrum refinement branch after amplitude compensation, and Ĉ_i denotes the final complex output of the complex spectrum refinement branch after amplitude compensation; M_i and |C_i| are the amplitudes input to the i-th cross-domain information fusion module, G_m^i and G_c^i are the cross-domain enhancement correction masks, and θ_i is the phase of the complex spectrum refinement branch.
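The fuse-then-mask flow of the cross-domain information fusion module can be sketched in numpy as follows; a bare sigmoid stands in for the convolution and batch-normalization layers, and all shapes and variable names are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# hypothetical feature maps from the three branches (channels x time x freq)
X_m = rng.random((8, 10, 161))   # amplitude spectrum estimation branch
X_c = rng.random((8, 10, 161))   # complex spectrum refinement branch
X_t = rng.random((8, 10, 161))   # time-domain waveform correction branch

# feature fusion: element-wise product, then (here) a sigmoid in place of
# the 1x1 convolution + batch normalization of the patent's conv block
X_f = sigmoid(X_m * X_c * X_t)

# feature mapping: one correction mask in (0, 1) per branch (conv layers omitted)
mask_m = sigmoid(X_f)
mask_c = sigmoid(X_f)

# amplitude compensation: element-wise multiplication with the branch amplitude
M_i = rng.random((8, 10, 161))   # amplitude input of the i-th module
M_hat = M_i * mask_m             # compensated amplitude, never exceeds M_i
assert M_hat.shape == M_i.shape and (M_hat <= M_i).all()
```

Because the mask lies in (0, 1), the compensation step can only attenuate the branch amplitudes element-wise, which matches the multiplicative correction described above.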
5. The single-channel speech enhancement method based on a progressive amplitude compensation network of claim 1, wherein in the step 3, reconstructing the spectrum of the finally output enhanced speech signal comprises:

given the ideal-ratio-mask prediction M̂ output by the amplitude spectrum estimation branch and the residual R = R_r + jR_i output by the complex spectrum refinement branch, where R_r and R_i denote the real and imaginary parts of the residual, the final spectrum reconstruction is:

Ŝ = M̂ ⊙ |X| e^{jθ_X} + R

where Ŝ denotes the complex spectrum of the enhanced speech signal, |X| is the amplitude spectrum of the speech to be enhanced (the noisy speech), and θ_X = ∠X denotes the phase spectrum of the noisy speech signal.
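The reconstruction step can be sketched in numpy on stand-in spectra; all shapes, names and values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
T, F = 10, 161                                    # frames x frequency bins
mag_X = rng.random((T, F)) + 0.1                  # noisy magnitude spectrum |X|
theta_X = rng.uniform(-np.pi, np.pi, (T, F))      # noisy phase spectrum
irm = rng.random((T, F))                          # predicted ideal ratio mask
residual = (rng.standard_normal((T, F))
            + 1j * rng.standard_normal((T, F))) * 0.01  # complex-branch residual

# preliminary denoised spectrum: masked magnitude coupled with the noisy phase
prelim = irm * mag_X * np.exp(1j * theta_X)
# final enhanced spectrum: preliminary spectrum plus the refined residual
S_hat = prelim + residual
assert S_hat.shape == (T, F) and np.iscomplexobj(S_hat)
```

The time-domain enhanced signal would then follow from an inverse STFT of `S_hat`, as stated in step 3.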
6. The single-channel speech enhancement method based on a progressive amplitude compensation network of claim 1, wherein: in the step 1, the short-time Fourier transform (STFT) comprises:

the noisy speech is resampled so that all audio signals have a sampling rate of 16 kHz, framed with a Hanning window of frame length 30 ms and frame shift 10 ms, and then the short-time Fourier transform (STFT) is applied to obtain the real and imaginary parts of each frame of the noisy speech spectrum, as follows:
Y(t, f) = S(t, f) + N(t, f)   (12)

where

Y(t, f) = Y_r(t, f) + jY_i(t, f)

S(t, f) = S_r(t, f) + jS_i(t, f)
where Y(t, f) denotes the noisy speech spectrum after the short-time Fourier transform, t denotes the time dimension and f the frequency dimension; S(t, f) and N(t, f) denote clean speech and background noise, and the subscripts r and i denote the real and imaginary parts of the spectrum, respectively; the number of short-time Fourier transform points is 320, and the 161 dimensions after transformation correspond to the frequency range 0 to 8000 Hz.
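The bin count can be checked with a minimal numpy framing sketch; the claim pairs a 30 ms Hanning window with a 320-point transform without fixing the windowing detail, so this sketch assumes, as a simplification, a window equal to the FFT length:

```python
import numpy as np

fs = 16000                  # sampling rate (16 kHz, per the claim)
hop = int(0.010 * fs)       # 10 ms frame shift -> 160 samples
n_fft = 320                 # 320-point STFT -> 161 frequency bins

x = np.random.default_rng(0).standard_normal(fs)  # 1 s of noise as a stand-in signal

# frame the signal and apply a Hanning window of the FFT length
win = np.hanning(n_fft)
n_frames = 1 + (len(x) - n_fft) // hop
frames = np.stack([x[i * hop : i * hop + n_fft] * win for i in range(n_frames)])

Y = np.fft.rfft(frames, n=n_fft, axis=1)          # complex spectrum Y(t, f)
freqs = np.arange(n_fft // 2 + 1) * (fs / n_fft)  # bin center frequencies

assert Y.shape[1] == 161 and freqs[-1] == 8000.0
```

A one-sided 320-point transform yields 320/2 + 1 = 161 bins spanning 0 to 8000 Hz, matching the dimensions stated in the claim.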
7. The single-channel speech enhancement method based on a progressive amplitude compensation network of claim 1, wherein: in the step 2, the ideal ratio mask IRM is as follows:

the ideal ratio mask IRM, used as the training target for reconstructing the time-frequency representation of the speech to be enhanced, is defined as

IRM = |S| / |X|

where |X| is the magnitude spectrum of the speech to be enhanced (the noisy speech) and |S| is the magnitude spectrum of the clean speech signal.
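A minimal numpy sketch of the mask, assuming the ratio definition IRM = |S|/|X|; the clipping to [0, 1] is an added numerical safeguard, not part of the claim:

```python
import numpy as np

def ideal_ratio_mask(mag_clean, mag_noisy, eps=1e-8):
    # ratio of clean to noisy magnitude, clipped to [0, 1] for stability
    return np.clip(mag_clean / np.maximum(mag_noisy, eps), 0.0, 1.0)

# toy magnitude spectra (hypothetical values)
mag_s = np.array([0.5, 1.0, 0.0])   # clean speech |S|
mag_x = np.array([1.0, 1.0, 2.0])   # noisy speech |X|

m = ideal_ratio_mask(mag_s, mag_x)
assert np.allclose(m, [0.5, 1.0, 0.0])
assert np.allclose(m * mag_x, mag_s)  # masking the noisy magnitude recovers |S|
```

In training, this mask serves as the regression target; at inference, the predicted mask multiplies the noisy magnitude spectrum, as in step 3.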
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that: the processor implements the method of any of claims 1-7 when executing the program.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a processor, implements the method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310969308.6A CN116913303A (en) | 2023-08-01 | 2023-08-01 | Single-channel voice enhancement method based on step-by-step amplitude compensation network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116913303A true CN116913303A (en) | 2023-10-20 |
Family
ID=88364795
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310969308.6A Pending CN116913303A (en) | 2023-08-01 | 2023-08-01 | Single-channel voice enhancement method based on step-by-step amplitude compensation network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116913303A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||