CN116913303A - Single-channel speech enhancement method based on step-by-step amplitude compensation network - Google Patents

Single-channel speech enhancement method based on step-by-step amplitude compensation network

Info

Publication number
CN116913303A
Authority
CN
China
Prior art keywords
branch
amplitude
spectrum
complex
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310969308.6A
Other languages
Chinese (zh)
Inventor
叶中付
陈雯卓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202310969308.6A
Publication of CN116913303A
Legal status: Pending


Classifications

    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L 21/0224 Noise filtering: processing in the time domain
    • G10L 21/0232 Noise filtering: processing in the frequency domain
    • G10L 21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L 25/18 The extracted parameters being spectral information of each sub-band
    • G10L 25/30 Analysis technique using neural networks
    • G10L 2021/02163 Only one microphone
    • G06F 18/2131 Feature extraction based on a transform domain processing, e.g. wavelet transform
    • G06F 18/253 Fusion techniques of extracted features
    • G06F 2123/02 Data types in the time domain, e.g. time-series data
    • G06F 2218/10 Feature extraction by analysing the shape of a waveform, e.g. extracting parameters relating to peaks
    • G06N 3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/048 Activation functions
    • Y02D 30/70 Reducing energy consumption in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Complex Calculations (AREA)
  • Spectroscopy & Molecular Physics (AREA)

Abstract

The invention relates to a single-channel speech enhancement method based on a step-by-step amplitude compensation network. The method adopts an encoder-decoder based three-branch structure consisting of an amplitude spectrum estimation branch, a complex spectrum refinement branch and a time-domain waveform correction branch. The amplitude spectrum estimation branch filters out the main noise components, the complex spectrum refinement branch fills in the missing details, and, on the basis of implicitly estimating the phase information, the time-domain waveform is used to correct the two branches. To make full use of the information of the three branches, a cross-domain information fusion module is proposed and embedded into the three branches; it extracts and fuses the features of the three branches step by step, and corrects and amplitude-compensates the information of the amplitude spectrum estimation branch and the complex spectrum refinement branch. The invention effectively remedies the implicit compensation effect between amplitude and phase, improves the quality and intelligibility of the speech signal, and outperforms current state-of-the-art cross-domain speech enhancement methods and previous advanced systems.

Description

Single-channel speech enhancement method based on step-by-step amplitude compensation network
Technical Field
The invention relates to the field of speech enhancement, and in particular to a single-channel speech enhancement method based on a step-by-step amplitude compensation network.
Background
Single-channel speech enhancement refers to removing background noise from a noisy speech signal captured by a single microphone. Because no speech signals from other microphones are available as references, single-channel speech enhancement is a very challenging task. In recent years, speech enhancement methods based on deep learning have shown outstanding performance in this field; in particular, when handling non-stationary noise and low signal-to-noise ratios, deep learning methods clearly outperform traditional single-channel speech enhancement algorithms. Convolutional neural networks and recurrent neural networks are the two most commonly used architectures for speech enhancement.
In 2020, a deep complex neural network was proposed that combines a complex-valued convolutional neural network with an LSTM network; it won first place in the Real-Time Track (RT) of the 2020 DNS (Deep Noise Suppression) challenge. However, such single-branch speech enhancement systems introduce a compensation problem between amplitude and phase, which may cause the real and imaginary parts to converge to a locally suboptimal solution and degrades performance in challenging scenarios.
To solve this problem, a target decoupling strategy has been proposed that decomposes the original optimization target into several interrelated sub-targets. To this end, two effective network architectures have been designed in the time-frequency domain: a multi-stage deep neural network and a dual-path deep neural network. In the former, the network jointly optimizes the output of each stage to gradually enhance speech quality. In the latter, the two paths of the network optimize their respective objectives in parallel and cooperatively reconstruct the enhanced speech spectrum. However, these time-frequency-domain methods ignore the fact that time-domain methods can avoid the amplitude-phase compensation problem. Moreover, the dual-path deep neural network only performs a simple interactive fusion of the information of each branch and omits a dynamic adjustment of the information between branches, which ultimately limits the quality and intelligibility of the enhanced speech.
CN202210885817.6 is a single-channel speech enhancement method based on a progressive fusion correction network; it only uses the magnitude spectrum features and complex spectrum features of the time-frequency domain for speech enhancement, its causality cannot be guaranteed, its computational complexity is high, and its model has many parameters, making it difficult to deploy in practical terminal systems. The present invention guarantees the causality of the model, has few trainable parameters, and can be flexibly applied to a large number of practical scenarios.
CN202210885819.5 is a single-channel speech enhancement method based on an interactive time-frequency attention mechanism; it only uses the complex spectrum features of the time-frequency domain and cannot effectively solve the compensation problem between amplitude and phase. In contrast, the present invention decouples traditional complex spectrum estimation into a step-by-step optimization of amplitude and phase, thereby alleviating the compensation problem between amplitude and phase, avoiding their mutual influence, and improving speech enhancement performance.
In addition, previous patent applications learn harmonic information from the complex domain or waveform information from the time domain, but do not consider information from the complex domain, the amplitude domain and the time domain simultaneously; this causes information loss or amplitude-phase compensation problems and limits speech enhancement performance. The present invention performs preliminary denoising of the noisy signal through the amplitude spectrum estimation branch, adds the residual output by the complex spectrum refinement branch, and reconstructs the spectrum of the finally output enhanced speech signal; this strategy effectively improves speech enhancement performance. In the cross-domain information fusion module, multi-scale convolution blocks perform multi-scale feature extraction on the three branches from the complex domain, the amplitude domain and the time domain, so that more effective amplitude compensation can be completed and speech enhancement performance further improved.
Disclosure of Invention
The technical problem solved by the invention: the method addresses the compensation problem between amplitude and phase caused by traditional complex spectrum estimation and the insufficient use of time-domain waveform information, and provides a single-channel speech enhancement method based on a step-by-step amplitude compensation network.
The invention combines the advantages of the time domain and the time-frequency domain by introducing time-frequency-domain branches and a time-domain branch into the network, effectively using the harmonic information in the time-frequency spectrum and the time-domain waveform information. Through a cross-domain information fusion module, step-by-step amplitude compensation and dynamic information adjustment are performed on the amplitude spectrum estimation branch and the complex spectrum refinement branch at each stage, improving the quality and intelligibility of the speech signal; compared with common speech enhancement neural networks, the enhancement effect has obvious advantages.
The technical solution of the invention is realized as follows:
In a first aspect, the invention provides a single-channel speech enhancement method based on a step-by-step amplitude compensation network, including the following:
Step 1: performing a short-time Fourier transform (STFT) on the noisy speech signal to obtain the complex spectrum, magnitude spectrum and phase of each frame of the noisy speech spectrum;
Step 2: inputting the complex spectrum into the complex spectrum refinement branch of a three-branch network; inputting the magnitude spectrum into the amplitude spectrum estimation branch of the three-branch network; and inputting the framed noisy speech signal into the time-domain waveform correction branch of the three-branch network;
The magnitude features of the amplitude spectrum estimation branch, the magnitude of the complex spectrum refinement branch features, and the time-domain features output by each intermediate layer of the amplitude spectrum estimation branch, complex spectrum refinement branch and time-domain waveform correction branch are input into a cross-domain information fusion module;
the cross-domain information fusion module performs feature extraction, fusion and projection on the magnitude features of the amplitude spectrum estimation branch, the magnitude of the complex spectrum refinement branch features and the time-domain features of the time-domain waveform correction branch, obtains two cross-domain enhancement correction masks for the magnitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch, and uses the time-domain information to realize the correction, completing the compensation of the amplitude spectrum estimation branch and the complex spectrum refinement branch;
the cross-domain information fusion module comprises three stages: a feature extraction stage, a feature fusion stage and a feature projection stage;
in the feature extraction stage, deep feature extraction is performed on the magnitude information of the amplitude spectrum estimation branch to obtain a feature map for the amplitude spectrum estimation branch; deep feature extraction is performed on the magnitude information of the complex spectrum refinement branch to obtain a feature map for the magnitude information of the complex spectrum refinement branch; and deep feature extraction is performed on the time-domain information of the time-domain waveform correction branch to obtain a feature map for the time-domain waveform correction branch;
in the feature fusion stage, the feature maps of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch are fused to obtain a cross-domain fused feature map;
in the feature projection stage, the cross-domain fused feature map is projected onto the magnitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch, respectively, to obtain two cross-domain enhancement correction masks for the magnitudes of the two branches;
the magnitude information of the amplitude spectrum estimation branch and of the complex spectrum refinement branch input into the cross-domain information fusion module is multiplied by the corresponding cross-domain enhancement correction mask output by the fusion module of that intermediate layer, completing the amplitude compensation of the two branches;
a cross-domain information fusion module is introduced at each intermediate layer of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch, and step-by-step amplitude compensation is performed on the input noisy speech signal, forming the step-by-step amplitude compensation network;
the final output of the amplitude spectrum estimation branch serves as the estimated ideal ratio mask for the magnitude spectrum and filters out the main noise components; the final output of the complex spectrum refinement branch serves as the residual between the preliminarily denoised speech signal and the enhanced speech signal;
Step 3: multiplying the ideal ratio mask output by the amplitude spectrum estimation branch in step 2 element-wise with the magnitude spectrum from step 1, coupling the result with the phase from step 1 to form the preliminarily denoised speech signal, adding its spectrum to the residual output by the complex spectrum refinement branch in step 2 to reconstruct the spectrum of the finally output enhanced speech signal, and performing an inverse short-time Fourier transform (iSTFT) on the spectrum of the enhanced speech signal to obtain the enhanced speech signal.
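For illustration only, the flow of steps 1-3 can be sketched in PyTorch-style code as follows; the two branch outputs are replaced by placeholder tensors rather than the three-branch network of the invention, and the STFT parameters are illustrative assumptions:

```python
import torch

sr, n_fft, hop, win = 16000, 320, 160, 320          # illustrative STFT setup (161 bins)
noisy = torch.randn(1, sr)                           # 1 s stand-in for the noisy speech
window = torch.hann_window(win)

# Step 1: STFT -> complex spectrum, magnitude spectrum and phase of each frame
spec = torch.stft(noisy, n_fft=n_fft, hop_length=hop, win_length=win,
                  window=window, return_complex=True)          # (1, 161, frames)
mag, phase = spec.abs(), spec.angle()

# Step 2: three-branch network (placeholders here; in the invention these are the
# amplitude spectrum estimation, complex spectrum refinement and time-domain
# waveform correction branches coupled by cross-domain information fusion modules)
irm_hat = torch.rand_like(mag)                       # ideal-ratio-mask estimate (magnitude branch)
residual = torch.zeros_like(spec)                    # complex residual (refinement branch)

# Step 3: mask * magnitude coupled with the noisy phase, plus the complex residual,
# then inverse STFT back to the time domain
prelim = torch.polar(irm_hat * mag, phase)           # preliminarily denoised spectrum
enhanced_spec = prelim + residual
enhanced = torch.istft(enhanced_spec, n_fft=n_fft, hop_length=hop, win_length=win,
                       window=window, length=noisy.shape[-1])
print(enhanced.shape)                                # torch.Size([1, 16000])
```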
Further, the amplitude spectrum estimation branch comprises a real-valued convolutional encoder, a real-valued long short-term memory (LSTM) network and a real-valued convolutional decoder. The real-valued convolutional encoder performs deep feature extraction on the magnitude spectrum of the input noisy speech signal to obtain a feature map carrying deep feature information; the feature map is input into the real-valued LSTM network to model the time dependence, and the magnitude spectrum of the enhanced speech signal is then reconstructed and combined with the phase of the original noisy speech signal to form a preliminarily enhanced speech signal.
The complex spectrum refinement branch comprises a complex-valued convolutional encoder, a complex-valued LSTM network and a complex-valued convolutional decoder. The complex-valued convolutional encoder performs deep feature extraction on the complex spectrum of the input noisy speech signal to obtain a feature map carrying deep feature information; the feature map is input into the complex-valued LSTM network to model the time dependence, and the complex-valued convolutional decoder recovers the details lost in the complex spectrum of the preliminarily enhanced speech.
The time-domain waveform correction branch comprises a real-valued convolutional encoder, a real-valued LSTM network and a real-valued convolutional decoder. The real-valued convolutional encoder performs deep feature extraction on the input framed time-domain noisy speech waveform to obtain a feature map carrying deep feature information; the feature map is input into the real-valued LSTM network to model the time dependence and is then decoded by the real-valued convolutional decoder.
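For illustration only, the encoder-LSTM-decoder layout shared by the real-valued branches can be sketched as follows; the number of layers, channel widths, strides and kernel sizes are assumptions made for the example and are not the configuration of the invention:

```python
import torch
import torch.nn as nn

class RealBranchSketch(nn.Module):
    """Toy real-valued branch: convolutional encoder -> LSTM over time -> convolutional
    decoder with skip connections, ending in a Sigmoid mask."""
    def __init__(self, ch=16, freq_bins=161):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, ch, (3, 3), stride=(1, 2), padding=(1, 1)),
                                  nn.BatchNorm2d(ch), nn.PReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch, (3, 3), stride=(1, 2), padding=(1, 1)),
                                  nn.BatchNorm2d(ch), nn.PReLU())
        f = (freq_bins // 2 + 1) // 2 + 1             # frequency bins left after two stride-2 convs
        self.lstm = nn.LSTM(ch * f, ch * f, batch_first=True)
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(2 * ch, ch, (3, 3), stride=(1, 2), padding=(1, 1)),
                                  nn.BatchNorm2d(ch), nn.PReLU())
        self.dec1 = nn.ConvTranspose2d(2 * ch, 1, (3, 3), stride=(1, 2), padding=(1, 1))

    def forward(self, x):                             # x: (batch, 1, time, freq) magnitude spectrogram
        e1 = self.enc1(x)                             # downsample along frequency
        e2 = self.enc2(e1)
        b, c, t, f = e2.shape
        h, _ = self.lstm(e2.permute(0, 2, 1, 3).reshape(b, t, c * f))   # model the time dependence
        h = h.reshape(b, t, c, f).permute(0, 2, 1, 3)
        d2 = self.dec2(torch.cat([h, e2], dim=1))     # skip connection to the encoder
        out = self.dec1(torch.cat([d2, e1], dim=1))
        return torch.sigmoid(out)                     # e.g. a magnitude mask

x = torch.randn(2, 1, 100, 161)                       # two noisy magnitude spectrograms
print(RealBranchSketch()(x).shape)                    # torch.Size([2, 1, 100, 161])
```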
Further, the complex convolutional encoder is formed by stacking six convolution blocks, and each convolution block consists of a complex convolution layer, a complex batch normalization layer and a complex parametric ReLU (PReLU) activation function. The complex convolution layer is simulated with four real convolutions according to the rules of complex multiplication. Let the complex filter matrix be W = W_r + jW_i and the complex input be X = X_r + jX_i, where W_r and W_i are real-valued filter tensors and X_r and X_i are real-valued input tensors; the complex operation is simulated with the real-valued parts, and the output of the complex convolution is expressed as
F_out = (X_r * W_r - X_i * W_i) + j(X_r * W_i + X_i * W_r)  (1)
where F_out is the output of the complex convolution layer. Similarly, the output F_LSTM of the complex LSTM layer is defined as
F_LSTM = (F_rr - F_ii) + j(F_ri + F_ir)
F_rr = LSTM_r(X_r), F_ii = LSTM_i(X_i)
F_ri = LSTM_i(X_r), F_ir = LSTM_r(X_i)  (2)
where LSTM denotes a conventional LSTM network and the subscripts r and i denote the real-part and imaginary-part sub-networks, respectively. The complex convolutional decoder is formed by stacking six deconvolution blocks of corresponding sizes, and residual connections are used between the encoder and the decoder.
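For illustration only, formulas (1) and (2) can be realized with real-valued PyTorch layers as sketched below; the channel counts and kernel size are arbitrary examples, and the complex batch normalization and complex PReLU layers of the convolution blocks are omitted:

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex convolution simulated with real convolutions as in formula (1):
    F_out = (X_r*W_r - X_i*W_i) + j(X_r*W_i + X_i*W_r)."""
    def __init__(self, in_ch, out_ch, kernel, stride=1, padding=0):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel, stride, padding)
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel, stride, padding)

    def forward(self, x_r, x_i):
        out_r = self.conv_r(x_r) - self.conv_i(x_i)
        out_i = self.conv_i(x_r) + self.conv_r(x_i)
        return out_r, out_i

class ComplexLSTM(nn.Module):
    """Complex LSTM as in formula (2), built from two real LSTMs."""
    def __init__(self, in_size, hidden):
        super().__init__()
        self.lstm_r = nn.LSTM(in_size, hidden, batch_first=True)
        self.lstm_i = nn.LSTM(in_size, hidden, batch_first=True)

    def forward(self, x_r, x_i):
        f_rr, _ = self.lstm_r(x_r)
        f_ii, _ = self.lstm_i(x_i)
        f_ri, _ = self.lstm_i(x_r)
        f_ir, _ = self.lstm_r(x_i)
        return f_rr - f_ii, f_ri + f_ir

# toy check on a complex spectrum split into real and imaginary parts
xr, xi = torch.randn(2, 1, 50, 161), torch.randn(2, 1, 50, 161)
yr, yi = ComplexConv2d(1, 8, (3, 3), padding=1)(xr, xi)
print(yr.shape, yi.shape)
```

A complex-valued deconvolution block for the decoder side can follow the same pattern with nn.ConvTranspose2d.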
Further, the feature extraction path of the amplitude spectrum estimation branch and the feature extraction path of the complex spectrum refinement branch each consist of a time-frequency-domain multi-scale convolution block. The input first passes in parallel through three convolution layers with kernel sizes 3×1, 1×3 and 3×3; their outputs are concatenated and fed into a convolution block consisting of a 1×1 convolution layer, batch normalization and a Sigmoid activation function. The outputs of the feature extraction paths of the amplitude spectrum estimation branch and of the complex spectrum refinement branch are expressed as
F_M^i = F_m(concat(W_1 * M_i + b_1, W_2 * M_i + b_2, W_3 * M_i + b_3))  (3)
F_C^i = F_c(concat(W_1 * |C_i| + b_1, W_2 * |C_i| + b_2, W_3 * |C_i| + b_3))  (4)
where F_M^i and F_C^i denote the feature maps obtained by the feature extraction paths of the amplitude spectrum estimation branch and of the complex spectrum refinement branch, respectively; M_i and |C_i| denote the magnitude information from the amplitude spectrum estimation branch and the magnitude from the complex spectrum refinement branch at the i-th cross-domain information fusion module; θ_i denotes the phase of the complex spectrum refinement branch; W_1, W_2 and W_3 denote the weight matrices of the convolution layers with kernel sizes 3×1, 1×3 and 3×3, and b_1, b_2 and b_3 denote the corresponding biases; the concat function denotes concatenation along the channel dimension; F_m and F_c denote the mapping functions of the convolution blocks of the two branches, and * denotes the convolution operation. The parameters W_1, W_2, W_3 and b_1, b_2, b_3 in formulas (3) and (4) are not shared between the two branches.
The feature extraction path of the time-domain waveform correction branch consists of a time-domain multi-scale convolution block. The input first passes in parallel through two convolution layers with kernel sizes 1×3 and 3×3; their outputs are concatenated and fed into a convolution block consisting of a 1×1 convolution layer, batch normalization and a Sigmoid activation function. The output is then passed through average pooling and maximum pooling, and the two resulting feature maps are summed to give the output of the feature extraction path of the time-domain waveform correction branch:
X_out = F_t(concat(W_1 * w_i + b_1, W_2 * w_i + b_2))
F_W^i = AvgPool(X_out) + MaxPool(X_out)
where X_out denotes the output of the time-domain multi-scale convolution block, F_W^i denotes the feature map obtained by the feature extraction path of the time-domain waveform correction branch, w_i denotes the input of the i-th cross-domain information fusion module from the time-domain waveform correction branch, W_1 and W_2 are the weights of the convolution kernels, b_1 and b_2 are the biases, F_t is the mapping function of the convolution block of the time-domain waveform correction branch, and AvgPool and MaxPool denote average pooling and maximum pooling, respectively.
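For illustration only, the time-frequency-domain multi-scale convolution block can be sketched as follows (channel counts are example values); the time-domain multi-scale convolution block follows the same pattern with two kernel sizes and an average-pooling/max-pooling sum on its output:

```python
import torch
import torch.nn as nn

class TFMultiScaleBlock(nn.Module):
    """Sketch of the time-frequency multi-scale convolution block: parallel 3x1, 1x3
    and 3x3 convolutions, channel concatenation, then a 1x1 convolution with batch
    normalization and Sigmoid (the F_m / F_c mapping of formulas (3) and (4))."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv31 = nn.Conv2d(in_ch, out_ch, (3, 1), padding=(1, 0))
        self.conv13 = nn.Conv2d(in_ch, out_ch, (1, 3), padding=(0, 1))
        self.conv33 = nn.Conv2d(in_ch, out_ch, (3, 3), padding=(1, 1))
        self.fuse = nn.Sequential(nn.Conv2d(3 * out_ch, out_ch, 1),
                                  nn.BatchNorm2d(out_ch),
                                  nn.Sigmoid())

    def forward(self, x):                               # x: (batch, C, time, freq) magnitude features
        y = torch.cat([self.conv31(x), self.conv13(x), self.conv33(x)], dim=1)
        return self.fuse(y)

feats = torch.randn(2, 16, 100, 161)                    # e.g. M_i from the magnitude branch
print(TFMultiScaleBlock(16, 16)(feats).shape)            # torch.Size([2, 16, 100, 161])
```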
The feature maps extracted in the feature extraction stage from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch are sent to the feature fusion stage: the three feature maps are first multiplied element-wise and then fed into a convolution block consisting of a convolution layer, batch normalization and a Sigmoid activation function, which finally outputs the cross-domain fused feature map:
F_fuse^i = σ(BN(W * (F_M^i ⊙ F_C^i ⊙ F_W^i) + b))
where F_fuse^i denotes the fusion tensor of the i-th cross-domain information fusion module, BN denotes batch normalization, σ denotes the Sigmoid activation function and ⊙ denotes element-wise multiplication; F_M^i, F_C^i and F_W^i denote the feature maps from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch, respectively; W is the weight of the convolution kernel and b is the bias.
The feature projection stage receives the cross-domain fused feature map from the feature fusion stage and maps it onto the magnitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch, respectively. Both the mapping to the amplitude spectrum estimation branch and the mapping to the complex spectrum refinement branch consist of one convolution layer, batch normalization and a Sigmoid activation function:
G_M^i = σ(BN(W_M * F_fuse^i + b_M))
G_C^i = σ(BN(W_C * F_fuse^i + b_C))
where G_M^i and G_C^i denote the cross-domain enhancement correction masks mapped to the magnitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch, respectively.
The magnitude information of the amplitude spectrum estimation branch and of the complex spectrum refinement branch input into the cross-domain information fusion module is multiplied element-wise by the corresponding cross-domain enhancement correction mask output by the fusion module of that intermediate layer, completing the amplitude compensation of the two branches:
M̃_i = M_i ⊙ G_M^i
|C̃_i| = |C_i| ⊙ G_C^i
C̃_i = |C̃_i| · e^(jθ_i)
where M̃_i denotes the output of the amplitude spectrum estimation branch after amplitude compensation, |C̃_i| denotes the magnitude output of the complex spectrum refinement branch after amplitude compensation, and C̃_i denotes the final complex-valued output of the complex spectrum refinement branch after amplitude compensation.
Further, in step 3, reconstructing the spectrum of the finally output enhanced speech signal includes:
given the ideal-ratio-mask prediction M̃ output by the amplitude spectrum estimation branch and the residual S̃_res = S̃_r + j·S̃_i output by the complex spectrum refinement branch, where S̃_r and S̃_i denote the real and imaginary parts of the residual, the final spectrum reconstruction is
S̃ = M̃ ⊙ |X| · e^(jθ_X) + S̃_res
where S̃ denotes the complex spectrum of the enhanced speech signal, ⊙ denotes element-wise multiplication, |X| is the magnitude spectrum of the noisy speech to be enhanced, and θ_X = ∠X denotes the phase spectrum of the noisy speech signal.
Further, in step 1, the short-time Fourier transform (STFT) includes:
resampling the noisy speech so that all audio signals have a sampling rate of 16 kHz, framing with a Hanning window of frame length 30 ms and frame shift 10 ms, and then applying the STFT to obtain the real and imaginary parts of each frame of the noisy speech spectrum:
Y(t, f) = S(t, f) + N(t, f)  (12)
where
Y(t, f) = Y_r(t, f) + jY_i(t, f)
S(t, f) = S_r(t, f) + jS_i(t, f)
N(t, f) = N_r(t, f) + jN_i(t, f)
and Y(t, f) denotes the noisy speech spectrum after the short-time Fourier transform, t denotes the time dimension and f denotes the frequency dimension; S(t, f) and N(t, f) denote the clean speech and the background noise, and the subscripts r and i denote the real and imaginary parts of the spectrum, respectively. The number of STFT points is 320, and the 161 resulting dimensions correspond to the frequency range from 0 to 8000 Hz.
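For illustration only, the analysis stage can be sketched with torch.stft as below. The text specifies a 30 ms Hanning window with a 10 ms shift and 320 STFT points; since torch.stft requires the window length not to exceed the FFT size, this sketch uses a 320-sample (20 ms) window, which still yields the 161 frequency bins mentioned above:

```python
import torch

sr = 16000                                  # all audio resampled to 16 kHz
n_fft, hop, win_length = 320, 160, 320      # 320 STFT points -> 161 bins (0..8000 Hz), 10 ms shift
noisy = torch.randn(sr)                     # 1 s stand-in for the noisy waveform
window = torch.hann_window(win_length)

Y = torch.stft(noisy, n_fft=n_fft, hop_length=hop, win_length=win_length,
               window=window, return_complex=True)
print(Y.shape)                              # (161, num_frames), i.e. Y(t, f) per frame
print(Y.real.shape, Y.imag.shape)           # Y_r(t, f) and Y_i(t, f)
```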
Further, in step 2, the ideal ratio mask (IRM) is defined as follows:
the IRM is used as the training target for reconstructing the time-frequency representation of the speech to be enhanced; it is a predefined ideal mask, where |X| is the magnitude spectrum of the noisy speech to be enhanced and |S| is the magnitude spectrum of the clean speech signal.
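The defining formula of the IRM is not reproduced above. Since the mask is applied multiplicatively to the noisy magnitude |X| in step 3, one consistent reading is IRM = |S| / |X|; the following sketch assumes that definition and is illustrative only:

```python
import torch

eps = 1e-8
S = torch.rand(161, 100)          # magnitude spectrum of the clean speech, |S|
X = S + torch.rand(161, 100)      # magnitude spectrum of the noisy speech, |X|

# Assumed definition: the mask maps the noisy magnitude back to the clean magnitude.
irm = (S / (X + eps)).clamp(0.0, 1.0)
reconstructed = irm * X           # training target applied as in step 3
print(torch.allclose(reconstructed, S, atol=1e-3))
```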
In a second aspect, the invention provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the above single-channel speech enhancement method based on a step-by-step amplitude compensation network.
In a third aspect, the invention provides a computer-readable storage medium on which a computer program is stored which, when executed by a processor, implements the above single-channel speech enhancement method based on a step-by-step amplitude compensation network.
Compared with the prior art, the invention has the advantages that:
(1) The single-channel speech enhancement method based on a step-by-step amplitude compensation network remedies the implicit compensation effect between amplitude and phase caused by simultaneously reconstructing the real and imaginary parts of speech in the complex domain. The invention adopts an encoder-decoder based three-branch structure consisting of an amplitude spectrum estimation branch, a complex spectrum refinement branch and a time-domain waveform correction branch. The amplitude spectrum estimation branch filters out the main noise components, the complex spectrum refinement branch fills in the missing details, and, on the basis of implicitly estimating the phase information, the time-domain waveform is used to correct the two branches. To make full use of the information of the three branches, a cross-domain information fusion module is proposed and embedded into the three branches; it extracts and fuses the features of the three branches step by step, and corrects and amplitude-compensates the information of the amplitude spectrum estimation branch and the complex spectrum refinement branch. The invention effectively remedies the implicit compensation effect between amplitude and phase, improves the quality and intelligibility of the speech signal, and outperforms current state-of-the-art cross-domain speech enhancement methods and previous advanced systems.
(2) The invention is based on cross-domain processing and combines the advantages of the time domain and the time-frequency domain: a complex-domain branch, an amplitude-domain branch and a time-domain branch are introduced into the network, the harmonic information in the spectrum and the time-domain waveform information are used effectively, the details of the speech signal can be better recovered, and speech quality and intelligibility are improved.
(3) The invention decouples traditional complex spectrum estimation into a step-by-step optimization of amplitude and phase, thereby alleviating the compensation problem between amplitude and phase, i.e. the problem that in traditional complex spectrum estimation the accuracy of the amplitude is sacrificed to compensate the phase. The invention therefore reduces the mutual influence between amplitude and phase and improves speech enhancement performance.
(4) The invention provides a cross-domain information fusion module that performs step-by-step amplitude compensation on the amplitude spectrum estimation branch and the complex spectrum refinement branch at each stage and generates hierarchical cross-domain enhancement correction masks, thereby promoting the dynamic adjustment of information between the amplitude spectrum estimation branch and the complex spectrum refinement branch; it further uses information from the time-domain waveform to enhance the magnitudes of the two branches step by step, which improves speech enhancement performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that a person skilled in the art may obtain other drawings from these drawings without inventive effort.
FIG. 1 is an overall network architecture of the present invention;
FIG. 2 is a specific structure of a cross-domain information fusion module according to the present invention;
FIG. 3 is a specific structure of a time-frequency domain multi-scale convolution block according to the present invention;
fig. 4 is a specific structure of a time domain multi-scale convolution block in the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is evident that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
As shown in fig. 1, the single-channel speech enhancement method based on a step-by-step amplitude compensation network provided by the embodiment of the invention includes the following:
Step 1: performing a short-time Fourier transform (STFT) on the noisy speech signal to obtain the complex spectrum, magnitude spectrum and phase of each frame of the noisy speech spectrum;
Step 2: inputting the complex spectrum into the complex spectrum refinement branch of a three-branch network; inputting the magnitude spectrum into the amplitude spectrum estimation branch of the three-branch network; and inputting the framed noisy speech signal into the time-domain waveform correction branch of the three-branch network;
the magnitude features of the amplitude spectrum estimation branch, the magnitude of the complex spectrum refinement branch features, and the time-domain features output by each intermediate layer of the amplitude spectrum estimation branch, complex spectrum refinement branch and time-domain waveform correction branch are input into a cross-domain information fusion module;
the cross-domain information fusion module performs feature extraction, fusion and projection on the magnitude features of the amplitude spectrum estimation branch, the magnitude of the complex spectrum refinement branch features and the time-domain features of the time-domain waveform correction branch, obtains two cross-domain enhancement correction masks for the magnitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch, and uses the time-domain information to realize the correction, completing the compensation of the amplitude spectrum estimation branch and the complex spectrum refinement branch;
the cross-domain information fusion module comprises three stages: a feature extraction stage, a feature fusion stage and a feature projection stage;
in the feature extraction stage, deep feature extraction is performed on the magnitude information of the amplitude spectrum estimation branch to obtain a feature map for the amplitude spectrum estimation branch; deep feature extraction is performed on the magnitude information of the complex spectrum refinement branch to obtain a feature map for the magnitude information of the complex spectrum refinement branch; and deep feature extraction is performed on the time-domain information of the time-domain waveform correction branch to obtain a feature map for the time-domain waveform correction branch;
in the feature fusion stage, the feature maps of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch are fused to obtain a cross-domain fused feature map;
in the feature projection stage, the cross-domain fused feature map is projected onto the magnitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch, respectively, to obtain two cross-domain enhancement correction masks for the magnitudes of the two branches;
the magnitude information of the amplitude spectrum estimation branch and of the complex spectrum refinement branch input into the cross-domain information fusion module is multiplied by the corresponding cross-domain enhancement correction mask output by the fusion module of that intermediate layer, completing the amplitude compensation of the two branches;
a cross-domain information fusion module is introduced at each intermediate layer of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch, and step-by-step amplitude compensation is performed on the input noisy speech signal, forming the step-by-step amplitude compensation network;
the final output of the amplitude spectrum estimation branch serves as the estimated ideal ratio mask for the magnitude spectrum and filters out the main noise components; the final output of the complex spectrum refinement branch serves as the residual between the preliminarily denoised speech signal and the enhanced speech signal;
Step 3: multiplying the ideal ratio mask output by the amplitude spectrum estimation branch in step 2 element-wise with the magnitude spectrum from step 1, coupling the result with the phase from step 1 to form the preliminarily denoised speech signal, adding its spectrum to the residual output by the complex spectrum refinement branch in step 2 to reconstruct the spectrum of the finally output enhanced speech signal, and performing an inverse short-time Fourier transform (iSTFT) on the spectrum of the enhanced speech signal to obtain the enhanced speech signal.
The amplitude spectrum estimation branch includes a real-valued convolutional encoder, a real-valued long short-term memory (LSTM) network and a real-valued convolutional decoder. The real-valued convolutional encoder performs deep feature extraction on the magnitude spectrum of the input noisy speech signal to obtain a feature map carrying deep feature information; the feature map is input into the real-valued LSTM network to model the time dependence, and the magnitude spectrum of the enhanced speech signal is then reconstructed and combined with the phase of the original noisy speech signal to form a preliminarily enhanced speech signal.
The complex spectrum refinement branch includes a complex-valued convolutional encoder, a complex-valued LSTM network and a complex-valued convolutional decoder. The complex-valued convolutional encoder performs deep feature extraction on the complex spectrum of the input noisy speech signal to obtain a feature map carrying deep feature information; the feature map is input into the complex-valued LSTM network to model the time dependence, and the complex-valued convolutional decoder recovers the details lost in the complex spectrum of the preliminarily enhanced speech.
The time-domain waveform correction branch includes a real-valued convolutional encoder, a real-valued LSTM network and a real-valued convolutional decoder. The real-valued convolutional encoder performs deep feature extraction on the input framed time-domain noisy speech waveform to obtain a feature map carrying deep feature information; the feature map is input into the real-valued LSTM network to model the time dependence and is then decoded by the real-valued convolutional decoder.
The complex-valued convolutional encoder is formed by stacking six convolution blocks, and each convolution block consists of a complex convolution layer, a complex batch normalization layer and a complex parametric ReLU (PReLU) activation function. The complex convolution layer is simulated with four real convolutions according to the rules of complex multiplication. Let the complex filter matrix be W = W_r + jW_i and the complex input be X = X_r + jX_i, where W_r and W_i are real-valued filter tensors and X_r and X_i are real-valued input tensors; the complex operation is simulated with the real-valued parts, and the output of the complex convolution is expressed as
F_out = (X_r * W_r - X_i * W_i) + j(X_r * W_i + X_i * W_r)  (1)
where F_out is the output of the complex convolution layer. Similarly, the output F_LSTM of the complex LSTM layer is defined as
F_LSTM = (F_rr - F_ii) + j(F_ri + F_ir)
F_rr = LSTM_r(X_r), F_ii = LSTM_i(X_i)
F_ri = LSTM_i(X_r), F_ir = LSTM_r(X_i)  (2)
where LSTM denotes a conventional LSTM network and the subscripts r and i denote the real-part and imaginary-part sub-networks, respectively. The complex-valued convolutional decoder is formed by stacking six deconvolution blocks of corresponding sizes, and residual connections are used between the encoder and the decoder.
As shown in fig. 2, the cross-domain information fusion module includes a feature extraction path for the amplitude spectrum estimation branch, a feature extraction path for the complex spectrum refinement branch and a feature extraction path for the time-domain waveform correction branch.
The feature extraction paths of the amplitude spectrum estimation branch and of the complex spectrum refinement branch each consist of a time-frequency-domain multi-scale convolution block. As shown in fig. 3, the input first passes in parallel through three convolution layers with kernel sizes 3×1, 1×3 and 3×3; their outputs are concatenated and fed into a convolution block consisting of a 1×1 convolution layer, batch normalization and a Sigmoid activation function. The outputs of the two feature extraction paths are expressed as
F_M^i = F_m(concat(W_1 * M_i + b_1, W_2 * M_i + b_2, W_3 * M_i + b_3))  (3)
F_C^i = F_c(concat(W_1 * |C_i| + b_1, W_2 * |C_i| + b_2, W_3 * |C_i| + b_3))  (4)
where F_M^i and F_C^i denote the feature maps obtained by the feature extraction paths of the amplitude spectrum estimation branch and of the complex spectrum refinement branch, respectively; M_i and |C_i| denote the magnitude information from the amplitude spectrum estimation branch and the magnitude from the complex spectrum refinement branch at the i-th cross-domain information fusion module; θ_i denotes the phase of the complex spectrum refinement branch; W_1, W_2 and W_3 denote the weight matrices of the convolution layers with kernel sizes 3×1, 1×3 and 3×3, and b_1, b_2 and b_3 denote the corresponding biases; the concat function denotes concatenation along the channel dimension; F_m and F_c denote the mapping functions of the convolution blocks of the two branches, and * denotes the convolution operation. The parameters W_1, W_2, W_3 and b_1, b_2, b_3 in formulas (3) and (4) are not shared between the two branches.
The feature extraction path of the time-domain waveform correction branch consists of a time-domain multi-scale convolution block. As shown in fig. 4, the input first passes in parallel through two convolution layers with kernel sizes 1×3 and 3×3; their outputs are concatenated and fed into a convolution block consisting of a 1×1 convolution layer, batch normalization and a Sigmoid activation function. The output is then passed through average pooling and maximum pooling, and the two resulting feature maps are summed to give the output of the feature extraction path of the time-domain waveform correction branch:
X_out = F_t(concat(W_1 * w_i + b_1, W_2 * w_i + b_2))
F_W^i = AvgPool(X_out) + MaxPool(X_out)
where X_out denotes the output of the time-domain multi-scale convolution block, F_W^i denotes the feature map obtained by the feature extraction path of the time-domain waveform correction branch, w_i denotes the input of the i-th cross-domain information fusion module from the time-domain waveform correction branch, W_1 and W_2 are the weights of the convolution kernels, b_1 and b_2 are the biases, F_t is the mapping function of the convolution block of the time-domain waveform correction branch, and AvgPool and MaxPool denote average pooling and maximum pooling, respectively.
The feature maps extracted in the feature extraction stage from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch are sent to the feature fusion stage: the three feature maps are first multiplied element-wise and then fed into a convolution block consisting of a convolution layer, batch normalization and a Sigmoid activation function, which finally outputs the cross-domain fused feature map:
F_fuse^i = σ(BN(W * (F_M^i ⊙ F_C^i ⊙ F_W^i) + b))
where F_fuse^i denotes the fusion tensor of the i-th cross-domain information fusion module, BN denotes batch normalization, σ denotes the Sigmoid activation function and ⊙ denotes element-wise multiplication; F_M^i, F_C^i and F_W^i denote the feature maps from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch, respectively; W is the weight of the convolution kernel and b is the bias.
The feature projection stage receives the cross-domain fused feature map from the feature fusion stage and maps it onto the magnitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch, respectively. Both the mapping to the amplitude spectrum estimation branch and the mapping to the complex spectrum refinement branch consist of one convolution layer, batch normalization and a Sigmoid activation function:
G_M^i = σ(BN(W_M * F_fuse^i + b_M))
G_C^i = σ(BN(W_C * F_fuse^i + b_C))
where G_M^i and G_C^i denote the cross-domain enhancement correction masks mapped to the magnitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch, respectively.
The magnitude information of the amplitude spectrum estimation branch and of the complex spectrum refinement branch input into the cross-domain information fusion module is multiplied element-wise by the corresponding cross-domain enhancement correction mask output by the fusion module of that intermediate layer, completing the amplitude compensation of the two branches:
M̃_i = M_i ⊙ G_M^i
|C̃_i| = |C_i| ⊙ G_C^i
C̃_i = |C̃_i| · e^(jθ_i)
where M̃_i denotes the output of the amplitude spectrum estimation branch after amplitude compensation, |C̃_i| denotes the magnitude output of the complex spectrum refinement branch after amplitude compensation, and C̃_i denotes the final complex-valued output of the complex spectrum refinement branch after amplitude compensation.
Reconstructing the spectrum of the finally output enhanced speech signal includes:
given the ideal-ratio-mask prediction M̃ output by the amplitude spectrum estimation branch and the residual S̃_res = S̃_r + j·S̃_i output by the complex spectrum refinement branch, where S̃_r and S̃_i denote the real and imaginary parts of the residual, the final spectrum reconstruction is
S̃ = M̃ ⊙ |X| · e^(jθ_X) + S̃_res
where S̃ denotes the complex spectrum of the enhanced speech signal, ⊙ denotes element-wise multiplication, |X| is the magnitude spectrum of the noisy speech to be enhanced, and θ_X = ∠X denotes the phase spectrum of the noisy speech signal.
In step 1 above, the short-time Fourier transform (STFT) includes:
resampling the noisy speech so that all audio signals have a sampling rate of 16 kHz, framing with a Hanning window of frame length 30 ms and frame shift 10 ms, and then applying the STFT to obtain the real and imaginary parts of each frame of the noisy speech spectrum:
Y(t, f) = S(t, f) + N(t, f)  (12)
where
Y(t, f) = Y_r(t, f) + jY_i(t, f)
S(t, f) = S_r(t, f) + jS_i(t, f)
N(t, f) = N_r(t, f) + jN_i(t, f)
and Y(t, f) denotes the noisy speech spectrum after the short-time Fourier transform, t denotes the time dimension and f denotes the frequency dimension; S(t, f) and N(t, f) denote the clean speech and the background noise, and the subscripts r and i denote the real and imaginary parts of the spectrum, respectively. The number of STFT points is 320, and the 161 resulting dimensions correspond to the frequency range from 0 to 8000 Hz.
In step 2 above, the ideal ratio mask (IRM) is used as the training target for reconstructing the time-frequency representation of the speech to be enhanced; it is a predefined ideal mask, where |X| is the magnitude spectrum of the noisy speech to be enhanced and |S| is the magnitude spectrum of the clean speech signal.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smart phone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, and the computer program comprising instructions for performing the steps of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disc) storing a computer program which, when executed by a computer, implements the steps of the method of the invention.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any changes or substitutions that can readily be conceived by those skilled in the art within the technical scope disclosed herein shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the appended claims.

Claims (9)

1. A single-channel speech enhancement method based on a step-by-step amplitude compensation network, comprising:
step 1: performing a short-time Fourier transform (STFT) on the noisy speech signal to obtain the complex spectrum, the amplitude spectrum, and the phase of each frame of the noisy speech signal spectrum;
step 2: the complex spectrum is input into a complex spectrum refinement branch of a three-branch network; the amplitude spectrum is input into an amplitude spectrum estimation branch of the three-branch network; and the noisy speech signal, after framing, is input into a time domain waveform correction branch of the three-branch network;
the amplitude of the amplitude spectrum estimation branch, the amplitude of the complex spectrum refinement branch, and the time domain information output by each intermediate layer of the amplitude spectrum estimation branch, the complex spectrum refinement branch, and the time domain waveform correction branch, respectively, are input into a cross-domain information fusion module;
The cross-domain information fusion module performs feature extraction, fusion, and projection on the amplitude of the amplitude spectrum estimation branch, the amplitude of the complex spectrum refinement branch, and the time domain information of the time domain waveform correction branch, respectively, to obtain two cross-domain enhancement correction masks for the amplitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch, and realizes correction with the time domain information, completing the compensation of the amplitude spectrum estimation branch and the complex spectrum refinement branch;
the cross-domain information fusion module comprises three stages, namely a feature extraction stage, a feature fusion stage and a feature projection stage;
in the feature extraction stage, deep feature extraction is performed on the amplitude information of the amplitude spectrum estimation branch to obtain a feature map for the amplitude spectrum estimation branch; deep feature extraction is performed on the amplitude information of the complex spectrum refinement branch to obtain a feature map for that amplitude information; and deep feature extraction is performed on the time domain information of the time domain waveform correction branch to obtain a feature map for the time domain waveform correction branch;
in the feature fusion stage, the feature maps of the amplitude spectrum estimation branch, the complex spectrum refinement branch, and the time domain waveform correction branch are fused to obtain a cross-domain fused feature map;
in the feature projection stage, the cross-domain fused feature map is projected onto the amplitudes of the amplitude spectrum estimation branch and of the complex spectrum refinement branch respectively, so as to obtain two cross-domain enhancement correction masks, one for the amplitude of each of the two branches;
the amplitude information of the amplitude spectrum estimation branch and of the complex spectrum refinement branch that is input into the cross-domain information fusion module is multiplied by the respective cross-domain enhancement correction mask output by the cross-domain information fusion module of that intermediate layer, so that the amplitude compensation of the two branches is completed;
introducing a cross-domain information fusion module into each intermediate layer of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch, and performing step-by-step amplitude compensation on an input voice signal with noise to form a step-by-step amplitude compensation network;
the final output of the amplitude spectrum estimation branch is used as the estimated ideal ratio mask for the amplitude spectrum, filtering out the main noise components; the final output of the complex spectrum refinement branch is used as the residual between the preliminarily denoised speech signal and the enhanced speech signal;
step 3: point-wise multiplying the ideal ratio mask output by the amplitude spectrum estimation branch in step 2 with the amplitude spectrum obtained in step 1, coupling the result with the phase obtained in step 1 to form a preliminarily denoised speech signal, adding the spectrum of the preliminarily denoised speech signal to the residual output by the complex spectrum refinement branch in step 2 to reconstruct the spectrum of the finally output enhanced speech signal, and performing an inverse short-time Fourier transform (iSTFT) on the spectrum of the enhanced speech signal to obtain the enhanced speech signal.
2. The single-channel speech enhancement method based on a step-by-step amplitude compensation network of claim 1, wherein: the amplitude spectrum estimation branch comprises a real-valued convolution encoder, a real-valued long short-term memory network LSTM, and a real-valued convolution decoder; the real-valued convolution encoder performs deep feature extraction on the amplitude spectrum of the input noisy speech signal to obtain a feature map carrying deep feature information, the feature map is input into the real-valued long short-term memory network LSTM to model the time dependence, and the amplitude spectrum of the enhanced speech signal reconstructed by the real-valued convolution decoder is then paired with the phase of the original noisy speech signal to form a preliminarily enhanced speech signal;
the complex spectrum refinement branch comprises a complex convolution encoder, a complex long short-term memory network LSTM, and a complex convolution decoder; the complex convolution encoder performs deep feature extraction on the complex spectrum of the input noisy speech signal to obtain a feature map carrying deep feature information, the feature map is input into the complex long short-term memory network LSTM to model the time dependence, and the details lost in the complex spectrum of the preliminarily enhanced speech are recovered through the complex convolution decoder;
The time domain waveform correction branch comprises a real-valued convolution encoder, a real-valued long short-term memory network LSTM, and a real-valued convolution decoder; the real-valued convolution encoder performs deep feature extraction on the input framed time-domain noisy speech waveform to obtain a feature map carrying deep feature information, the feature map is input into the real-valued long short-term memory network LSTM to model the time dependence, and the result is then decoded by the real-valued convolution decoder.
3. The single-channel speech enhancement method based on a step-by-step amplitude compensation network of claim 2, wherein: the complex convolution encoder is formed by stacking six convolution blocks, and each convolution block consists of a complex convolution layer, a complex batch normalization layer, and a complex parametric ReLU activation function; the complex convolution layer is realized with four real-valued convolution operations following the rule of complex multiplication: a complex filter matrix W = W_r + jW_i and a complex-valued input X = X_r + jX_i are set, where W_r and W_i are real-valued filter tensors and X_r and X_i are real-valued input tensors, so that real-valued parts simulate the complex operation, and the output of the complex convolution operation is expressed as:
F_out = (X_r * W_r - X_i * W_i) + j(X_r * W_i + X_i * W_r)    (1)
where F_out is the output of the complex-valued convolution layer; similarly, the output F_LSTM of the complex-valued LSTM layer is defined as:
F_LSTM = (F_rr - F_ii) + j(F_ri + F_ir)
F_rr = LSTM_r(X_r), F_ii = LSTM_i(X_i)
F_ri = LSTM_i(X_r), F_ir = LSTM_r(X_i)    (2)
where LSTM denotes an LSTM neural network, and the subscripts r and i denote the real-part and imaginary-part sub-networks, respectively; the complex convolution decoder is formed by stacking six deconvolution blocks of corresponding sizes, and residual connections are used between the encoder and the decoder (an illustrative sketch of this complex-valued convolution and LSTM is given after the claims).
4. The single-channel speech enhancement method based on a step-by-step amplitude compensation network of claim 1, wherein: the feature extraction paths of the amplitude spectrum estimation branch and of the complex spectrum refinement branch each consist of a time-frequency-domain multi-scale convolution block: the input first passes through three convolution layers with convolution kernel sizes of 3*1, 1*3 and 3*3 respectively, the outputs of the three convolution layers are concatenated and then fed into a convolution block consisting of a convolution layer with kernel size 1*1, batch normalization, and a Sigmoid activation function; the outputs of the feature extraction path of the amplitude spectrum estimation branch and of the feature extraction path of the complex spectrum refinement branch are expressed as:

F_M^i = F_m(concat(W_1 * M_i + b_1, W_2 * M_i + b_2, W_3 * M_i + b_3))    (3)
F_C^i = F_c(concat(W_1 * |C_i| + b_1, W_2 * |C_i| + b_2, W_3 * |C_i| + b_3))    (4)

where F_M^i and F_C^i denote the feature maps obtained by the feature extraction path of the amplitude spectrum estimation branch and by the feature extraction path of the complex spectrum refinement branch, respectively; M_i and |C_i| respectively denote the amplitude information from the amplitude spectrum estimation branch and the amplitude from the complex spectrum refinement branch at the i-th layer cross-domain information fusion module, and θ_i denotes the phase of the complex spectrum refinement branch; W_1, W_2 and W_3 denote the weight matrices of the convolution layers with kernel sizes 3*1, 1*3 and 3*3, respectively, and b_1, b_2 and b_3 denote the corresponding biases; the concat function denotes concatenation along the channel dimension, and F_m and F_c are the mapping functions of the convolution blocks of the two branches, representing the convolution operations; the W_1, W_2 and W_3 of formulas (3) and (4) are different, as are b_1, b_2 and b_3;
the feature extraction path of the time domain waveform correction branch consists of a time-domain multi-scale convolution block: the input first passes through convolution layers with convolution kernel sizes of 1*3 and 3*3 respectively, the outputs of these convolution layers are concatenated and then fed into a convolution block consisting of a convolution layer with kernel size 1*1, batch normalization, and a Sigmoid activation function; the feature maps obtained by applying average pooling and maximum pooling to this output are then summed and used as the output of the feature extraction path of the time domain waveform correction branch, as follows:

X_out = F_t(concat(W_1 * w_i + b_1, W_2 * w_i + b_2))
F_W^i = AvgPool(X_out) + MaxPool(X_out)

where X_out denotes the output of the time-domain multi-scale convolution block, F_W^i denotes the feature map obtained by the feature extraction path of the time domain waveform correction branch, w_i denotes the input of the i-th layer cross-domain information fusion module from the time domain waveform correction branch, W_1 and W_2 are the weights of the convolution kernels, and b_1 and b_2 are the biases; F_t is the mapping function of the convolution block of the time domain waveform correction branch, and AvgPool and MaxPool denote average pooling and maximum pooling, respectively;
the feature maps extracted in the feature extraction stage from the amplitude spectrum estimation branch, the complex spectrum refinement branch, and the time domain waveform correction branch are sent to the feature fusion stage: the three feature maps are first multiplied point-wise and then fed into a convolution block consisting of a convolution layer, batch normalization, and a Sigmoid activation function, and the cross-domain fused feature map is output as follows:

F_fusion^i = σ(BN(W * (F_M^i ⊙ F_C^i ⊙ F_W^i) + b))

where F_fusion^i denotes the fusion tensor of the i-th layer cross-domain information fusion module, BN denotes batch normalization, and σ denotes the Sigmoid activation function; F_M^i, F_C^i and F_W^i denote the feature maps from the amplitude spectrum estimation branch, the complex spectrum refinement branch, and the time domain waveform correction branch, respectively, W is the weight of the convolution kernel, b is the bias, and ⊙ denotes point-wise multiplication;
the feature mapping stage receives the cross-domain fused feature map from the feature fusion stage and maps it onto the amplitude of the amplitude spectrum estimation branch and onto the complex spectrum refinement branch, respectively; the mapping to the amplitude spectrum estimation branch and the mapping to the complex spectrum refinement branch each consist of a convolution layer, batch normalization, and a Sigmoid activation function, as follows:

M_mask^i = σ(BN(W_M * F_fusion^i + b_M))
C_mask^i = σ(BN(W_C * F_fusion^i + b_C))

where M_mask^i and C_mask^i denote the cross-domain enhancement correction masks mapped to the amplitude of the amplitude spectrum estimation branch and to the complex spectrum refinement branch, respectively, and W_M, b_M, W_C and b_C denote the weights and biases of the two convolution layers;
the amplitude information of the amplitude spectrum estimation branch and of the complex spectrum refinement branch that is input into the cross-domain information fusion module is multiplied by the corresponding cross-domain enhancement correction mask output by the cross-domain information fusion module of that intermediate layer, so that the amplitude compensation of the two branches is completed, as follows:

M_i' = M_i ⊙ M_mask^i
|C_i'| = |C_i| ⊙ C_mask^i
C_i' = |C_i'| · e^(jθ_i)

where M_i' denotes the output of the amplitude spectrum estimation branch after amplitude compensation, |C_i'| denotes the amplitude of the complex spectrum refinement branch after amplitude compensation, C_i' denotes the final complex output of the complex spectrum refinement branch after amplitude compensation, and ⊙ denotes point-wise multiplication.
5. The single-channel speech enhancement method based on a step-by-step amplitude compensation network of claim 1, wherein: in the step 3, reconstructing the spectrum of the finally output enhanced speech signal includes:
given the ideal ratio mask M_est predicted by the amplitude spectrum estimation branch and the residual C_res = C_r + jC_i output by the complex spectrum refinement branch, where C_r and C_i respectively denote the real and imaginary parts of the residual, the final spectrum reconstruction is:

S_enh = M_est ⊙ |X| · e^(jθ_X) + C_res

where S_enh represents the complex spectrum of the enhanced speech signal, |X| is the amplitude spectrum of the speech to be enhanced, θ_X represents the phase spectrum of the noisy speech signal, and ⊙ denotes point-wise multiplication.
6. The single-channel speech enhancement method based on a step-by-step amplitude compensation network of claim 1, wherein: in the step 1, the short-time Fourier transform STFT includes:
the noisy speech is resampled so that the sampling rate of all audio signals is 16 kHz, framing is performed with a Hanning window with a frame length of 30 ms and a frame shift of 10 ms, and a short-time Fourier transform (STFT) is then applied to obtain the real and imaginary parts of each frame of the noisy speech signal spectrum, as follows:
Y(t,f) = S(t,f) + N(t,f)    (12)
where
Y(t,f) = Y_r(t,f) + jY_i(t,f)
S(t,f) = S_r(t,f) + jS_i(t,f)
where Y(t,f) represents the noisy speech spectrum after the short-time Fourier transform, t denotes the time dimension, and f denotes the frequency dimension; S(t,f) and N(t,f) represent the clean speech and the background noise, and the subscripts r and i denote the real and imaginary parts of the spectrum, respectively; the number of short-time Fourier transform points is 320, and the 161 frequency dimensions after the transform correspond to the range from 0 to 8000 Hz.
7. The single-channel speech enhancement method based on a step-by-step amplitude compensation network of claim 1, wherein: in the step 2, the ideal ratio mask IRM is as follows:
the ideal ratio mask IRM is used as the training target for reconstructing the time-frequency representation of the speech to be enhanced; it is the predefined ideal mask IRM = |S| / |X|, where |X| is the amplitude spectrum of the speech to be enhanced and |S| is the amplitude spectrum of the clean speech signal.
8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the program, implements the method of any one of claims 1-7.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1-7.
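The following is the illustrative sketch referenced in claim 3: a minimal PyTorch implementation of the complex-valued convolution of formula (1) and the complex-valued LSTM of formula (2), built from real-valued layers. The class names, channel counts, and toy input shapes are assumptions made for illustration and are not taken from the patent.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """F_out = (X_r*W_r - X_i*W_i) + j(X_r*W_i + X_i*W_r), formula (1)."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)  # W_r
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)  # W_i

    def forward(self, x_r, x_i):
        out_r = self.conv_r(x_r) - self.conv_i(x_i)
        out_i = self.conv_i(x_r) + self.conv_r(x_i)
        return out_r, out_i

class ComplexLSTM(nn.Module):
    """F_LSTM = (F_rr - F_ii) + j(F_ri + F_ir), formula (2)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm_r = nn.LSTM(input_size, hidden_size, batch_first=True)  # LSTM_r
        self.lstm_i = nn.LSTM(input_size, hidden_size, batch_first=True)  # LSTM_i

    def forward(self, x_r, x_i):
        f_rr, _ = self.lstm_r(x_r)
        f_ii, _ = self.lstm_i(x_i)
        f_ri, _ = self.lstm_i(x_r)
        f_ir, _ = self.lstm_r(x_i)
        return f_rr - f_ii, f_ri + f_ir

# Toy usage: batch of 2, one input channel, 161 x 10 time-frequency patch
xr, xi = torch.randn(2, 1, 161, 10), torch.randn(2, 1, 161, 10)
yr, yi = ComplexConv2d(1, 8, kernel_size=3, padding=1)(xr, xi)
print(yr.shape, yi.shape)  # torch.Size([2, 8, 161, 10]) twice
```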
CN202310969308.6A 2023-08-01 2023-08-01 Single-channel voice enhancement method based on step-by-step amplitude compensation network Pending CN116913303A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310969308.6A CN116913303A (en) 2023-08-01 2023-08-01 Single-channel voice enhancement method based on step-by-step amplitude compensation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310969308.6A CN116913303A (en) 2023-08-01 2023-08-01 Single-channel voice enhancement method based on step-by-step amplitude compensation network

Publications (1)

Publication Number Publication Date
CN116913303A true CN116913303A (en) 2023-10-20

Family

ID=88364795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310969308.6A Pending CN116913303A (en) 2023-08-01 2023-08-01 Single-channel voice enhancement method based on step-by-step amplitude compensation network

Country Status (1)

Country Link
CN (1) CN116913303A (en)

Similar Documents

Publication Publication Date Title
US10580430B2 (en) Noise reduction using machine learning
CN111081268A (en) Phase-correlated shared deep convolutional neural network speech enhancement method
CN110148420A (en) A kind of audio recognition method suitable under noise circumstance
CN111768796A (en) Acoustic echo cancellation and dereverberation method and device
CN112863535A (en) Residual echo and noise elimination method and device
CN115295001B (en) Single-channel voice enhancement method based on progressive fusion correction network
CN111883154B (en) Echo cancellation method and device, computer-readable storage medium, and electronic device
CN114792524B (en) Audio data processing method, apparatus, program product, computer device and medium
CN117174105A (en) Speech noise reduction and dereverberation method based on improved deep convolutional network
CN116682444A (en) Single-channel voice enhancement method based on waveform spectrum fusion network
CN114495960A (en) Audio noise reduction filtering method, noise reduction filtering device, electronic equipment and storage medium
CN115295002B (en) Single-channel voice enhancement method based on interactive time-frequency attention mechanism
CN117351983A (en) Transformer-based voice noise reduction method and system
CN116913303A (en) Single-channel voice enhancement method based on step-by-step amplitude compensation network
CN110992966B (en) Human voice separation method and system
CN114067825A (en) Comfort noise generation method based on time-frequency masking estimation and application thereof
CN114220451A (en) Audio denoising method, electronic device, and storage medium
CN114360560A (en) Speech enhancement post-processing method and device based on harmonic structure prediction
CN112652321A (en) Voice noise reduction system and method based on deep learning phase friendlier
CN113763978A (en) Voice signal processing method, device, electronic equipment and storage medium
CN113724723B (en) Reverberation and noise suppression method and device, electronic equipment and storage medium
CN117219107B (en) Training method, device, equipment and storage medium of echo cancellation model
US20240161766A1 (en) Robustness/performance improvement for deep learning based speech enhancement against artifacts and distortion
CN114566176B (en) Residual echo cancellation method and system based on deep neural network
Sheng et al. Speech noise reduction algorithm based on CA-DCDCCRN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination