CN116913303A - Single-channel voice enhancement method based on step-by-step amplitude compensation network - Google Patents
- Publication number
- CN116913303A CN116913303A CN202310969308.6A CN202310969308A CN116913303A CN 116913303 A CN116913303 A CN 116913303A CN 202310969308 A CN202310969308 A CN 202310969308A CN 116913303 A CN116913303 A CN 116913303A
- Authority
- CN
- China
- Prior art keywords
- branch
- amplitude
- spectrum
- complex
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Noise filtering with processing in the time domain
- G10L21/0232—Noise filtering with processing in the frequency domain
- G10L21/0264—Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
- G10L25/03—Speech or voice analysis characterised by the type of extracted parameters
- G10L25/18—Extracted parameters being spectral information of each sub-band
- G10L25/30—Analysis technique using neural networks
- G10L2021/02163—Only one microphone
- G06F18/2131—Feature extraction based on a transform domain processing, e.g. wavelet transform
- G06F18/253—Fusion techniques of extracted features
- G06F2123/02—Data types in the time domain, e.g. time-series data
- G06F2218/10—Feature extraction by analysing the shape of a waveform
- G06N3/0442—Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/048—Activation functions
- Y02D30/70—Reducing energy consumption in wireless communication networks
Abstract
The invention relates to a single-channel speech enhancement method based on a step-by-step amplitude compensation network. The method adopts a three-branch encoder-decoder structure comprising an amplitude spectrum estimation branch, a complex spectrum refinement branch and a time-domain waveform correction branch. The amplitude spectrum estimation branch filters out the dominant noise components, the complex spectrum refinement branch complements the missing spectral details, and the time-domain waveform branch corrects the other two branches while implicitly estimating phase information. To fully exploit the information of the three branches, a cross-domain information fusion module is proposed and embedded into all of them: it progressively extracts and fuses the features of the three branches and performs correction and amplitude compensation on the amplitude spectrum estimation branch and the complex spectrum refinement branch. The invention effectively mitigates the implicit compensation effect between amplitude and phase, improves the quality and intelligibility of the speech signal, and outperforms current state-of-the-art cross-domain speech enhancement methods and previous advanced systems.
Description
Technical Field
The invention relates to the field of voice enhancement, in particular to a single-channel voice enhancement method based on a step-by-step amplitude compensation network.
Background
Single-channel speech enhancement refers to removing background noise from a noisy speech signal captured by a single microphone. Because no speech signals from other microphones are available as references, it is a very challenging task. In recent years, speech enhancement methods based on deep learning have shown outstanding performance in this field; in particular, under non-stationary noise and low signal-to-noise ratios, deep learning methods clearly outperform traditional single-channel speech enhancement algorithms. Convolutional neural networks and recurrent neural networks are the two most commonly used architectures for speech enhancement.
In 2020, a deep complex neural network was proposed that combines a complex-valued convolutional neural network with an LSTM network; it won first place in the Real-Time Track of the 2020 DNS (Deep Noise Suppression) challenge. However, such single-branch speech enhancement systems introduce a compensation problem between amplitude and phase, which may cause the real and imaginary parts to converge to a locally suboptimal solution and degrade performance in challenging scenarios.
To solve this problem, a target decoupling strategy was proposed that decomposes the original optimization target into several interrelated sub-targets. To this end, two effective network architectures were designed in the time-frequency domain: a multi-stage deep neural network and a dual-path deep neural network. In the former, the network jointly optimizes the output of each stage to gradually enhance speech quality; in the latter, the two paths optimize their respective objectives in parallel and cooperatively reconstruct the enhanced speech spectrum. However, these time-frequency-domain methods ignore the fact that time-domain methods can avoid the amplitude-phase compensation problem altogether. Moreover, the dual-path deep neural network merely fuses the information of its branches through simple interaction, omitting any dynamic adjustment of the information exchanged between branches, which ultimately limits the quality and intelligibility of the enhanced speech.
CN202210885817.6 describes a single-channel speech enhancement method based on a progressive fusion correction network, which uses only the amplitude-spectrum and complex-spectrum features of the time-frequency domain; moreover, its causality cannot be guaranteed, its computational complexity is high and its model has many parameters, making it difficult to deploy in practical terminal systems. The present invention guarantees the causality of the model, has few trainable parameters and can be flexibly applied in a large number of practical scenarios.
CN202210885819.5 describes a single-channel speech enhancement method based on an interactive time-frequency attention mechanism, which uses only the complex-spectrum features of the time-frequency domain and cannot effectively solve the compensation problem between amplitude and phase. In contrast, the present method decouples traditional complex spectrum estimation into a step-by-step optimization of amplitude and phase, thereby alleviating the amplitude-phase compensation problem, avoiding mutual interference between amplitude and phase, and improving speech enhancement performance.
In addition, previous patent applications learn harmonic information from the complex domain or waveform information from the time domain, but do not consider the complex domain, the amplitude domain and the time domain simultaneously; the resulting information loss or amplitude-phase compensation problems limit speech enhancement performance. The present method first performs preliminary denoising on the noisy signal through the amplitude spectrum estimation branch and then adds the spectrum of the preliminarily denoised signal to the residual output by the complex spectrum refinement branch, reconstructing the spectrum of the finally output enhanced speech signal; this strategy effectively improves speech enhancement performance. In the cross-domain information fusion module, multi-scale convolution blocks perform multi-scale feature extraction on the three branches in the complex, amplitude and time domains, enabling more effective amplitude compensation and further improving speech enhancement performance.
Disclosure of Invention
The technical problem solved by the invention: the method overcomes the compensation problem between amplitude and phase caused by traditional complex spectrum estimation, as well as the insufficient use of time-domain waveform information, and provides a single-channel speech enhancement method based on a step-by-step amplitude compensation network.
The invention combines the advantages of the time domain and the time-frequency domain by introducing both time-frequency-domain branches and a time-domain branch into the network, thereby effectively exploiting the harmonic information in the time-frequency spectrum as well as the time-domain waveform information. Through the cross-domain information fusion module, step-by-step amplitude compensation and dynamic information adjustment are performed on the amplitude spectrum estimation branch and the complex spectrum refinement branch at each stage, improving the quality and intelligibility of the speech signal; the enhancement effect is clearly superior to that of common speech enhancement neural networks.
The technical solution of the invention is realized as follows:
In a first aspect, the present invention provides a single-channel speech enhancement method based on a progressive amplitude compensation network, comprising the following steps:
step 1: perform a short-time Fourier transform (STFT) on the noisy speech signal to obtain the complex spectrum, magnitude spectrum and phase of each frame of the noisy speech spectrum;
step 2: the complex spectrum is input into the complex spectrum refinement branch of a three-branch network; the magnitude spectrum is input into the amplitude spectrum estimation branch; after framing, the noisy speech signal is input into the time-domain waveform correction branch;
the amplitude information of the amplitude spectrum estimation branch, the amplitude information of the complex spectrum refinement branch and the time-domain information output by each intermediate layer of the three branches are respectively input into a cross-domain information fusion module;
the cross-domain information fusion module performs feature extraction, fusion and projection on the amplitude of the amplitude spectrum estimation branch, the amplitude of the complex spectrum refinement branch and the time-domain information of the time-domain waveform correction branch, obtains two cross-domain enhancement correction masks for the amplitudes of the two spectral branches, and uses the time-domain information to realize the correction, thereby completing the compensation of the amplitude spectrum estimation branch and the complex spectrum refinement branch;
the cross-domain information fusion module comprises three stages, namely a feature extraction stage, a feature fusion stage and a feature projection stage;
in the feature extraction stage, deep feature extraction is performed on the amplitude information of the amplitude spectrum estimation branch, yielding a feature map for that branch; on the amplitude information of the complex spectrum refinement branch, yielding a feature map for that amplitude information; and on the time-domain information of the time-domain waveform correction branch, yielding a feature map for that branch;
In the feature fusion stage, the feature images of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch are fused to obtain a cross-domain fused feature image;
in a characteristic projection stage, projecting the cross-domain fused characteristic diagram onto the amplitudes of an amplitude spectrum estimation branch and a complex spectrum refinement branch respectively to obtain two cross-domain enhancement correction masks for the amplitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch respectively;
the amplitude information of the amplitude spectrum estimation branch and the complex spectrum refinement branch input into the cross-domain information fusion module are multiplied by the cross-domain enhancement correction mask output by the cross-domain information fusion module of the middle layer respectively, so that the amplitude compensation of the two branches is completed;
introducing a plurality of cross-domain information fusion modules into each intermediate layer of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch, and performing step-by-step amplitude compensation on an input voice signal with noise to form a step-by-step amplitude compensation network;
the final output of the amplitude spectrum estimation branch is used as an estimated ideal ratio mask for the amplitude spectrum, and main noise components are filtered; the final output in the complex spectrum refinement branch is used as the residual error between the primarily denoised voice signal and the enhanced voice signal;
Step 3: multiply the ideal ratio mask output by the amplitude spectrum estimation branch in step 2 element-wise with the magnitude spectrum from step 1, couple the result with the phase from step 1 to form the spectrum of a preliminarily denoised speech signal, add this spectrum to the residual output by the complex spectrum refinement branch in step 2 to reconstruct the spectrum of the finally output enhanced speech signal, and apply the inverse short-time Fourier transform (iSTFT) to this spectrum to obtain the enhanced speech signal.
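The signal flow of steps 1 and 3 (analysis, masking, residual addition, synthesis) can be sketched in NumPy. This is an illustrative sketch only: the window and FFT parameters are assumed, and the two network outputs (the ideal ratio mask and the complex residual) are replaced by trivial placeholders, so the pipeline reduces to identity reconstruction.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Step 1: frame the signal with a Hann window and take the FFT of each frame."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)          # (frames, bins) complex spectrum

def istft(S, n_fft=512, hop=256):
    """Inverse STFT by windowed overlap-add, normalized by the summed window energy."""
    win = np.hanning(n_fft)
    frames = np.fft.irfft(S, n=n_fft, axis=-1) * win
    out = np.zeros(hop * (len(S) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n_fft] += f
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

rng = np.random.default_rng(0)
noisy = rng.standard_normal(4096)

S = stft(noisy)                     # complex spectrum of the noisy signal
mag, phase = np.abs(S), np.angle(S)

# Placeholders for the two network outputs: an ideal-ratio mask in [0, 1]
# and a complex residual spectrum (here: identity mask, zero residual).
irm = np.ones_like(mag)
residual = np.zeros_like(S)

# Step 3: mask the magnitude, couple it with the noisy phase, add the residual.
S_enh = irm * mag * np.exp(1j * phase) + residual
enhanced = istft(S_enh)
```

With the identity mask and zero residual, the interior samples of `enhanced` match the input exactly, which verifies that the STFT/iSTFT pair is a faithful analysis-synthesis frame for the method.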
Further, the amplitude spectrum estimation branch comprises a real-valued convolutional encoder, a real-valued long short-term memory network (LSTM) and a real-valued convolutional decoder. The encoder performs deep feature extraction on the magnitude spectrum of the input noisy speech signal to obtain a feature map carrying deep feature information; the feature map is input into the real-valued LSTM to model temporal dependencies; the reconstructed magnitude spectrum of the enhanced speech signal is then coupled with the phase of the original noisy signal to form a preliminarily enhanced speech signal;
the complex spectrum refinement branch comprises a complex convolutional encoder, a complex LSTM and a complex convolutional decoder. The complex convolutional encoder performs deep feature extraction on the complex spectrum of the input noisy speech signal to obtain a feature map carrying deep feature information; the feature map is input into the complex LSTM to model temporal dependencies; the complex convolutional decoder then recovers the details missing from the complex spectrum of the preliminarily enhanced speech;
the time-domain waveform correction branch comprises a real-valued convolutional encoder, a real-valued LSTM and a real-valued convolutional decoder. The encoder performs deep feature extraction on the input framed time-domain noisy speech waveform to obtain a feature map carrying deep feature information; the feature map is input into the real-valued LSTM to model temporal dependencies and is then decoded by the real-valued convolutional decoder.
Further, the complex convolutional encoder is formed by stacking six convolution blocks, each consisting of a complex convolution layer, a complex batch-normalization layer and a complex parametric ReLU activation function. The complex convolution layer simulates the operation rule of complex multiplication with four real convolution layers: given a complex filter matrix $W = W_r + jW_i$ and a complex input vector $X = X_r + jX_i$, where $W_r$ and $W_i$ are real filter tensors and $X_r$ and $X_i$ are real input tensors used to simulate the complex operation, the output of the complex convolution is

$$F_{out} = (X_r * W_r - X_i * W_i) + j(X_r * W_i + X_i * W_r) \qquad (1)$$
where $F_{out}$ is the output of the complex convolution layer. Similarly, the output $F_{LSTM}$ of the complex LSTM layer is defined as

$$F_{LSTM} = (F_{rr} - F_{ii}) + j(F_{ri} + F_{ir})$$
$$F_{rr} = \mathrm{LSTM}_r(X_r), \quad F_{ii} = \mathrm{LSTM}_i(X_i)$$
$$F_{ri} = \mathrm{LSTM}_i(X_r), \quad F_{ir} = \mathrm{LSTM}_r(X_i) \qquad (2)$$
wherein LSTM represents a traditional LSTM neural network, and subscripts r and i respectively represent a real part and an imaginary part of a corresponding network; the complex convolutional decoder is formed by stacking six deconvolution blocks with corresponding sizes, and residual connection is used between the encoder and the decoder.
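The complex-layer rule of formulas (1) and (2), four real-valued operations simulating one complex-valued one, can be checked numerically. The sketch below is illustrative, not the patent's network: matrix multiplication stands in for the convolution or LSTM operator, and all names are assumptions.

```python
import numpy as np

def complex_op(Xr, Xi, Wr, Wi, op=np.matmul):
    """Simulate a complex-valued layer with four real-valued applications of `op`,
    following F_out = (Xr*Wr - Xi*Wi) + j(Xr*Wi + Xi*Wr)."""
    real = op(Xr, Wr) - op(Xi, Wi)
    imag = op(Xr, Wi) + op(Xi, Wr)
    return real, imag

rng = np.random.default_rng(1)
Xr, Xi = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
Wr, Wi = rng.standard_normal((8, 3)), rng.standard_normal((8, 3))

real, imag = complex_op(Xr, Xi, Wr, Wi)

# Reference: the same computation done natively on complex dtypes.
ref = (Xr + 1j * Xi) @ (Wr + 1j * Wi)
```

Because the identity holds for any linear operator, the same four-real-branch wiring applies unchanged when `op` is a convolution or an LSTM cell, as in formulas (1) and (2).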
Further, the feature-extraction paths of the amplitude spectrum estimation branch and of the complex spectrum refinement branch each consist of a time-frequency-domain multi-scale convolution block: the input passes in parallel through three convolution layers with kernel sizes 3×1, 1×3 and 3×3; their outputs are spliced and fed into a convolution block composed of a 1×1 convolution layer, batch normalization and a Sigmoid activation function. The outputs of the two feature-extraction paths are

$$F_m^i = F_m(\mathrm{concat}(W_1 * M_i + b_1,\; W_2 * M_i + b_2,\; W_3 * M_i + b_3)) \qquad (3)$$
$$F_c^i = F_c(\mathrm{concat}(W_1 * |C_i| + b_1,\; W_2 * |C_i| + b_2,\; W_3 * |C_i| + b_3)) \qquad (4)$$

where $F_m^i$ and $F_c^i$ denote the feature maps obtained by the feature-extraction paths of the amplitude spectrum estimation branch and of the complex spectrum refinement branch, $M_i$ and $|C_i|$ denote the amplitude information received by the $i$-th-layer cross-domain information fusion module from the amplitude spectrum estimation branch and from the complex spectrum refinement branch respectively, $\theta_i$ denotes the phase of the complex spectrum refinement branch, $W_1$, $W_2$ and $W_3$ are the weight matrices of the convolution layers with kernel sizes 3×1, 1×3 and 3×3, $b_1$, $b_2$ and $b_3$ are the corresponding biases, concat denotes splicing along the channel dimension, $F_m$ and $F_c$ are the mapping functions of the convolution blocks of the two branches, and $*$ denotes convolution; the $W_1$, $W_2$, $W_3$ of formulas (3) and (4) differ, as do $b_1$, $b_2$, $b_3$;
the feature-extraction path of the time-domain waveform correction branch consists of a time-domain multi-scale convolution block: the input first passes in parallel through convolution layers with kernel sizes 1×3 and 3×3; their outputs are spliced and fed into a convolution block composed of a 1×1 convolution layer, batch normalization and a Sigmoid activation function; the feature maps obtained by average pooling and maximum pooling of this output are then summed to give the output of the feature-extraction path:

$$X_{out} = F_t(\mathrm{concat}(W_1 * w_i + b_1,\; W_2 * w_i + b_2))$$
$$F_t^i = \mathrm{AvgPool}(X_{out}) + \mathrm{MaxPool}(X_{out})$$

where $X_{out}$ denotes the output of the time-domain multi-scale convolution block, $F_t^i$ denotes the feature map obtained by the feature-extraction path of the time-domain waveform correction branch, $w_i$ denotes the input of the $i$-th-layer cross-domain information fusion module from the time-domain waveform correction branch, $W_1$ and $W_2$ are the convolution-kernel weights and $b_1$ and $b_2$ the biases; $F_t$ is the mapping function of the convolution block of the time-domain waveform correction branch, and AvgPool and MaxPool denote average pooling and maximum pooling respectively;
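The summation of average- and max-pooled views of the same feature map described above can be sketched as follows; the tensor shape and pooling window are assumptions for illustration, not values from the patent.

```python
import numpy as np

def pool(x, k, reduce):
    """Non-overlapping 1-D pooling with window k along the last axis."""
    t = x[..., : (x.shape[-1] // k) * k]          # drop any ragged tail
    return reduce(t.reshape(*t.shape[:-1], -1, k), axis=-1)

rng = np.random.default_rng(2)
x_out = rng.standard_normal((1, 16, 64))          # (batch, channels, time) feature map

# Sum of the average-pooled and max-pooled views of the same feature map,
# mirroring F_t^i = AvgPool(X_out) + MaxPool(X_out).
f_t = pool(x_out, 4, np.mean) + pool(x_out, 4, np.max)
```

Summing the two pooled views lets the path keep both the smoothed energy (average) and the salient peaks (maximum) of the waveform features in one map.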
the feature maps extracted in the feature-extraction stage from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch are sent to the feature-fusion stage: the three feature maps are first multiplied element-wise and then fed into a convolution block consisting of a convolution layer, batch normalization and a Sigmoid activation function, which outputs the cross-domain fused feature map:

$$F_f^i = \sigma(\mathrm{BN}(W * (F_m^i \odot F_c^i \odot F_t^i) + b))$$

where $F_f^i$ denotes the fusion tensor of the $i$-th-layer cross-domain information fusion module, BN denotes batch normalization and $\sigma$ the Sigmoid activation function; $F_m^i$, $F_c^i$ and $F_t^i$ denote the feature maps from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch respectively, $W$ is the convolution-kernel weight and $b$ the bias;
the feature projection stage receives the cross-domain fused feature map Z_i from the feature fusion stage and maps it onto the amplitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch, respectively; each of the two mappings consists of one convolution layer, batch normalization and a Sigmoid activation function, as follows:
Mask_m^i = σ(BN(W_m * Z_i + b_m))
Mask_c^i = σ(BN(W_c * Z_i + b_c))
where Mask_m^i and Mask_c^i denote the cross-domain enhancement correction masks mapped onto the amplitudes of the amplitude spectrum estimation branch and of the complex spectrum refinement branch, respectively, Z_i is the cross-domain fused feature map, W_m and W_c are convolution kernel weights, and b_m and b_c are biases;
the amplitude information of the amplitude spectrum estimation branch and of the complex spectrum refinement branch input into the cross-domain information fusion module is multiplied element-wise by the corresponding cross-domain enhancement correction mask output by the intermediate-layer cross-domain information fusion module, completing the amplitude compensation of the two branches, as follows:
M_i' = M_i ⊙ Mask_m^i
|C_i|' = |C_i| ⊙ Mask_c^i
C_i' = |C_i|' · e^(jθ_i)
where M_i' denotes the output of the amplitude spectrum estimation branch after amplitude compensation, |C_i|' denotes the amplitude output of the complex spectrum refinement branch after amplitude compensation, C_i' denotes the final complex output of the complex spectrum refinement branch after amplitude compensation, θ_i is the phase of the complex spectrum refinement branch, and Mask_m^i and Mask_c^i are the two cross-domain enhancement correction masks.
Further, in the step 3, reconstructing the spectrum of the finally output enhanced speech signal includes:
given the prediction output M̂ of the ideal ratio mask from the amplitude spectrum estimation branch and the residual output Ĉ = Ĉ_r + jĈ_i from the complex spectrum refinement branch, where Ĉ_r and Ĉ_i denote the real and imaginary parts of Ĉ, the final spectrum reconstruction is as follows:
Ŝ = M̂ ⊙ |X| · e^(jθ_X) + (Ĉ_r + jĈ_i)
where Ŝ denotes the complex spectrum of the enhanced speech signal, |X| is the magnitude spectrum of the noisy speech, and θ_X denotes the phase spectrum of the noisy speech signal.
Further, in the step 1, the short-time fourier transform STFT includes:
The noisy speech is resampled so that all audio signals have a sampling rate of 16 kHz; framing is performed with a Hanning window of frame length 30 ms and frame shift 10 ms, and a short-time Fourier transform (STFT) then yields the real part and the imaginary part of each frame of the noisy speech signal spectrum, as follows:
Y(t,f)=S(t,f)+N(t,f) (12)
in the formula ,
Y(t,f)=Y r (t,f)+jY i (t,f)
S(t,f)=S r (t,f)+jS i (t,f)
wherein Y (t, f) represents the noisy speech spectrum after short-time Fourier transform, t represents the time dimension, and f represents the frequency dimension; s (t, f) and N (t, f) represent clean speech and background noise, subscripts r and i represent the real and imaginary parts of the spectrum, respectively, the number of short-time Fourier transform points is 320, and 161 dimensions after transformation correspond to a frequency range from 0 to 8000Hz.
Further, in the step 2, the ideal ratio mask IRM is as follows:
the ideal ratio mask IRM is used as the training target for reconstructing the time-frequency representation of the speech to be enhanced; it is defined as IRM = |S| / |X|, where |X| is the magnitude spectrum of the speech to be enhanced and |S| is the magnitude spectrum of the clean speech signal.
In a second aspect, the present invention provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the single-channel speech enhancement method based on a progressive amplitude compensation network.
In a third aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a single channel speech enhancement method based on a step-wise amplitude compensation network as described above.
Compared with the prior art, the invention has the advantages that:
(1) The invention relates to a single-channel speech enhancement method based on a step-by-step amplitude compensation network, which compensates for the implicit trade-off between amplitude and phase that arises when the real and imaginary parts of speech are reconstructed simultaneously in the complex domain. The invention adopts an encoder-decoder based three-branch structure comprising an amplitude spectrum estimation branch, a complex spectrum refinement branch and a time domain waveform correction branch. On the basis of implicitly estimating phase information, the amplitude spectrum estimation branch filters out the main noise components, the complex spectrum refinement branch restores the missing details, and the time domain waveform correction branch corrects the other two branches. To make full use of the information of the three branches, a cross-domain information fusion module is proposed and embedded into the three branches; it extracts and fuses the features of the three branches step by step and applies correction and amplitude compensation to the information of the amplitude spectrum estimation branch and the complex spectrum refinement branch. The invention can effectively compensate for the implicit trade-off between amplitude and phase, improves the quality and intelligibility of the speech signal, and outperforms the current state-of-the-art cross-domain speech enhancement methods and previous advanced systems.
(2) The invention is based on cross-domain, combines the advantages of time domain and time-frequency domain, introduces the branch based on complex domain, the branch based on amplitude domain and the branch based on time domain into the network, effectively utilizes the harmonic information and the time domain waveform information in the frequency spectrum, can better recover the details of the voice signal, and improves the voice quality and the intelligibility.
(3) The invention decouples the traditional complex spectrum estimation into the step-by-step optimization of the amplitude and the phase, thereby alleviating the compensation problem between the amplitude and the phase, namely alleviating the problem that the accuracy of the amplitude is sacrificed for compensating the phase in the traditional complex spectrum estimation. Thus, the invention can lighten the mutual influence between the amplitude and the phase and improve the performance of voice enhancement.
(4) The invention provides a cross-domain information fusion module which performs step-by-step amplitude compensation on an amplitude spectrum estimation branch and a complex spectrum refinement branch at each stage to generate a hierarchical cross-domain enhancement correction mask, thereby promoting dynamic adjustment of information between the amplitude spectrum estimation branch and the complex spectrum refinement branch, and further improving the performance of voice enhancement by utilizing information from a time domain waveform to perform step-by-step enhancement on the amplitudes of the two branches.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an overall network architecture of the present invention;
FIG. 2 is a specific structure of a cross-domain information fusion module according to the present invention;
FIG. 3 is a specific structure of a time-frequency domain multi-scale convolution block according to the present invention;
fig. 4 is a specific structure of a time domain multi-scale convolution block in the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is evident that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
As shown in fig. 1, the single-channel voice enhancement method based on the step-by-step amplitude compensation network provided by the embodiment of the invention includes the following contents:
Step 1: performing short-time Fourier transform (STFT) on the voice signal with noise to obtain a complex spectrum, a magnitude spectrum and a phase of each frame in a voice signal spectrum with noise;
step 2: the complex spectrum is input into a complex spectrum refining branch in a three-branch network; the amplitude spectrum is input into an amplitude spectrum estimation branch in a three-branch network; inputting the voice signal with noise into a time domain waveform correction branch in a three-branch network after framing;
the amplitude of the amplitude spectrum branch, the amplitude of the complex spectrum refining branch and the time domain information output by each middle layer of the amplitude spectrum estimation branch, the complex spectrum refining branch and the time domain waveform correction branch are respectively input into a cross-domain information fusion module;
the cross-domain information fusion module respectively performs feature extraction, fusion and projection on the amplitude of the amplitude spectrum branch, the amplitude of the complex spectrum refinement branch and the time domain information of the time domain waveform correction branch to obtain two cross-domain enhanced correction masks aiming at the amplitude of the amplitude spectrum branch and the complex spectrum refinement branch, and realizes correction by using the time domain information to complete the compensation of the amplitude spectrum branch and the complex spectrum refinement branch;
the cross-domain information fusion module comprises three stages, namely a feature extraction stage, a feature fusion stage and a feature projection stage;
In the characteristic extraction stage, deep characteristic extraction is carried out on amplitude information of an amplitude spectrum estimation branch, and a characteristic diagram aiming at the amplitude spectrum estimation branch is obtained; deep feature extraction is carried out on the amplitude information of the complex spectrum refinement branch, so that a feature map aiming at the amplitude information of the complex spectrum refinement branch is obtained; deep feature extraction is carried out on time domain information of the time domain waveform branch, and a feature map aiming at the time domain waveform correction branch is obtained;
in the feature fusion stage, the feature images of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch are fused to obtain a cross-domain fused feature image;
in a characteristic projection stage, projecting the cross-domain fused characteristic diagram onto the amplitudes of an amplitude spectrum estimation branch and a complex spectrum refinement branch respectively to obtain two cross-domain enhancement correction masks for the amplitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch respectively;
the amplitude information of the amplitude spectrum estimation branch and the complex spectrum refinement branch input into the cross-domain information fusion module are multiplied by the cross-domain enhancement correction mask output by the cross-domain information fusion module of the middle layer respectively, so that the amplitude compensation of the two branches is completed;
Introducing a plurality of cross-domain information fusion modules into each intermediate layer of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch, and performing step-by-step amplitude compensation on an input voice signal with noise to form a step-by-step amplitude compensation network;
the final output of the amplitude spectrum estimation branch is used as an estimated ideal ratio mask for the amplitude spectrum, and main noise components are filtered; the final output in the complex spectrum refinement branch is used as the residual error between the primarily denoised voice signal and the enhanced voice signal;
step 3: and (3) multiplying the ideal ratio mask outputted by the amplitude spectrum estimation branch in the step (2) with the amplitude spectrum point in the step (1), coupling with the phase in the step (1) to form a preliminary denoising voice signal, adding the spectrum of the preliminary denoising voice signal with the residual outputted by the complex spectrum refinement branch in the step (2), reconstructing the spectrum of the finally outputted enhanced voice signal, and performing short-time Fourier inverse transform (iSTFT) on the spectrum of the enhanced voice signal to obtain the enhanced voice signal.
The amplitude spectrum estimation branch includes: a real convolutional encoder, a real long-short time memory network LSTM and a real convolutional decoder; the real convolutional encoder performs deep feature extraction on the amplitude spectrum of the input noisy speech signal to obtain a feature map carrying deep feature information; the feature map is input into the real long-short time memory network LSTM to model the time dependence, and the amplitude spectrum of the enhanced speech signal is then combined with the phase of the original noisy speech signal to reconstruct a preliminarily enhanced speech signal;
The complex spectrum refinement branch circuit comprises: a complex convolutional encoder, a complex long-short time memory network LSTM and a complex convolutional decoder; the complex convolution encoder carries out depth feature extraction on the complex spectrum of the input noise voice signal to obtain a feature map with depth feature information, the feature map is input into a complex long-short time memory network LSTM, the time dependence relationship is modeled, and the details of the loss of the complex spectrum of the voice after preliminary enhancement are recovered through the complex convolution decoder;
the time domain waveform correction branch circuit includes: the system comprises a real number convolution encoder, a real number long-short time memory network LSTM and a real number convolution decoder; the method comprises the steps that a real number convolution encoder carries out depth feature extraction on an input framing time domain noisy speech waveform to obtain a feature map with depth feature information, the feature map is input into a real number long-short time memory network LSTM, modeling is carried out on a time dependence relation, and then decoding is carried out through a real number convolution decoder;
the complex convolution encoder is formed by stacking six convolution blocks, wherein each convolution block consists of a complex convolution layer, a complex batch normalization layer and a complex parametric ReLU activation function; plural number type The convolution layer is obtained by simulating four convolution layers according to the operation rule of complex multiplication, and a complex form filter matrix W=W is set r +jW i Complex form input vector x=x r +jX i, wherein ,Wr and Wi Is a real tensor filter matrix, X r and Xi Is a real input tensor, the real part is used to simulate complex operation, and then the output of complex convolution operation is expressed as:
F_out = (X_r*W_r - X_i*W_i) + j(X_r*W_i + X_i*W_r) (1)
where F_out is the output of the complex convolution layer; similarly, the output F_LSTM of the complex LSTM layer is defined as:
F_LSTM = (F_rr - F_ii) + j(F_ri + F_ir)
F_rr = LSTM_r(X_r), F_ii = LSTM_i(X_i)
F_ri = LSTM_i(X_r), F_ir = LSTM_r(X_i) (2)
wherein LSTM represents a traditional LSTM neural network, and subscripts r and i respectively represent a real part and an imaginary part of a corresponding network; the complex convolutional decoder is formed by stacking six deconvolution blocks with corresponding sizes, and residual connection is used between the encoder and the decoder.
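A minimal sketch (not part of the patent) of the complex-multiplication rule behind formulas (1) and (2): four real-valued products combine into one complex product. Scalars stand in for the real filter and input tensors, so this only checks the arithmetic, not an actual convolution.

```python
# Verify that the real-valued decomposition of formula (1),
# F_out = (X_r*W_r - X_i*W_i) + j(X_r*W_i + X_i*W_r),
# reproduces ordinary complex multiplication.

def complex_conv_step(x_r, x_i, w_r, w_i):
    """One complex 'convolution' product simulated with four real products."""
    out_r = x_r * w_r - x_i * w_i   # real part of (X_r + jX_i)(W_r + jW_i)
    out_i = x_r * w_i + x_i * w_r   # imaginary part
    return out_r, out_i

x_r, x_i = 0.5, -1.25   # illustrative real/imaginary input entries
w_r, w_i = 2.0, 0.75    # illustrative real/imaginary filter entries

out_r, out_i = complex_conv_step(x_r, x_i, w_r, w_i)
reference = complex(x_r, x_i) * complex(w_r, w_i)
print(out_r, out_i)   # matches reference.real and reference.imag
```

The same decomposition is what formula (2) applies per LSTM sub-network: two real LSTMs each process the real and imaginary inputs, and their outputs are combined with the same sign pattern.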
As shown in fig. 2, the above-mentioned cross-domain information fusion module includes: a feature extraction path for the amplitude spectrum estimation branch, a feature extraction path for the complex spectrum refinement branch, and a feature extraction path for the time domain waveform correction branch.
The characteristic extraction paths of the amplitude spectrum estimation branch and the complex spectrum refinement branch are composed of a time-frequency domain multi-scale convolution block, as shown in fig. 3, the time-frequency domain multi-scale convolution block is firstly subjected to three convolution layers with convolution kernel sizes of 3*1, 1*3 and 3*3 respectively, after the output of the three convolution layers are spliced, the three convolution layers are sent into a convolution block, and the convolution block is composed of a convolution layer with the convolution kernel size of 1*1, batch normalization and Sigmoid activation functions; the outputs of the feature extraction path of the magnitude spectrum estimation branch and the feature extraction path of the complex spectrum refinement branch are expressed as:
wherein , and />Feature graphs obtained by respectively representing feature extraction paths of amplitude spectrum estimation branches and feature extraction paths of complex spectrum refinement branches, M i and |Ci I respectively represents the amplitude information from the amplitude spectrum estimation branch and the amplitude from the complex spectrum refinement branch of the ith layer cross-domain information fusion module, and theta i Representing the phase, W, of a complex spectrum refinement branch 1 、W 2 and W3 Weight matrix representing convolution layers of convolution kernel sizes 3*1, 1*3 and 3*3, b, respectively 1 、b 2 and b3 Respectively represent the deviation, the concat function represents the splicing in the channel dimension, F m and Fc For each mapping function of the convolution blocks of the two branches, representing convolution operations; wherein W of formulas (3) and (4) 1 、W 2 、W 3 Different, the same as b 1 、b 2 、b 3 Different;
the feature extraction path of the time domain waveform correction branch consists of a time domain multi-scale convolution block, as shown in fig. 4; the time domain multi-scale convolution block first passes the input through two convolution layers with kernel sizes 1*3 and 3*3, respectively, the outputs of the two convolution layers are concatenated and fed into a convolution block consisting of a convolution layer with kernel size 1*1, batch normalization and a Sigmoid activation function; the feature maps obtained by applying average pooling and maximum pooling to this output are then summed to give the output of the feature extraction path of the time domain waveform correction branch, as follows:
X_out = F_t(concat(W_1*w_i + b_1, W_2*w_i + b_2))
G_t^i = AvgPool(X_out) + MaxPool(X_out)
where X_out denotes the output of the time domain multi-scale convolution block, G_t^i denotes the feature map obtained by the feature extraction path of the time domain waveform correction branch, w_i denotes the input to the i-th layer cross-domain information fusion module from the time domain waveform correction branch, W_1 and W_2 are the weights of the convolution kernels, and b_1 and b_2 are the biases; F_t is the mapping function of the convolution block of the time domain waveform correction branch, and AvgPool and MaxPool denote average pooling and maximum pooling, respectively;
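An illustrative sketch (not the patent's exact layers or weights) of the time domain multi-scale convolution block of fig. 4: parallel convolutions over the waveform, channel concatenation, a pointwise convolution block with Sigmoid, and a summed average/max pooling output. Plain Python lists stand in for tensors, and the kernels and pooling-over-time are simplifying assumptions.

```python
# Toy version of the time domain multi-scale convolution block:
# two parallel convolutions -> concat -> 1*1 conv + Sigmoid -> AvgPool + MaxPool.
import math

def conv1d(x, kernel, bias):
    """'Same'-padded 1-D convolution of a list with a short kernel."""
    pad = len(kernel) // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * xp[n + j] for j, k in enumerate(kernel)) + bias
            for n in range(len(x))]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def time_multiscale_block(w_in):
    # two parallel convolutions (standing in for the 1*3 and 3*3 kernels)
    y1 = conv1d(w_in, [0.25, 0.5, 0.25], 0.0)
    y2 = conv1d(w_in, [0.1, 0.2, 0.4, 0.2, 0.1], 0.0)
    # channel concat collapsed into a pointwise (1*1) combination + Sigmoid
    x_out = [sigmoid(0.5 * a + 0.5 * b) for a, b in zip(y1, y2)]
    # feature output = AvgPool(X_out) + MaxPool(X_out), pooled over time
    return sum(x_out) / len(x_out) + max(x_out)

feat = time_multiscale_block([0.0, 1.0, -1.0, 0.5, 0.25, -0.5])
print(0.0 < feat < 2.0)   # Sigmoid keeps each activation in (0, 1)
```

Because every activation passes through the Sigmoid, the pooled sum is bounded in (0, 2), which is what the final print checks.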
the feature maps extracted in the feature extraction stage from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch are sent to the feature fusion stage: the three feature maps are first multiplied element-wise and then fed into a convolution block consisting of a convolution layer, batch normalization and a Sigmoid activation function, which finally outputs the cross-domain fused feature map, as follows:
Z_i = σ(BN(W*(G_m^i ⊙ G_c^i ⊙ G_t^i) + b))
where Z_i denotes the fusion tensor of the i-th layer cross-domain information fusion module, BN denotes batch normalization, σ denotes the Sigmoid activation function, and ⊙ denotes element-wise multiplication; G_m^i, G_c^i and G_t^i denote the feature maps from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch, respectively; W is the weight of the convolution kernel and b is the bias;
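A sketch of the feature fusion stage with illustrative parameters: element-wise product of the three branch feature maps, then a convolution (reduced here to a pointwise weight `w` and bias `b`), batch-norm-style affine scaling, and a Sigmoid. The parameter values are placeholders, not trained weights.

```python
# Toy fusion: sigmoid(BN(conv(G_m * G_c * G_t))) per element.
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def fuse(g_m, g_c, g_t, w=1.5, b=-0.25, bn_gamma=1.0, bn_beta=0.0):
    prod = [m * c * t for m, c, t in zip(g_m, g_c, g_t)]  # element-wise product
    return [sigmoid(bn_gamma * (w * p + b) + bn_beta) for p in prod]

g_m = [0.9, 0.2, 0.7]   # feature map from the amplitude spectrum estimation branch
g_c = [0.8, 0.4, 0.6]   # feature map from the complex spectrum refinement branch
g_t = [0.5, 0.9, 0.3]   # feature map from the time domain waveform correction branch

z = fuse(g_m, g_c, g_t)
print(all(0.0 < v < 1.0 for v in z))   # Sigmoid bounds the fused map
```

The element-wise product means a branch with a near-zero response suppresses the fused activation for that element, which is the gating behavior the module relies on.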
the feature projection stage receives the cross-domain fused feature map Z_i from the feature fusion stage and maps it onto the amplitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch, respectively; each of the two mappings consists of one convolution layer, batch normalization and a Sigmoid activation function, as follows:
Mask_m^i = σ(BN(W_m * Z_i + b_m))
Mask_c^i = σ(BN(W_c * Z_i + b_c))
where Mask_m^i and Mask_c^i denote the cross-domain enhancement correction masks mapped onto the amplitudes of the amplitude spectrum estimation branch and of the complex spectrum refinement branch, respectively, Z_i is the cross-domain fused feature map, W_m and W_c are convolution kernel weights, and b_m and b_c are biases;
the amplitude information of the amplitude spectrum estimation branch and of the complex spectrum refinement branch input into the cross-domain information fusion module is multiplied element-wise by the corresponding cross-domain enhancement correction mask output by the intermediate-layer cross-domain information fusion module, completing the amplitude compensation of the two branches, as follows:
M_i' = M_i ⊙ Mask_m^i
|C_i|' = |C_i| ⊙ Mask_c^i
C_i' = |C_i|' · e^(jθ_i)
where M_i' denotes the output of the amplitude spectrum estimation branch after amplitude compensation, |C_i|' denotes the amplitude output of the complex spectrum refinement branch after amplitude compensation, C_i' denotes the final complex output of the complex spectrum refinement branch after amplitude compensation, θ_i is the phase of the complex spectrum refinement branch, and Mask_m^i and Mask_c^i are the two cross-domain enhancement correction masks.
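A sketch of applying the two cross-domain enhancement correction masks: the amplitude of each branch is multiplied element-wise by its mask, and the complex branch is recombined with its unchanged phase. All numeric values are illustrative, not outputs of the network.

```python
# Toy amplitude compensation for two time-frequency bins.
import cmath

mask_m = [0.9, 0.6]             # mask for the amplitude spectrum branch
mask_c = [0.8, 0.7]             # mask for the complex spectrum branch
m_amp  = [1.0, 2.0]             # amplitude entering the fusion module (M_i)
c_amp  = [1.5, 0.5]             # amplitude of the complex branch (|C_i|)
c_phase = [0.0, cmath.pi / 2]   # phase of the complex branch (theta_i)

m_out = [a * k for a, k in zip(m_amp, mask_m)]       # compensated magnitude branch
c_out_amp = [a * k for a, k in zip(c_amp, mask_c)]   # compensated |C_i|
c_out = [a * cmath.exp(1j * th)                      # recombine with the phase
         for a, th in zip(c_out_amp, c_phase)]
print(m_out, [abs(c) for c in c_out])
```

Note that only the magnitudes are modified; the phase of the complex branch passes through untouched, which is what lets the module compensate amplitude without disturbing the implicitly estimated phase.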
The reconstructing the spectrum of the finally output enhanced speech signal includes:
given the prediction output M̂ of the ideal ratio mask from the amplitude spectrum estimation branch and the residual output Ĉ = Ĉ_r + jĈ_i from the complex spectrum refinement branch, where Ĉ_r and Ĉ_i denote the real and imaginary parts of Ĉ, the final spectrum reconstruction is as follows:
Ŝ = M̂ ⊙ |X| · e^(jθ_X) + (Ĉ_r + jĈ_i)
where Ŝ denotes the complex spectrum of the enhanced speech signal, |X| is the magnitude spectrum of the noisy speech, and θ_X denotes the phase spectrum of the noisy speech signal.
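A sketch of the final reconstruction in step 3 for a single time-frequency bin: the estimated ratio mask scales the noisy magnitude, the noisy phase is reattached, and the complex residual from the refinement branch is added. The bin values are illustrative.

```python
# S_hat = mask * |X| * e^{j*theta_X} + (res_r + j*res_i), one bin.
import cmath

def reconstruct_bin(mask, mag_noisy, phase_noisy, res_r, res_i):
    """Combine the masked noisy magnitude, noisy phase, and complex residual."""
    return mask * mag_noisy * cmath.exp(1j * phase_noisy) + complex(res_r, res_i)

# illustrative values for a single bin of the noisy spectrum
s_hat = reconstruct_bin(mask=0.75, mag_noisy=2.0, phase_noisy=0.0,
                        res_r=0.1, res_i=-0.05)
print(s_hat)   # (1.6-0.05j)
```

In the full method this is applied to every bin of the spectrogram and followed by the inverse STFT to recover the enhanced waveform.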
In the above step 1, the short-time fourier transform STFT includes:
the noisy speech is resampled so that all audio signals have a sampling rate of 16 kHz; framing is performed with a Hanning window of frame length 30 ms and frame shift 10 ms, and a short-time Fourier transform (STFT) then yields the real part and the imaginary part of each frame of the noisy speech signal spectrum, as follows:
Y(t,f)=S(t,f)+N(t,f) (12)
in the formula ,
Y(t,f)=Y r (t,f)+jY i (t,f)
S(t,f)=S r (t,f)+jS i (t,f)
wherein Y (t, f) represents the noisy speech spectrum after short-time Fourier transform, t represents the time dimension, and f represents the frequency dimension; s (t, f) and N (t, f) represent clean speech and background noise, subscripts r and i represent the real and imaginary parts of the spectrum, respectively, the number of short-time Fourier transform points is 320, and 161 dimensions after transformation correspond to a frequency range from 0 to 8000Hz.
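A sketch of the front end described above: 16 kHz audio, a 10 ms frame shift (160 samples), a Hanning window, and a 320-point transform giving 161 one-sided frequency bins spanning 0 to 8000 Hz. For simplicity the frame here is exactly 320 samples (the text's 30 ms frame would be 480 samples and be truncated or zero-padded to the transform length), and a plain DFT stands in for the FFT used in practice.

```python
# One STFT frame of a 1 kHz tone: expect 161 bins and a peak at bin 20.
import cmath, math

SR, N_FFT, HOP = 16000, 320, 160   # 10 ms shift = 160 samples at 16 kHz

def hann(n):
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def stft_frame(frame):
    """Windowed DFT of one frame; keep bins 0..N_FFT//2 (one-sided spectrum)."""
    win = hann(len(frame))
    x = [s * w for s, w in zip(frame, win)]
    return [sum(x[n] * cmath.exp(-2j * math.pi * f * n / N_FFT)
                for n in range(N_FFT)) for f in range(N_FFT // 2 + 1)]

tone = [math.sin(2 * math.pi * 1000 * n / SR) for n in range(N_FFT)]  # 1 kHz tone
spec = stft_frame(tone)
n_bins = len(spec)
peak_bin = max(range(n_bins), key=lambda f: abs(spec[f]))
print(n_bins, peak_bin * SR / N_FFT)   # 161 bins; peak at 1000.0 Hz
```

Each complex bin of `spec` is exactly the Y_r(t, f) + jY_i(t, f) decomposition used in the formulas above.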
In the above step 2, the ideal ratio mask IRM includes:
the ideal ratio mask IRM is used as the training target for reconstructing the time-frequency representation of the speech to be enhanced; it is defined as IRM = |S| / |X|, where |X| is the magnitude spectrum of the speech to be enhanced and |S| is the magnitude spectrum of the clean speech signal.
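A hedged sketch of a magnitude-ratio mask in the form implied above, IRM = |S| / |X| per time-frequency bin; the clipping to [0, 1] and the small floor `eps` are assumptions for numerical safety, and exact IRM formulations vary between systems.

```python
# Magnitude-ratio mask per bin, clipped to [0, 1].
def ideal_ratio_mask(clean_mag, noisy_mag, eps=1e-8):
    return [min(1.0, s / max(x, eps)) for s, x in zip(clean_mag, noisy_mag)]

clean = [0.0, 1.0, 2.0]   # |S|: magnitude spectrum of the clean speech
noisy = [0.5, 2.0, 2.0]   # |X|: magnitude spectrum of the noisy speech
print(ideal_ratio_mask(clean, noisy))   # [0.0, 0.5, 1.0]
```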
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smart phone, etc.) comprising a memory storing a computer program configured to be executed by the processor, and a processor, the computer program comprising instructions for performing the steps in the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be as defined in the claims.
Claims (9)
1. A single-channel speech enhancement method based on a step-by-step amplitude compensation network, comprising:
step 1: performing short-time Fourier transform (STFT) on the voice signal with noise to obtain a complex spectrum, a magnitude spectrum and a phase of each frame in a voice signal spectrum with noise;
step 2: the complex spectrum is input into a complex spectrum refining branch in a three-branch network; the amplitude spectrum is input into an amplitude spectrum estimation branch in a three-branch network; inputting the voice signal with noise into a time domain waveform correction branch in a three-branch network after framing;
the amplitude of the amplitude spectrum branch, the amplitude of the complex spectrum refining branch and the time domain information output by each middle layer of the amplitude spectrum estimation branch, the complex spectrum refining branch and the time domain waveform correction branch are respectively input into a cross-domain information fusion module;
The cross-domain information fusion module respectively performs feature extraction, fusion and projection on the amplitude of the amplitude spectrum branch, the amplitude of the complex spectrum refinement branch and the time domain information of the time domain waveform correction branch to obtain two cross-domain enhanced correction masks aiming at the amplitude of the amplitude spectrum branch and the complex spectrum refinement branch, and realizes correction by using the time domain information to complete the compensation of the amplitude spectrum branch and the complex spectrum refinement branch;
the cross-domain information fusion module comprises three stages, namely a feature extraction stage, a feature fusion stage and a feature projection stage;
in the characteristic extraction stage, deep characteristic extraction is carried out on amplitude information of an amplitude spectrum estimation branch, and a characteristic diagram aiming at the amplitude spectrum estimation branch is obtained; deep feature extraction is carried out on the amplitude information of the complex spectrum refinement branch, so that a feature map aiming at the amplitude information of the complex spectrum refinement branch is obtained; deep feature extraction is carried out on time domain information of the time domain waveform branch, and a feature map aiming at the time domain waveform correction branch is obtained;
in the feature fusion stage, the feature images of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch are fused to obtain a cross-domain fused feature image;
In a characteristic projection stage, projecting the cross-domain fused characteristic diagram onto the amplitudes of an amplitude spectrum estimation branch and a complex spectrum refinement branch respectively to obtain two cross-domain enhancement correction masks for the amplitudes of the amplitude spectrum estimation branch and the complex spectrum refinement branch respectively;
the amplitude information of the amplitude spectrum estimation branch and the complex spectrum refinement branch input into the cross-domain information fusion module are multiplied by the cross-domain enhancement correction mask output by the cross-domain information fusion module of the middle layer respectively, so that the amplitude compensation of the two branches is completed;
introducing a cross-domain information fusion module into each intermediate layer of the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time domain waveform correction branch, and performing step-by-step amplitude compensation on an input voice signal with noise to form a step-by-step amplitude compensation network;
the final output of the amplitude spectrum estimation branch is used as an estimated ideal ratio mask for the amplitude spectrum, and main noise components are filtered; the final output in the complex spectrum refinement branch is used as the residual error between the primarily denoised voice signal and the enhanced voice signal;
step 3: and (3) multiplying the ideal ratio mask outputted by the amplitude spectrum estimation branch in the step (2) with the amplitude spectrum point in the step (1), coupling with the phase in the step (1) to form a preliminary denoising voice signal, adding the spectrum of the preliminary denoising voice signal with the residual outputted by the complex spectrum refinement branch in the step (2), reconstructing the spectrum of the finally outputted enhanced voice signal, and performing short-time Fourier inverse transform (iSTFT) on the spectrum of the enhanced voice signal to obtain the enhanced voice signal.
2. The single-channel speech enhancement method based on a progressive amplitude compensation network of claim 1, wherein: the amplitude spectrum estimation branch comprises a real convolutional encoder, a real long-short time memory network LSTM and a real convolutional decoder; the real convolutional encoder performs deep feature extraction on the amplitude spectrum of the input noisy speech signal to obtain a feature map carrying deep feature information; the feature map is input into the real long-short time memory network LSTM to model the time dependence, and the amplitude spectrum of the enhanced speech signal is then combined with the phase of the original noisy speech signal to reconstruct a preliminarily enhanced speech signal;
the complex spectrum refinement branch comprises a complex convolution encoder, a complex long-short time memory network LSTM and a complex convolution decoder; the complex convolution encoder carries out depth feature extraction on the complex spectrum of the input noise voice signal to obtain a feature map with depth feature information, the feature map is input into a complex long-short time memory network LSTM, the time dependence relationship is modeled, and the details of the loss of the complex spectrum of the voice after preliminary enhancement are recovered through the complex convolution decoder;
the time-domain waveform correction branch comprises a real-valued convolutional encoder, a real-valued LSTM network and a real-valued convolutional decoder; the real-valued convolutional encoder performs deep feature extraction on the input framed time-domain noisy speech waveform to obtain a feature map carrying deep feature information, the feature map is fed into the real-valued LSTM network to model temporal dependencies, and the result is then decoded by the real-valued convolutional decoder.
3. The single-channel speech enhancement method based on a progressive amplitude compensation network of claim 2, wherein: the complex convolutional encoder is formed by stacking six convolution blocks, each consisting of a complex convolution layer, a complex batch normalization layer and a complex parametric ReLU activation function; the complex convolution layer simulates complex multiplication with four real convolution layers: given a complex filter matrix W = W_r + jW_i and a complex input vector X = X_r + jX_i, where W_r and W_i are real filter tensors and X_r and X_i are real input tensors, the complex operation is simulated with real parts, and the output of the complex convolution operation is expressed as:
F_out = (X_r * W_r - X_i * W_i) + j(X_r * W_i + X_i * W_r)   (1)
where F_out is the output of the complex convolution layer; similarly, the output F_LSTM of the complex LSTM layer is defined as:
F_LSTM = (F_rr - F_ii) + j(F_ri + F_ir)

F_rr = LSTM_r(X_r), F_ii = LSTM_i(X_i)

F_ri = LSTM_i(X_r), F_ir = LSTM_r(X_i)   (2)
where LSTM denotes an LSTM neural network, and the subscripts r and i denote the real-part and imaginary-part sub-networks, respectively; the complex convolutional decoder is formed by stacking six deconvolution blocks of corresponding sizes, and residual connections are used between the encoder and the decoder.
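As an illustration, the four-real-convolution decomposition of formula (1) can be sketched with 1-D numpy convolutions and checked against a direct complex convolution (a minimal sketch with hypothetical array sizes, not the patent's 2-D network layers):

```python
import numpy as np

def complex_conv1d(x_r, x_i, w_r, w_i):
    # Eq. (1): F_out = (X_r*W_r - X_i*W_i) + j(X_r*W_i + X_i*W_r),
    # built from four real-valued convolutions
    real = np.convolve(x_r, w_r) - np.convolve(x_i, w_i)
    imag = np.convolve(x_r, w_i) + np.convolve(x_i, w_r)
    return real + 1j * imag

rng = np.random.default_rng(0)
x_r, x_i = rng.standard_normal(16), rng.standard_normal(16)  # input real/imag parts
w_r, w_i = rng.standard_normal(3), rng.standard_normal(3)    # filter real/imag parts

out = complex_conv1d(x_r, x_i, w_r, w_i)
ref = np.convolve(x_r + 1j * x_i, w_r + 1j * w_i)  # direct complex convolution
assert np.allclose(out, ref)
```

The same decomposition underlies the complex LSTM of formula (2): each complex layer is realized as two real-valued sub-networks combined by the rules of complex multiplication.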
4. The single-channel speech enhancement method based on a progressive amplitude compensation network of claim 1, wherein: the feature extraction paths of the amplitude spectrum estimation branch and the complex spectrum refinement branch each consist of a time-frequency-domain multi-scale convolution block; the time-frequency-domain multi-scale convolution block passes the input through three convolution layers with kernel sizes 3×1, 1×3 and 3×3, respectively, concatenates their outputs and feeds the result into a convolution block consisting of a convolution layer with kernel size 1×1, batch normalization and a Sigmoid activation function; the outputs of the feature extraction paths of the amplitude spectrum estimation branch and the complex spectrum refinement branch are expressed as:

X_m^i = F_m(concat(W_1 * M_i + b_1, W_2 * M_i + b_2, W_3 * M_i + b_3))   (3)

X_c^i = F_c(concat(W_1 * |C_i| + b_1, W_2 * |C_i| + b_2, W_3 * |C_i| + b_3))   (4)

where X_m^i and X_c^i denote the feature maps obtained by the feature extraction paths of the amplitude spectrum estimation branch and the complex spectrum refinement branch, respectively; M_i and |C_i| denote the amplitude from the amplitude spectrum estimation branch and the amplitude from the complex spectrum refinement branch input to the i-th cross-domain information fusion module, and θ_i denotes the phase of the complex spectrum refinement branch; W_1, W_2 and W_3 denote the weight matrices of the convolution layers with kernel sizes 3×1, 1×3 and 3×3, and b_1, b_2 and b_3 the corresponding biases; the concat function denotes concatenation along the channel dimension; F_m and F_c are the mapping functions of the convolution blocks of the two branches, each denoting a convolution operation; W_1, W_2 and W_3 in formulas (3) and (4) are not shared between the two formulas, and likewise b_1, b_2 and b_3;
the characteristic extraction path of the time domain waveform correction branch consists of a time domain multi-scale convolution block, wherein the time domain multi-scale convolution block firstly passes through three convolution layers with convolution kernel sizes of 1*3 and 3*3 respectively, and after the output of the three convolution layers are spliced, the three convolution layers are fed into a convolution block, and the convolution block consists of a convolution layer with a convolution kernel size of 1*1, batch normalization and Sigmoid activation functions; and outputting a characteristic diagram obtained by respectively carrying out average pooling and maximum pooling on the output, and summing the output to be used as a characteristic extraction path of a time domain waveform correction branch, wherein the characteristic diagram is as follows:
X_out = F_t(concat(W_1 * w_i + b_1, W_2 * w_i + b_2))

X_t^i = AvgPool(X_out) + MaxPool(X_out)

where X_out denotes the output of the time-domain multi-scale convolution block, X_t^i denotes the feature map obtained by the feature extraction path of the time-domain waveform correction branch, w_i denotes the input of the i-th cross-domain information fusion module from the time-domain waveform correction branch, W_1 and W_2 are the weights of the convolution kernels, and b_1 and b_2 are the biases; F_t is the mapping function of the convolution block of the time-domain waveform correction branch, and AvgPool and MaxPool denote average pooling and maximum pooling, respectively;
the feature maps extracted in the feature extraction stage from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch are sent to the feature fusion stage: the three feature maps are first multiplied element-wise and then fed into a convolution block consisting of a convolution layer, batch normalization and a Sigmoid activation function, which finally outputs the cross-domain fused feature map, as follows:

X_f^i = σ(BN(W * (X_m^i ⊙ X_c^i ⊙ X_t^i) + b))

where X_f^i denotes the fusion tensor of the i-th cross-domain information fusion module, BN denotes batch normalization, σ denotes the Sigmoid activation function, and ⊙ denotes element-wise multiplication; X_m^i, X_c^i and X_t^i denote the feature maps from the amplitude spectrum estimation branch, the complex spectrum refinement branch and the time-domain waveform correction branch, respectively; W is the weight of the convolution kernel and b is the bias;
the feature mapping stage receives the cross-domain fused feature map from the feature fusion stage and maps it to the amplitude of the amplitude spectrum estimation branch and of the complex spectrum refinement branch, respectively; the mapping to the amplitude spectrum estimation branch and the mapping to the complex spectrum refinement branch each consist of one convolution layer, batch normalization and a Sigmoid activation function, as follows:
G_m^i = σ(BN(W_m * X_f^i + b_m))

G_c^i = σ(BN(W_c * X_f^i + b_c))

where G_m^i and G_c^i denote the cross-domain enhancement correction masks mapped to the amplitude spectrum estimation branch and the complex spectrum refinement branch, respectively, and X_f^i is the cross-domain fused feature map;
the amplitude information of the amplitude spectrum estimation branch and of the complex spectrum refinement branch input to the cross-domain information fusion module is multiplied by the cross-domain enhancement correction masks output by the cross-domain information fusion module of the intermediate layer, completing the amplitude compensation of the two branches, as follows:
M̂_i = M_i ⊙ G_m^i

|Ĉ_i| = |C_i| ⊙ G_c^i

Ĉ_i = |Ĉ_i| e^{jθ_i}

where M̂_i denotes the output of the amplitude spectrum estimation branch after amplitude compensation, |Ĉ_i| denotes the amplitude output of the complex spectrum refinement branch after amplitude compensation, and Ĉ_i denotes the final complex output of the complex spectrum refinement branch after amplitude compensation; M_i and |C_i| are the amplitudes input to the i-th cross-domain information fusion module, G_m^i and G_c^i are the cross-domain enhancement correction masks, and θ_i is the phase of the complex spectrum refinement branch.
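The fuse-then-mask flow of the cross-domain information fusion module can be sketched in numpy as follows; a bare sigmoid stands in for the convolution and batch-normalization layers, and all shapes and variable names are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# hypothetical feature maps from the three branches (channels x time x freq)
X_m = rng.random((8, 10, 161))   # amplitude spectrum estimation branch
X_c = rng.random((8, 10, 161))   # complex spectrum refinement branch
X_t = rng.random((8, 10, 161))   # time-domain waveform correction branch

# feature fusion: element-wise product, then (here) a sigmoid in place of
# the 1x1 convolution + batch normalization of the patent's conv block
X_f = sigmoid(X_m * X_c * X_t)

# feature mapping: one correction mask in (0, 1) per branch (conv layers omitted)
mask_m = sigmoid(X_f)
mask_c = sigmoid(X_f)

# amplitude compensation: element-wise multiplication with the branch amplitude
M_i = rng.random((8, 10, 161))   # amplitude input of the i-th module
M_hat = M_i * mask_m             # compensated amplitude, never exceeds M_i
assert M_hat.shape == M_i.shape and (M_hat <= M_i).all()
```

Because the mask lies in (0, 1), the compensation step can only attenuate the branch amplitudes element-wise, which matches the multiplicative correction described above.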
5. The single-channel speech enhancement method based on a progressive amplitude compensation network of claim 1, wherein in the step 3, reconstructing the spectrum of the finally output enhanced speech signal comprises:

given the ideal-ratio-mask prediction M̂ output by the amplitude spectrum estimation branch and the residual R = R_r + jR_i output by the complex spectrum refinement branch, where R_r and R_i denote the real and imaginary parts of the residual, the final spectrum reconstruction is:

Ŝ = M̂ ⊙ |X| e^{jθ_X} + R

where Ŝ denotes the complex spectrum of the enhanced speech signal, |X| is the amplitude spectrum of the speech to be enhanced (the noisy speech), and θ_X = ∠X denotes the phase spectrum of the noisy speech signal.
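The reconstruction step can be sketched in numpy on stand-in spectra; all shapes, names and values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
T, F = 10, 161                                    # frames x frequency bins
mag_X = rng.random((T, F)) + 0.1                  # noisy magnitude spectrum |X|
theta_X = rng.uniform(-np.pi, np.pi, (T, F))      # noisy phase spectrum
irm = rng.random((T, F))                          # predicted ideal ratio mask
residual = (rng.standard_normal((T, F))
            + 1j * rng.standard_normal((T, F))) * 0.01  # complex-branch residual

# preliminary denoised spectrum: masked magnitude coupled with the noisy phase
prelim = irm * mag_X * np.exp(1j * theta_X)
# final enhanced spectrum: preliminary spectrum plus the refined residual
S_hat = prelim + residual
assert S_hat.shape == (T, F) and np.iscomplexobj(S_hat)
```

The time-domain enhanced signal would then follow from an inverse STFT of `S_hat`, as stated in step 3.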
6. The single-channel speech enhancement method based on a progressive amplitude compensation network of claim 1, wherein: in the step 1, the short-time Fourier transform (STFT) comprises:

the noisy speech is resampled so that all audio signals have a sampling rate of 16 kHz, framed with a Hanning window of frame length 30 ms and frame shift 10 ms, and then the short-time Fourier transform (STFT) is applied to obtain the real and imaginary parts of each frame of the noisy speech spectrum, as follows:
Y(t, f) = S(t, f) + N(t, f)   (12)

where

Y(t, f) = Y_r(t, f) + jY_i(t, f)

S(t, f) = S_r(t, f) + jS_i(t, f)
where Y(t, f) denotes the noisy speech spectrum after the short-time Fourier transform, t denotes the time dimension and f the frequency dimension; S(t, f) and N(t, f) denote clean speech and background noise, and the subscripts r and i denote the real and imaginary parts of the spectrum, respectively; the number of short-time Fourier transform points is 320, and the 161 dimensions after transformation correspond to the frequency range 0 to 8000 Hz.
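The bin count can be checked with a minimal numpy framing sketch; the claim pairs a 30 ms Hanning window with a 320-point transform without fixing the windowing detail, so this sketch assumes, as a simplification, a window equal to the FFT length:

```python
import numpy as np

fs = 16000                  # sampling rate (16 kHz, per the claim)
hop = int(0.010 * fs)       # 10 ms frame shift -> 160 samples
n_fft = 320                 # 320-point STFT -> 161 frequency bins

x = np.random.default_rng(0).standard_normal(fs)  # 1 s of noise as a stand-in signal

# frame the signal and apply a Hanning window of the FFT length
win = np.hanning(n_fft)
n_frames = 1 + (len(x) - n_fft) // hop
frames = np.stack([x[i * hop : i * hop + n_fft] * win for i in range(n_frames)])

Y = np.fft.rfft(frames, n=n_fft, axis=1)          # complex spectrum Y(t, f)
freqs = np.arange(n_fft // 2 + 1) * (fs / n_fft)  # bin center frequencies

assert Y.shape[1] == 161 and freqs[-1] == 8000.0
```

A one-sided 320-point transform yields 320/2 + 1 = 161 bins spanning 0 to 8000 Hz, matching the dimensions stated in the claim.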
7. The single-channel speech enhancement method based on a progressive amplitude compensation network of claim 1, wherein: in the step 2, the ideal ratio mask IRM is as follows:

the ideal ratio mask IRM, used as the training target for reconstructing the time-frequency representation of the speech to be enhanced, is defined as

IRM = |S| / |X|

where |X| is the magnitude spectrum of the speech to be enhanced (the noisy speech) and |S| is the magnitude spectrum of the clean speech signal.
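A minimal numpy sketch of the mask, assuming the ratio definition IRM = |S|/|X|; the clipping to [0, 1] is an added numerical safeguard, not part of the claim:

```python
import numpy as np

def ideal_ratio_mask(mag_clean, mag_noisy, eps=1e-8):
    # ratio of clean to noisy magnitude, clipped to [0, 1] for stability
    return np.clip(mag_clean / np.maximum(mag_noisy, eps), 0.0, 1.0)

# toy magnitude spectra (hypothetical values)
mag_s = np.array([0.5, 1.0, 0.0])   # clean speech |S|
mag_x = np.array([1.0, 1.0, 2.0])   # noisy speech |X|

m = ideal_ratio_mask(mag_s, mag_x)
assert np.allclose(m, [0.5, 1.0, 0.0])
assert np.allclose(m * mag_x, mag_s)  # masking the noisy magnitude recovers |S|
```

In training, this mask serves as the regression target; at inference, the predicted mask multiplies the noisy magnitude spectrum, as in step 3.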
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that: the processor implements the method of any of claims 1-7 when executing the program.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a processor, implements the method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310969308.6A CN116913303A (en) | 2023-08-01 | 2023-08-01 | Single-channel voice enhancement method based on step-by-step amplitude compensation network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116913303A true CN116913303A (en) | 2023-10-20 |
Family
ID=88364795
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310969308.6A Pending CN116913303A (en) | 2023-08-01 | 2023-08-01 | Single-channel voice enhancement method based on step-by-step amplitude compensation network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116913303A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||