CN112037809A - Residual echo suppression method based on multi-feature stream structure deep neural network

Residual echo suppression method based on multi-feature stream structure deep neural network

Info

Publication number: CN112037809A
Application number: CN202010940284.8A
Authority: CN (China)
Prior art keywords: signal, feature, neural network, model, residual echo
Priority date / filing date: 2020-09-09
Publication date: 2020-12-04
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 陈宏圣, 卢晶
Current Assignee / Original Assignee: Nanjing University

Classifications

    • G10L21/0208 Speech enhancement, e.g. noise reduction or echo cancellation; noise filtering
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L25/18 Extracted parameters being spectral information of each sub-band
    • G10L25/27 Speech or voice analysis characterised by the analysis technique
    • G10L25/30 Analysis technique using neural networks
    • G10L2021/02082 Noise filtering, the noise being echo or reverberation of the speech
    • G10L2021/02087 Noise filtering, the noise being separate speech, e.g. cocktail party

Abstract

The invention discloses a residual echo suppression method based on a deep neural network with a multi-feature stream structure. The method comprises the following steps: (1) constructing, through an adaptive filtering algorithm, noisy near-end speech containing residual echo and background noise together with the output signal of the adaptive filter, and using the adaptive filter output signal, the far-end signal, or both as the reference signal of a neural network model with a multi-feature stream structure; (2) using the noisy speech signal and the reference signal as the input features of the neural network model, and training the model with clean near-end speech as the training target; (3) using the trained multi-feature stream neural network model as a post-processing filter to suppress the residual echo and background noise in the signal processed by the adaptive filter and enhance the audio signal of the near-end speaker. The invention can effectively remove the influence of residual echo on the near-end speech in scenarios with strong residual echo.

Description

Residual echo suppression method based on multi-feature stream structure deep neural network
Technical Field
The invention belongs to the field of echo suppression, and particularly relates to a nonlinear residual echo suppression method based on a deep neural network with a multi-feature stream structure.
Background
In a communication system, the far-end signal is converted into an acoustic signal by the loudspeaker system, travels along an acoustic echo path, and is collected by the microphone system, producing an echo signal. The echo signal severely interferes with the quality of voice communication and degrades the accuracy of speech recognition systems. The technique of suppressing the echo signal and extracting the speech signal of the near-end speaker is called echo suppression.
A typical echo suppression method uses an adaptive linear acoustic echo cancellation (LAEC) algorithm to match the transfer function of the acoustic echo path and then suppresses the residual echo signal further with a post-processing filter. Among the various adaptive algorithms, the frequency-domain least-mean-square adaptive filter algorithm and its derivatives offer fast convergence and a low computational burden, and are therefore often applied in practical echo suppression tasks.
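As a concrete illustration of the adaptive-filtering stage, the following minimal sketch implements a time-domain normalized LMS (NLMS) echo canceller. It is illustrative only: the patent refers to frequency-domain LMS-family and Kalman adaptive filters, and the filter length, step size, and function name here are assumptions.

```python
import numpy as np

def nlms_aec(far, mic, num_taps=512, mu=0.5, eps=1e-8):
    """Adaptive linear echo cancellation: returns (echo_estimate, residual)."""
    w = np.zeros(num_taps)        # adaptive filter coefficients
    x_buf = np.zeros(num_taps)    # most recent far-end samples
    y_hat = np.zeros(len(mic))    # estimated echo
    e = np.zeros(len(mic))        # LAEC output: mic minus estimated echo
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far[n]
        y_hat[n] = w @ x_buf                            # linear echo estimate
        e[n] = mic[n] - y_hat[n]
        w += mu * e[n] * x_buf / (x_buf @ x_buf + eps)  # normalized update
    return y_hat, e
```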
When the echo path exhibits non-negligible nonlinearity, the performance of echo suppression systems built on the linear-system assumption degrades substantially, so the residual echo in the LAEC-processed signal must be suppressed further. Residual echo suppression systems often use the far-end signal, the adaptive filter coefficients, and the LAEC-processed signal to estimate the amplitude of the residual echo and suppress it accordingly. Such signal-processing-based approaches find it difficult to balance residual echo suppression against near-end speech distortion. In response, researchers have introduced deep neural networks into residual echo suppression systems to improve the suppression of nonlinear residual echo. Most of these methods extract features with the short-time Fourier transform and use the magnitude of the time-frequency spectrum as the input feature. On one hand, the short-time Fourier transform imposes a trade-off between processing delay and frequency resolution; on the other hand, using the magnitude spectrum or its mask as the training target cannot recover phase information, which limits the performance of the network.
The Conv-TasNet network (Luo Y., Mesgarani N. Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(8): 1256-1266.), i.e. the fully convolutional time-domain speech separation network, is an end-to-end speech separation network. In the speech separation task, its end-to-end processing allows a short processing delay, and it achieves better results than methods based on time-frequency masks. Since the residual echo suppression task can be regarded as a speech enhancement task that extracts only the near-end speech, it is feasible to extend the Conv-TasNet model from speech separation to the field of residual echo suppression.
Disclosure of Invention
With existing residual echo suppression techniques, when near-end speech is present and residual echo interference is strong, it is often difficult to suppress the residual echo effectively, and the near-end speech is often over-suppressed, which degrades the overall suppression result. The invention provides a residual echo suppression method based on a deep neural network with a multi-feature stream structure that can effectively extract the near-end speech signal under strong residual echo interference.
The technical scheme adopted by the invention is as follows:
the residual echo suppression method based on the multi-feature flow structure deep neural network comprises the following steps:
Step 1: using a clean speech signal, background noise, an echo signal, and the far-end signal corresponding to the echo signal, construct, through an adaptive filtering algorithm, noisy near-end speech containing residual echo and background noise together with the output signal of the adaptive filter.
Step 2: take the adaptive filter output signal, the far-end signal, or both as the reference signal; use the reference signal and the noisy near-end speech signal constructed in Step 1 as input features of a neural network model with a multi-feature stream structure, in which the model jointly processes feature stream A, containing the reference signal, and feature stream B, containing a shallow estimate of the near-end speech; train the model with clean near-end speech as the training target.
Step 3: use the trained multi-feature stream neural network model as a post-processing filter to suppress the residual echo and background noise in the signal processed by the adaptive filter and enhance the audio signal of the near-end speaker.
Further, in Step 2, feature stream A containing the reference signal and feature stream B containing the shallow estimate of the near-end speech are jointly processed in one of the following two ways: (1) subtract feature stream B from the feature stream containing the noisy near-end speech information to obtain a new feature stream C, pass streams A and C through a normalization layer and a convolution layer respectively, and then process them comprehensively; (2) pass streams A and B through a normalization layer and a convolution layer respectively and then process them comprehensively. Here, comprehensive processing refers to concatenation along the feature dimension, point-wise addition, subtraction, multiplication, and division, and operations equivalent to these; way (1) is sketched below.
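A minimal PyTorch sketch of joint-processing way (1): the shallow near-end estimate (stream B) is subtracted from the noisy feature stream to form stream C, streams A and C pass through a normalization layer and a 1 × 1 convolution each, and the results are fused by concatenation. The channel count, the use of GroupNorm as the normalization layer, and the class name are assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class JointProcess(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.norm_a = nn.GroupNorm(1, channels)   # stand-in normalization layers
        self.norm_c = nn.GroupNorm(1, channels)
        self.conv_a = nn.Conv1d(channels, channels, kernel_size=1)
        self.conv_c = nn.Conv1d(channels, channels, kernel_size=1)
        self.fuse = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_noisy, stream_a, stream_b):
        stream_c = feat_noisy - stream_b          # remove shallow speech estimate
        a = self.conv_a(self.norm_a(stream_a))
        c = self.conv_c(self.norm_c(stream_c))
        # "comprehensive processing" realized here as concatenation + 1x1 conv
        return self.fuse(torch.cat([a, c], dim=1))
```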
Furthermore, the neural network model with the multi-feature stream structure extracts features from the time-domain waveform through an encoder module; processes them through several one-dimensional convolution modules with different dilation rates and a multi-feature stream convolution module in a suppressor module to obtain a mask for the estimated clean-speech feature spectrum; applies the mask to the noisy-speech feature spectrum output by the encoder to obtain an estimate of the clean-speech feature spectrum; and finally restores this estimate to a time-domain waveform through a decoder module.
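The encoder/suppressor/decoder data flow can be sketched as follows, assuming the suppressor is supplied as a callable that maps the noisy and reference feature spectra to a mask; sharing one encoder for both inputs and the class name are simplifying assumptions.

```python
import torch
import torch.nn as nn

class MaskingModel(nn.Module):
    """Encoder -> mask-estimating suppressor -> decoder, Conv-TasNet style."""
    def __init__(self, N=512, L=40, S=10, suppressor=None):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, N, kernel_size=L, stride=S), nn.ReLU())
        self.suppressor = suppressor       # stack of 1-D Conv / MI Conv modules
        self.decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=S)

    def forward(self, noisy_wave, ref_wave):      # shapes (batch, 1, samples)
        feat_noisy = self.encoder(noisy_wave)     # noisy feature spectrum
        feat_ref = self.encoder(ref_wave)         # reference feature spectrum
        mask = self.suppressor(feat_noisy, feat_ref)
        return self.decoder(mask * feat_noisy)    # masked features -> waveform
```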
By jointly processing the feature stream containing the reference signal information and the feature stream containing the shallow estimate of the clean speech in the neural network, the network model can still obtain a near-end speech estimate with high quality and little distortion under strong residual echo interference, effectively removing the influence of the residual echo on the near-end speech, with strong robustness.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the Conv-TasNet model with the multi-feature stream architecture employed in an embodiment of the present invention.
FIG. 3 is a schematic diagram of (a) the 1-D Conv module and (b) the MI Conv module of the model in FIG. 2.
FIG. 4 compares PESQ values of the enhanced speech for the prior-art methods and the method of the present invention under different SER conditions: (a) the far-end signal is speech; (b) the far-end signal is music.
Detailed Description
The present invention is further illustrated by the following figures and specific examples, which are to be understood as illustrative only and not as limiting the scope of the invention; the invention is to be given the full breadth of the appended claims and of any and all equivalent modifications thereof that may occur to those skilled in the art upon reading the present specification.
This embodiment is carried out in simulation and provides a residual echo suppression method based on a Conv-TasNet model with a multi-feature stream structure, suitable for conditions of strong residual echo interference. It includes the following steps:
1. Generating training samples
2. Training multi-feature stream structure Conv-TasNet model
An information attribute is defined as an additional attribute of a module output in the model, and a feature stream is defined as the module outputs sharing the same information attribute. Each original input feature of the model is likewise treated as a feature stream, whose information attribute is the physical meaning of that input feature's own information. When a feature stream is operated on, the information attribute of the operation's output feature stream is inherited from the input feature stream. Information attributes are divided into prior information attributes and posterior information attributes: determining a prior information attribute requires analyzing the inheritance relations of information attributes in the model framework, while determining a posterior information attribute requires analyzing the actual physical meaning of the feature stream in the trained model. When a prior information attribute conflicts with a posterior information attribute, the posterior attribute takes precedence. The "shallow estimate of near-end speech" is defined as a type of posterior information attribute; when several feature streams are mixed by an operation, this attribute is not inherited. In this embodiment, a feature stream with the attribute "shallow estimate of near-end speech" can be output directly by the decoder module as an estimate of the near-end speech. Such feature streams form part of the overall output of the model subgraph through skip connections, and when the overall training loss of the model is computed, they are used to compute separate losses that are combined with the loss of the final model output. It should be understood that the above definitions serve to explain the notion of feature stream used in the claims rather than to describe the model structure; for convenience of explanation, the feature stream partitioning below does not strictly follow these definitions.
The basic block diagram of the Conv-TasNet model with the multi-feature stream structure adopted in this embodiment is shown in FIG. 2. The model accepts the noisy near-end speech and the adaptive filter output signal (the model reference signal) as inputs. The encoder (Encoder) module consists of a single one-dimensional convolution layer with output dimension N, convolution kernel length L, and stride S, followed by a rectified linear unit (ReLU) activation, and converts the time-domain waveform into a feature spectrum. The decoder (Decoder) module consists of a single one-dimensional transposed convolution layer and restores the feature spectrum to a time-domain waveform. The suppressor (Suppression) module consists of a normalization (LayerNorm) layer, a 1 × 1 convolution (1 × 1 Conv) layer with output dimension B, and R sub-modules; each sub-module comprises M one-dimensional convolution (1-D Conv) modules with dilation rates 1, 2, 4, ..., 2^{M-1}. Every sub-module except the first also contains a multi-feature stream convolution (MI Conv) module.
This embodiment adopts an exponential moving average normalization as the normalization layer of the module, defined as

$$\mu_k = (1-\alpha)\sum_{i=0}^{N_\alpha}\alpha^{i}\,\frac{1}{F}\sum_{j=1}^{F} f_{k-i,j}$$

$$\sigma_k^2 = (1-\alpha)\sum_{i=0}^{N_\alpha}\alpha^{i}\,\frac{1}{F}\sum_{j=1}^{F}\left(f_{k-i,j}-\mu_k\right)^2$$

$$\hat{f}_{k,j} = \gamma\,\frac{f_{k,j}-\mu_k}{\left(\sigma_k^2+\epsilon\right)^{\Omega}} + \beta$$

where f_{k,j} is the j-th feature of the k-th frame of the feature spectrum, F is the feature dimension, γ and β are trainable parameters, and α, ε, Ω are hyperparameters; unless otherwise specified, Ω is set to 0.5 hereinafter. In the experiments the parameter N_α is set to a finite value so that the normalization can be implemented efficiently by a convolution operation.
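The following sketch implements this exponential moving average normalization, realizing the truncated exponential window of length N_α with a conv1d along the time axis, as the text suggests. Computing the variance as the windowed second moment minus the squared windowed mean is a simplification of the σ_k² above and, like the class name, an assumption; the defaults follow the parameter settings given later (α = 0.989, N_α = 640).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMANorm(nn.Module):
    def __init__(self, channels, alpha=0.989, n_alpha=640, omega=0.5, eps=1e-8):
        super().__init__()
        self.omega, self.eps = omega, eps
        # weights (1 - alpha) * alpha^lag, stored oldest-first so that conv1d
        # (cross-correlation) aligns lag 0 with the current frame
        w = (1 - alpha) * alpha ** torch.arange(n_alpha - 1, -1, -1.0)
        self.register_buffer("win", w.view(1, 1, -1))
        self.gamma = nn.Parameter(torch.ones(1, channels, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1))

    def forward(self, f):                          # f: (batch, F, frames)
        pad = self.win.shape[-1] - 1               # causal left padding
        m1 = f.mean(dim=1, keepdim=True)           # per-frame feature mean
        m2 = (f ** 2).mean(dim=1, keepdim=True)    # per-frame second moment
        mu = F.conv1d(F.pad(m1, (pad, 0)), self.win)
        var = (F.conv1d(F.pad(m2, (pad, 0)), self.win) - mu ** 2).clamp(min=0.0)
        fhat = (f - mu) / (var + self.eps) ** self.omega
        return fhat * self.gamma + self.beta
```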
The basic block diagram of the 1-D Conv module is shown in FIG. 3(a). The module consists, in sequence, of a 1 × 1 Conv layer with output dimension H, a parametric rectified linear unit (PReLU) layer, a normalization (Norm) layer, a depthwise convolution (D-Conv) layer with kernel length P, a PReLU layer, a Norm layer, and two parallel 1 × 1 Conv layers with output dimension B. The output of one terminal 1 × 1 Conv layer is added to the module input to form the residual input of the following module; the other 1 × 1 Conv layer provides the skip-connection output that forms part of the final output of the model. A sketch follows.
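A sketch of the 1-D Conv module just described, with the residual and skip 1 × 1 Conv outputs; GroupNorm stands in for the normalization layer, and the default sizes follow the parameter settings given later (B = 256, H = 512, P = 3).

```python
import torch.nn as nn

class Conv1DBlock(nn.Module):
    def __init__(self, B=256, H=512, P=3, dilation=1):
        super().__init__()
        pad = (P - 1) * dilation // 2            # keep the frame count unchanged
        self.body = nn.Sequential(
            nn.Conv1d(B, H, 1), nn.PReLU(), nn.GroupNorm(1, H),
            nn.Conv1d(H, H, P, padding=pad, dilation=dilation, groups=H),
            nn.PReLU(), nn.GroupNorm(1, H))
        self.res = nn.Conv1d(H, B, 1)            # residual-path output
        self.skip = nn.Conv1d(H, B, 1)           # skip-connection output

    def forward(self, x):
        h = self.body(x)
        return x + self.res(h), self.skip(h)

# dilation rates 1, 2, 4, ..., 2**(M-1) across the M blocks of a sub-module
blocks = [Conv1DBlock(dilation=2 ** m) for m in range(8)]
```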
The basic block diagram of the MI Conv module is shown in FIG. 3(b). This module accepts 4 feature streams as inputs: (1) the feature spectrum of the noisy near-end speech (stream A), (2) the feature spectrum of the model reference signal (stream B), (3) the residual output of the preceding module (stream C), and (4) the sum of the skip-connection outputs of the preceding modules (stream D). The sub operation is used to extract the characteristics of the residual echo signal from the output of the preceding module, and is defined as

$$f_{O_i} = f_A \odot g_{ob}\left(f_{D_i}\right)$$

$$f_{sub_i} = f_A - \beta_i \odot f_{O_i}$$

where f_A and f_{D_i} are the feature spectra of stream A and the i-th stream D, g_{ob} denotes the same operation as the output block of the suppressor module, β_i is a trainable parameter, and f_{sub_i} is the output of the sub operation. Since f_{O_i} can be regarded as a rough approximation of the near-end speech, f_{sub_i} accordingly extracts information about the residual echo. After this, stream B and the output of the sub operation are normalized and dimensionally scaled by two parallel normalization (Norm) layers and 1 × 1 Conv layers; the hyperparameter Ω of these normalization layers is set to 0.4 to retain certain amplitude characteristics. The two output streams are then concatenated (Concat) and processed through a normalization layer, a 1 × 1 Conv layer, a PReLU layer, a Norm layer, a D-Conv layer, a PReLU layer, and a Norm layer. The processed result is concatenated with the feature spectrum of stream C, processed by a 1 × 1 Conv layer, and added to the feature spectrum of stream C to form the residual output. Through these operations the MI Conv module fuses the information of the reference signal and the residual echo into stream C to guide the model in suppressing the residual echo. If the sub operation is omitted and the early near-end speech approximation f_{O_i} is processed directly by the Norm layer and the 1 × 1 Conv layer, the SISNR (scale-invariant signal-to-noise ratio) on the training set of this model drops slightly, by about 0.2 dB, without producing a significant difference.
In the training process, the f_{O_i} generated from stream A and stream D, together with the final output of the model, produce the overall training loss of the model, Loss_total, defined as

$$Loss_{total} = w\sum_{i} loss_i + loss_{last}$$

where loss_i is the loss of the time-domain waveform obtained by converting f_{O_i} through the decoder module, loss_last is the loss of the model output, and w is a weight parameter. In this embodiment, SISNR is used as the metric for loss_i and loss_last.
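A sketch of the SISNR-based training objective: the negative SI-SNR serves as each loss term, and the weighted combination mirrors the Loss_total reconstructed above; since the exact weighting scheme is uncertain, the form of `total_loss` is an assumption.

```python
import torch

def sisnr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR (lower loss = higher SI-SNR)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (
        (ref ** 2).sum(-1, keepdim=True) + eps)   # projection onto the target
    noise = est - proj
    sisnr = 10 * torch.log10((proj ** 2).sum(-1) / ((noise ** 2).sum(-1) + eps))
    return -sisnr.mean()

def total_loss(intermediate_ests, final_est, target, w=0.707):
    loss_i = sum(sisnr_loss(e, target) for e in intermediate_ests)
    return w * loss_i + sisnr_loss(final_est, target)
```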
3. Estimation of clean near-end speech using a multi-feature stream structure Conv-TasNet model
When the Conv-TasNet model with the multi-feature stream structure is used, it only needs to receive the noisy near-end speech and the adaptive filter output signal as inputs, and the clean near-end speech estimate is obtained at the model output. The basic block diagram for estimating clean near-end speech with this model is shown in FIG. 1: the far-end signal is first used to adaptively filter the signal captured by the microphone, and the noisy signal after adaptive filtering together with the adaptive filter signal are then input into the model, yielding a signal in which the residual echo is suppressed and the near-end speech is enhanced. A usage sketch follows.
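Enhancement-stage usage, assuming the MaskingModel sketch above and an already trained suppressor; the variable names are illustrative.

```python
import torch

model = MaskingModel(suppressor=trained_suppressor).eval()
with torch.no_grad():
    # noisy: microphone signal after adaptive filtering (residual echo remains)
    # ref: adaptive filter output signal used as the model reference
    enhanced = model(noisy.view(1, 1, -1), ref.view(1, 1, -1))
```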
Therefore, residual echo is suppressed, and a near-end speech enhancement result is obtained.
A simulation case is given below.
1. Training and testing samples and objective evaluation indices
When constructing the simulated training data, this embodiment considers the actual use scenario of a smart speaker: the TIMIT corpus is used for the near-end speech, and the music library of the MUSAN corpus together with the TIMIT corpus are used as far-end signals. Speech data of 400 speakers are randomly selected from the TIMIT corpus as training set data, and speech data of 40 speakers are randomly selected from the remaining speakers as test set data. For each speaker, 10 segments of 16 kHz sampled speech data are selected. 400 segments of speech data are drawn from the training set as a validation set. The MUSAN music library is divided into 4 s audio segments; 19269 segments are selected as training set data, 400 segments as validation set data, and 400 segments from songs not in the training or validation sets as the test set.
To construct the echo data, a clipping transform is first applied to the far-end signal, defined as

$$x_{clip}(n) = \begin{cases} -x_{max}, & x(n) < -x_{max} \\ x(n), & |x(n)| \le x_{max} \\ x_{max}, & x(n) > x_{max} \end{cases}$$

where x_max is 80% of the maximum value of x(n). A sigmoidal function is then applied to the clipped signal to simulate the nonlinear distortion of the loudspeaker, defined as

$$x_{NL}(n) = 4\left(\frac{2}{1+\exp\left(-a\cdot b(n)\right)} - 1\right)$$

$$b(n) = 1.5\,x_{clip}(n) - 0.3\,x_{clip}^2(n)$$

$$a = \begin{cases} 4, & b(n) > 0 \\ 0.5, & b(n) \le 0 \end{cases}$$
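In code, the loudspeaker-distortion simulation per the equations above reads as follows; the clipping form and the sigmoid constants follow Zhang and Wang (2018), which this reconstruction assumes.

```python
import numpy as np

def simulate_loudspeaker(x):
    """Clipping at 80% of the peak followed by a memoryless sigmoid."""
    x_max = 0.8 * np.max(np.abs(x))
    x_clip = np.clip(x, -x_max, x_max)
    b = 1.5 * x_clip - 0.3 * x_clip ** 2
    a = np.where(b > 0, 4.0, 0.5)
    return 4.0 * (2.0 / (1.0 + np.exp(-a * b)) - 1.0)
```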
To simulate a real room reverberation environment, 50 virtual small offices with length, width, and height between 2 m and 5 m and reverberation time T60 between 150 ms and 450 ms are constructed at random; virtual loudspeaker units and virtual microphone units are placed in each room, and a total of 500 room impulse responses are computed with the image source (virtual source) method. Of these, 400 are used to construct the training and validation sets and the remaining 100 to construct the test set. The signal after the sigmoidal transform is convolved with a room impulse response to obtain the simulated echo signal. Finally, a linear adaptive filter based on the Kalman algorithm adaptively filters the constructed echo signal using the corresponding far-end signal, producing the residual echo signal and the adaptive filter output signal. In this embodiment, the linear adaptive filter suppresses the echo signal energy by about 14.0 dB.
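Assembling one training example can then be sketched as below, reusing simulate_loudspeaker from the previous snippet; `adaptive_filter` stands in for the Kalman-based LAEC, and the helper name and the choice of the echo estimate as the reference signal are assumptions.

```python
from scipy.signal import fftconvolve

def make_training_example(near, far, rir, noise, adaptive_filter):
    echo = fftconvolve(simulate_loudspeaker(far), rir)[:len(near)]
    mic = near + echo + noise                    # simulated microphone signal
    y_hat, residual = adaptive_filter(far, mic)  # LAEC estimate and output
    noisy_near = residual                        # noisy near-end speech (input)
    return noisy_near, y_hat                     # y_hat: model reference signal
```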
In the training set, 36000 speech echo signals and 38578 music echo signals are constructed. In the validation set and the test set, 400 speech echo signals and 400 music echo signals each are constructed.
This embodiment employs the Perceptual Evaluation of Speech Quality (PESQ) index as the objective evaluation index of residual echo suppression performance.
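For reference, PESQ can be computed with the open-source `pesq` package (an implementation of ITU-T P.862); the wideband mode shown is a choice, not something the text specifies.

```python
from pesq import pesq  # pip install pesq

score = pesq(16000, clean_near_end, enhanced, 'wb')  # wideband PESQ at 16 kHz
```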
2. Parameter setting
In this embodiment, the encoder module parameter N is set to 512, L to 40, and S to 10. The suppressor module parameter R is set to 4 and M to 8. The 1-D Conv module parameter B is set to 256, H to 512, and P to 3. All 1 × 1 Conv layer output dimensions in the MI Conv module are set to 256, and the D-Conv layer kernel length is set to 128. The exponential moving average normalization parameters are α = 0.989 and N_α = 640. The weight parameter w of the overall model loss function is set to 0.707. During training the model is optimized with the Adam optimizer at a learning rate of 0.001. Training data are fed to the network in batches of two 4 s signals at a time. One traversal of the near-end speech training set constitutes a training epoch, and the model is trained for 120 epochs. After each epoch the model is evaluated on the validation set, and the optimizer learning rate is halved whenever the performance does not improve for 4 epochs. To improve training robustness, gradient clipping based on the two-norm is used, with the maximum two-norm set to 5.
3. Concrete implementation process of the method
Referring to fig. 1, the method is mainly divided into a training phase and an enhancement phase.
In the training stage, echo signals are recorded or constructed artificially; the corresponding far-end signals are used to construct residual echo signals through adaptive filtering, and these are superimposed with near-end speech and background noise to construct the noisy near-end speech. The adaptive filter signal, the far-end signal, or both are used as the reference signal of the model. The noisy near-end speech and the model reference signal serve as model inputs, and the clean near-end speech serves as the target output for training. In the enhancement stage, the signal obtained by adaptively filtering the signal acquired by the microphone, together with the model reference signal composed of the adaptive filter output signal and the far-end signal, are input into the model, yielding an estimate of the clean near-end speech.
To demonstrate the performance improvement of the present invention over existing methods, this embodiment compares the residual echo suppression method based on the bidirectional long short-term memory network (BLSTM) in the literature (Zhang H., Wang D. Deep Learning for Acoustic Echo Cancellation in Noisy and Double-Talk Scenarios [C]. Interspeech, 2018: 3239-3243.) and the residual echo suppression method based on the fully connected network (FCN) in the literature (Carbajal G., Serizel R., Vincent E., et al. Multiple-Input Neural Network-Based Residual Echo Suppression [C]. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: 231-235.). FIGS. 4(a) and 4(b) show histograms of PESQ values of the enhanced speech for the different methods at SERs of -14.2 dB and -18.2 dB, with the far-end signal being speech and music respectively. In the figure, the black bars represent the PESQ scores at an SER of -14.2 dB and the light gray bars the PESQ scores at an SER of -18.2 dB. Compared with existing deep-neural-network-based residual echo suppression methods, the near-end speech enhancement performance of the method of the present invention is significantly improved under all echo conditions.

Claims (3)

1. A residual echo suppression method based on a deep neural network with a multi-feature stream structure, characterized by comprising the following steps:
Step 1: using a clean speech signal, background noise, an echo signal, and the far-end signal corresponding to the echo signal, construct, through an adaptive filtering algorithm, noisy near-end speech containing residual echo and background noise together with the output signal of the adaptive filter;
Step 2: take the adaptive filter output signal, the far-end signal, or both as the reference signal; use the reference signal and the noisy near-end speech signal constructed in Step 1 as input features of a neural network model with a multi-feature stream structure, in which the model jointly processes feature stream A, containing the reference signal, and feature stream B, containing a shallow estimate of the near-end speech; train the model with clean near-end speech as the training target;
Step 3: use the trained multi-feature stream neural network model as a post-processing filter to suppress the residual echo and background noise in the signal processed by the adaptive filter and enhance the audio signal of the near-end speaker.
2. The method according to claim 1, characterized in that in Step 2, feature stream A containing the reference signal and feature stream B containing the shallow estimate of the near-end speech are jointly processed in one of the following ways: (1) subtract feature stream B from the feature stream containing the noisy near-end speech information to obtain a new feature stream C, pass streams A and C through a normalization layer and a convolution layer respectively, and then process them comprehensively; (2) pass streams A and B through a normalization layer and a convolution layer respectively and then process them comprehensively; where comprehensive processing refers to concatenation along the feature dimension, point-wise addition, subtraction, multiplication, and division, and operations equivalent to these.
3. The method according to claim 1, characterized in that the neural network model with the multi-feature stream structure extracts features from the time-domain waveform through an encoder module, processes them through several one-dimensional convolution modules with different dilation rates and a multi-feature stream convolution module in a suppressor module to obtain a mask for the estimated clean-speech feature spectrum, applies the mask to the noisy-speech feature spectrum output by the encoder to obtain an estimate of the clean-speech feature spectrum, and finally restores this estimate to a time-domain waveform through a decoder module.
CN202010940284.8A 2020-09-09 Residual echo suppression method based on multi-feature stream structure deep neural network (Pending) CN112037809A (en)

Priority Applications (1)

    • CN202010940284.8A (priority date and filing date 2020-09-09): Residual echo suppression method based on multi-feature stream structure deep neural network

Publications (1)

    • CN112037809A, published 2020-12-04

Family ID: 73585130

Country Status (1)

    • CN: CN112037809A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180040333A1 (en) * 2016-08-03 2018-02-08 Apple Inc. System and method for performing speech enhancement using a deep neural network-based signal
US20200201435A1 (en) * 2018-12-20 2020-06-25 Massachusetts Institute Of Technology End-To-End Deep Neural Network For Auditory Attention Decoding
CN111081268A (en) * 2019-12-18 2020-04-28 浙江大学 Phase-correlated shared deep convolutional neural network speech enhancement method
CN111341341A (en) * 2020-02-11 2020-06-26 腾讯科技(深圳)有限公司 Training method of audio separation network, audio separation method, device and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hongsheng Chen et al.: "Nonlinear Residual Echo Suppression Based on Multi-stream Conv-TasNet", arXiv:2005.07631v1 [eess.AS], pages 1-2 *
李曾玺 (Li Zengxi): "Research on Single-Channel Speech Separation Methods Based on Autoregressive Deep Neural Networks", China Doctoral Dissertations Full-text Database, no. 08 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022213825A1 (en) * 2021-04-06 2022-10-13 京东科技控股股份有限公司 Neural network-based end-to-end speech enhancement method and apparatus
CN113286047A (en) * 2021-04-22 2021-08-20 维沃移动通信(杭州)有限公司 Voice signal processing method and device and electronic equipment
CN113286047B (en) * 2021-04-22 2023-02-21 维沃移动通信(杭州)有限公司 Voice signal processing method and device and electronic equipment
CN113436636A (en) * 2021-06-11 2021-09-24 深圳波洛斯科技有限公司 Acoustic echo cancellation method and system based on adaptive filter and neural network
CN113362843A (en) * 2021-06-30 2021-09-07 北京小米移动软件有限公司 Audio signal processing method and device
CN113362843B (en) * 2021-06-30 2023-02-17 北京小米移动软件有限公司 Audio signal processing method and device
CN113707167A (en) * 2021-08-31 2021-11-26 北京地平线信息技术有限公司 Training method and training device for residual echo suppression model
CN114242100A (en) * 2021-12-16 2022-03-25 北京百度网讯科技有限公司 Audio signal processing method, training method and device, equipment and storage medium thereof


Legal Events

    • PB01: Publication
    • SE01: Entry into force of request for substantive examination