CN112037809A - Residual echo suppression method based on multi-feature stream structure deep neural network

Residual echo suppression method based on multi-feature stream structure deep neural network

Info

Publication number: CN112037809A
Application number: CN202010940284.8A
Authority: CN (China)
Prior art keywords: signal, feature, neural network, model, residual echo
Priority date / filing date: 2020-09-09
Publication date: 2020-12-04
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 陈宏圣, 卢晶
Current Assignee / Original Assignee: Nanjing University

Classifications

    • G10L21/0208 Speech enhancement, e.g. noise reduction or echo cancellation; noise filtering
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L25/18 Extracted parameters being spectral information of each sub-band
    • G10L25/27 Speech or voice analysis characterised by the analysis technique
    • G10L25/30 Analysis technique using neural networks
    • G10L2021/02082 Noise filtering, the noise being echo or reverberation of the speech
    • G10L2021/02087 Noise filtering, the noise being separate speech, e.g. cocktail party

Abstract

The invention discloses a residual echo suppression method based on a deep neural network with a multi-feature stream structure. The method comprises the following steps: (1) constructing, through an adaptive filtering algorithm, noisy near-end speech containing residual echo and background noise together with the output signal of the adaptive filter, and using the adaptive filter output signal, the far-end signal, or both as the reference signal of a neural network model with a multi-feature stream structure; (2) using the noisy speech signal and the reference signal as the input features of the neural network model, and training the model with clean near-end speech as the training target; (3) using the trained multi-feature stream neural network model as a post-processing filter to suppress the residual echo and background noise in the signal processed by the adaptive filter and enhance the audio signal of the near-end speaker. The invention can effectively remove the influence of residual echo on the near-end speech in scenarios with strong residual echo.

Description

Residual echo suppression method based on multi-feature stream structure deep neural network
Technical Field
The invention belongs to the field of echo suppression, and particularly relates to a nonlinear residual echo suppression method based on a deep neural network with a multi-feature stream structure.
Background
In a communication system, the far-end signal is converted into an acoustic signal by the loudspeaker system, travels along an acoustic echo path, and is collected by the microphone system, producing an echo signal. The echo signal severely interferes with the quality of voice communication and degrades the accuracy of speech recognition systems. The technique of suppressing the echo signal and extracting the speech signal of the near-end speaker is called echo suppression.
A typical echo suppression method uses an adaptive linear acoustic echo cancellation (LAEC) algorithm to match the transfer function of the acoustic echo path and then suppresses the residual echo signal further with a post-processing filter. Among the various adaptive algorithms, the frequency-domain least-mean-square adaptive filter algorithm and its derivatives offer fast convergence and a low computational burden, and are therefore often applied in practical echo suppression tasks.
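As a concrete illustration of the adaptive-filtering stage, the following minimal sketch implements a time-domain normalized LMS (NLMS) echo canceller. It is illustrative only: the patent refers to frequency-domain LMS-family and Kalman adaptive filters, and the filter length, step size, and function name here are assumptions.

```python
import numpy as np

def nlms_aec(far, mic, num_taps=512, mu=0.5, eps=1e-8):
    """Adaptive linear echo cancellation: returns (echo_estimate, residual)."""
    w = np.zeros(num_taps)        # adaptive filter coefficients
    x_buf = np.zeros(num_taps)    # most recent far-end samples
    y_hat = np.zeros(len(mic))    # estimated echo
    e = np.zeros(len(mic))        # LAEC output: mic minus estimated echo
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far[n]
        y_hat[n] = w @ x_buf                            # linear echo estimate
        e[n] = mic[n] - y_hat[n]
        w += mu * e[n] * x_buf / (x_buf @ x_buf + eps)  # normalized update
    return y_hat, e
```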
When the echo path exhibits non-negligible nonlinearity, the performance of echo suppression systems built on the linear-system assumption degrades substantially, so the residual echo in the LAEC-processed signal must be suppressed further. Residual echo suppression systems often use the far-end signal, the adaptive filter coefficients, and the LAEC-processed signal to estimate the amplitude of the residual echo and suppress it accordingly. Such signal-processing-based approaches find it difficult to balance residual echo suppression against near-end speech distortion. In response, researchers have introduced deep neural networks into residual echo suppression systems to improve the suppression of nonlinear residual echo. Most of these methods extract features with the short-time Fourier transform and use the magnitude of the time-frequency spectrum as the input feature. On one hand, the short-time Fourier transform imposes a trade-off between processing delay and frequency resolution; on the other hand, using the magnitude spectrum or its mask as the training target cannot recover phase information, which limits the performance of the network.
The Conv-TasNet network (Luo Y., Mesgarani N. Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(8): 1256-1266.), i.e. the fully convolutional time-domain speech separation network, is an end-to-end speech separation network. In the speech separation task, its end-to-end processing allows a short processing delay, and it achieves better results than methods based on time-frequency masks. Since the residual echo suppression task can be regarded as a speech enhancement task that extracts only the near-end speech, it is feasible to extend the Conv-TasNet model from speech separation to the field of residual echo suppression.
Disclosure of Invention
With existing residual echo suppression techniques, when near-end speech is present and residual echo interference is strong, it is often difficult to suppress the residual echo effectively, and the near-end speech is often over-suppressed, which degrades the overall suppression result. The invention provides a residual echo suppression method based on a deep neural network with a multi-feature stream structure that can effectively extract the near-end speech signal under strong residual echo interference.
The technical scheme adopted by the invention is as follows:
the residual echo suppression method based on the multi-feature flow structure deep neural network comprises the following steps:
Step 1: using a clean speech signal, background noise, an echo signal, and the far-end signal corresponding to the echo signal, construct, through an adaptive filtering algorithm, noisy near-end speech containing residual echo and background noise together with the output signal of the adaptive filter.
Step 2: take the adaptive filter output signal, the far-end signal, or both as the reference signal; use the reference signal and the noisy near-end speech signal constructed in Step 1 as input features of a neural network model with a multi-feature stream structure, in which the model jointly processes feature stream A, containing the reference signal, and feature stream B, containing a shallow estimate of the near-end speech; train the model with clean near-end speech as the training target.
Step 3: use the trained multi-feature stream neural network model as a post-processing filter to suppress the residual echo and background noise in the signal processed by the adaptive filter and enhance the audio signal of the near-end speaker.
Further, in Step 2, feature stream A containing the reference signal and feature stream B containing the shallow estimate of the near-end speech are jointly processed in one of the following two ways: (1) subtract feature stream B from the feature stream containing the noisy near-end speech information to obtain a new feature stream C, pass streams A and C through a normalization layer and a convolution layer respectively, and then process them comprehensively; (2) pass streams A and B through a normalization layer and a convolution layer respectively and then process them comprehensively. Here, comprehensive processing refers to concatenation along the feature dimension, point-wise addition, subtraction, multiplication, and division, and operations equivalent to these; way (1) is sketched below.
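A minimal PyTorch sketch of joint-processing way (1): the shallow near-end estimate (stream B) is subtracted from the noisy feature stream to form stream C, streams A and C pass through a normalization layer and a 1 × 1 convolution each, and the results are fused by concatenation. The channel count, the use of GroupNorm as the normalization layer, and the class name are assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class JointProcess(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.norm_a = nn.GroupNorm(1, channels)   # stand-in normalization layers
        self.norm_c = nn.GroupNorm(1, channels)
        self.conv_a = nn.Conv1d(channels, channels, kernel_size=1)
        self.conv_c = nn.Conv1d(channels, channels, kernel_size=1)
        self.fuse = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_noisy, stream_a, stream_b):
        stream_c = feat_noisy - stream_b          # remove shallow speech estimate
        a = self.conv_a(self.norm_a(stream_a))
        c = self.conv_c(self.norm_c(stream_c))
        # "comprehensive processing" realized here as concatenation + 1x1 conv
        return self.fuse(torch.cat([a, c], dim=1))
```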
Furthermore, the neural network model with the multi-feature stream structure extracts features from the time-domain waveform through an encoder module; processes them through several one-dimensional convolution modules with different dilation rates and a multi-feature stream convolution module in a suppressor module to obtain a mask for the estimated clean-speech feature spectrum; applies the mask to the noisy-speech feature spectrum output by the encoder to obtain an estimate of the clean-speech feature spectrum; and finally restores this estimate to a time-domain waveform through a decoder module.
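The encoder/suppressor/decoder data flow can be sketched as follows, assuming the suppressor is supplied as a callable that maps the noisy and reference feature spectra to a mask; sharing one encoder for both inputs and the class name are simplifying assumptions.

```python
import torch
import torch.nn as nn

class MaskingModel(nn.Module):
    """Encoder -> mask-estimating suppressor -> decoder, Conv-TasNet style."""
    def __init__(self, N=512, L=40, S=10, suppressor=None):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, N, kernel_size=L, stride=S), nn.ReLU())
        self.suppressor = suppressor       # stack of 1-D Conv / MI Conv modules
        self.decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=S)

    def forward(self, noisy_wave, ref_wave):      # shapes (batch, 1, samples)
        feat_noisy = self.encoder(noisy_wave)     # noisy feature spectrum
        feat_ref = self.encoder(ref_wave)         # reference feature spectrum
        mask = self.suppressor(feat_noisy, feat_ref)
        return self.decoder(mask * feat_noisy)    # masked features -> waveform
```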
By jointly processing the feature stream containing the reference signal information and the feature stream containing the shallow estimate of the clean speech in the neural network, the network model can still obtain a near-end speech estimate with high quality and little distortion under strong residual echo interference, effectively removing the influence of the residual echo on the near-end speech, with strong robustness.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the Conv-TasNet model with the multi-feature stream architecture employed in an embodiment of the present invention.
FIG. 3 is a schematic diagram of (a) the 1-D Conv module and (b) the MI Conv module of the model in FIG. 2.
FIG. 4 compares PESQ values of the enhanced speech for the prior-art methods and the method of the present invention under different SER conditions: (a) the far-end signal is speech; (b) the far-end signal is music.
Detailed Description
The present invention is further illustrated by the following figures and specific examples, which are to be understood as illustrative only and not as limiting the scope of the invention; the invention is to be given the full breadth of the appended claims and of any and all equivalent modifications thereof that may occur to those skilled in the art upon reading the present specification.
This embodiment is carried out in simulation and provides a residual echo suppression method based on a Conv-TasNet model with a multi-feature stream structure, suitable for conditions of strong residual echo interference. It includes the following steps:
1. Generating training samples
2. Training multi-feature stream structure Conv-TasNet model
An information attribute is defined as an additional attribute of a module output in the model, and a feature stream is defined as the module outputs sharing the same information attribute. Each original input feature of the model is likewise treated as a feature stream, whose information attribute is the physical meaning of that input feature's own information. When a feature stream is operated on, the information attribute of the operation's output feature stream is inherited from the input feature stream. Information attributes are divided into prior information attributes and posterior information attributes: determining a prior information attribute requires analyzing the inheritance relations of information attributes in the model framework, while determining a posterior information attribute requires analyzing the actual physical meaning of the feature stream in the trained model. When a prior information attribute conflicts with a posterior information attribute, the posterior attribute takes precedence. The "shallow estimate of near-end speech" is defined as a type of posterior information attribute; when several feature streams are mixed by an operation, this attribute is not inherited. In this embodiment, a feature stream with the attribute "shallow estimate of near-end speech" can be output directly by the decoder module as an estimate of the near-end speech. Such feature streams form part of the overall output of the model subgraph through skip connections, and when the overall training loss of the model is computed, they are used to compute separate losses that are combined with the loss of the final model output. It should be understood that the above definitions serve to explain the notion of feature stream used in the claims rather than to describe the model structure; for convenience of explanation, the feature stream partitioning below does not strictly follow these definitions.
The basic block diagram of the Conv-TasNet model with the multi-feature stream structure adopted in this embodiment is shown in FIG. 2. The model accepts the noisy near-end speech and the adaptive filter output signal (the model reference signal) as inputs. The encoder (Encoder) module consists of a single one-dimensional convolution layer with output dimension N, convolution kernel length L, and stride S, followed by a rectified linear unit (ReLU) activation, and converts the time-domain waveform into a feature spectrum. The decoder (Decoder) module consists of a single one-dimensional transposed convolution layer and restores the feature spectrum to a time-domain waveform. The suppressor (Suppression) module consists of a normalization (LayerNorm) layer, a 1 × 1 convolution (1 × 1 Conv) layer with output dimension B, and R sub-modules; each sub-module comprises M one-dimensional convolution (1-D Conv) modules with dilation rates 1, 2, 4, ..., 2^{M-1}. Every sub-module except the first also contains a multi-feature stream convolution (MI Conv) module.
This embodiment adopts an exponential moving average normalization as the normalization layer of the module, defined as

$$\mu_k = (1-\alpha)\sum_{i=0}^{N_\alpha}\alpha^{i}\,\frac{1}{F}\sum_{j=1}^{F} f_{k-i,j}$$

$$\sigma_k^2 = (1-\alpha)\sum_{i=0}^{N_\alpha}\alpha^{i}\,\frac{1}{F}\sum_{j=1}^{F}\left(f_{k-i,j}-\mu_k\right)^2$$

$$\hat{f}_{k,j} = \gamma\,\frac{f_{k,j}-\mu_k}{\left(\sigma_k^2+\epsilon\right)^{\Omega}} + \beta$$

where f_{k,j} is the j-th feature of the k-th frame of the feature spectrum, F is the feature dimension, γ and β are trainable parameters, and α, ε, Ω are hyperparameters; unless otherwise specified, Ω is set to 0.5 hereinafter. In the experiments the parameter N_α is set to a finite value so that the normalization can be implemented efficiently by a convolution operation.
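The following sketch implements this exponential moving average normalization, realizing the truncated exponential window of length N_α with a conv1d along the time axis, as the text suggests. Computing the variance as the windowed second moment minus the squared windowed mean is a simplification of the σ_k² above and, like the class name, an assumption; the defaults follow the parameter settings given later (α = 0.989, N_α = 640).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMANorm(nn.Module):
    def __init__(self, channels, alpha=0.989, n_alpha=640, omega=0.5, eps=1e-8):
        super().__init__()
        self.omega, self.eps = omega, eps
        # weights (1 - alpha) * alpha^lag, stored oldest-first so that conv1d
        # (cross-correlation) aligns lag 0 with the current frame
        w = (1 - alpha) * alpha ** torch.arange(n_alpha - 1, -1, -1.0)
        self.register_buffer("win", w.view(1, 1, -1))
        self.gamma = nn.Parameter(torch.ones(1, channels, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1))

    def forward(self, f):                          # f: (batch, F, frames)
        pad = self.win.shape[-1] - 1               # causal left padding
        m1 = f.mean(dim=1, keepdim=True)           # per-frame feature mean
        m2 = (f ** 2).mean(dim=1, keepdim=True)    # per-frame second moment
        mu = F.conv1d(F.pad(m1, (pad, 0)), self.win)
        var = (F.conv1d(F.pad(m2, (pad, 0)), self.win) - mu ** 2).clamp(min=0.0)
        fhat = (f - mu) / (var + self.eps) ** self.omega
        return fhat * self.gamma + self.beta
```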
The basic block diagram of the 1-D Conv module is shown in FIG. 3(a). The module consists, in sequence, of a 1 × 1 Conv layer with output dimension H, a parametric rectified linear unit (PReLU) layer, a normalization (Norm) layer, a depthwise convolution (D-Conv) layer with kernel length P, a PReLU layer, a Norm layer, and two parallel 1 × 1 Conv layers with output dimension B. The output of one terminal 1 × 1 Conv layer is added to the module input to form the residual input of the following module; the other 1 × 1 Conv layer provides the skip-connection output that forms part of the final output of the model. A sketch follows.
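A sketch of the 1-D Conv module just described, with the residual and skip 1 × 1 Conv outputs; GroupNorm stands in for the normalization layer, and the default sizes follow the parameter settings given later (B = 256, H = 512, P = 3).

```python
import torch.nn as nn

class Conv1DBlock(nn.Module):
    def __init__(self, B=256, H=512, P=3, dilation=1):
        super().__init__()
        pad = (P - 1) * dilation // 2            # keep the frame count unchanged
        self.body = nn.Sequential(
            nn.Conv1d(B, H, 1), nn.PReLU(), nn.GroupNorm(1, H),
            nn.Conv1d(H, H, P, padding=pad, dilation=dilation, groups=H),
            nn.PReLU(), nn.GroupNorm(1, H))
        self.res = nn.Conv1d(H, B, 1)            # residual-path output
        self.skip = nn.Conv1d(H, B, 1)           # skip-connection output

    def forward(self, x):
        h = self.body(x)
        return x + self.res(h), self.skip(h)

# dilation rates 1, 2, 4, ..., 2**(M-1) across the M blocks of a sub-module
blocks = [Conv1DBlock(dilation=2 ** m) for m in range(8)]
```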
The basic block diagram of the MI Conv module is shown in FIG. 3(b). This module accepts 4 feature streams as inputs: (1) the feature spectrum of the noisy near-end speech (stream A), (2) the feature spectrum of the model reference signal (stream B), (3) the residual output of the preceding module (stream C), and (4) the sum of the skip-connection outputs of the preceding modules (stream D). The sub operation is used to extract the characteristics of the residual echo signal from the output of the preceding module, and is defined as

$$f_{O_i} = f_A \odot g_{ob}\left(f_{D_i}\right)$$

$$f_{sub_i} = f_A - \beta_i \odot f_{O_i}$$

where f_A and f_{D_i} are the feature spectra of stream A and the i-th stream D, g_{ob} denotes the same operation as the output block of the suppressor module, β_i is a trainable parameter, and f_{sub_i} is the output of the sub operation. Since f_{O_i} can be regarded as a rough approximation of the near-end speech, f_{sub_i} accordingly extracts information about the residual echo. After this, stream B and the output of the sub operation are normalized and dimensionally scaled by two parallel normalization (Norm) layers and 1 × 1 Conv layers; the hyperparameter Ω of these normalization layers is set to 0.4 to retain certain amplitude characteristics. The two output streams are then concatenated (Concat) and processed through a normalization layer, a 1 × 1 Conv layer, a PReLU layer, a Norm layer, a D-Conv layer, a PReLU layer, and a Norm layer. The processed result is concatenated with the feature spectrum of stream C, processed by a 1 × 1 Conv layer, and added to the feature spectrum of stream C to form the residual output. Through these operations the MI Conv module fuses the information of the reference signal and the residual echo into stream C to guide the model in suppressing the residual echo. If the sub operation is omitted and the early near-end speech approximation f_{O_i} is processed directly by the Norm layer and the 1 × 1 Conv layer, the SISNR (scale-invariant signal-to-noise ratio) on the training set of this model drops slightly, by about 0.2 dB, without producing a significant difference.
In the training process, the f_{O_i} generated from stream A and stream D, together with the final output of the model, produce the overall training loss of the model, Loss_total, defined as

$$Loss_{total} = w\sum_{i} loss_i + loss_{last}$$

where loss_i is the loss of the time-domain waveform obtained by converting f_{O_i} through the decoder module, loss_last is the loss of the model output, and w is a weight parameter. In this embodiment, SISNR is used as the metric for loss_i and loss_last.
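A sketch of the SISNR-based training objective: the negative SI-SNR serves as each loss term, and the weighted combination mirrors the Loss_total reconstructed above; since the exact weighting scheme is uncertain, the form of `total_loss` is an assumption.

```python
import torch

def sisnr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR (lower loss = higher SI-SNR)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (
        (ref ** 2).sum(-1, keepdim=True) + eps)   # projection onto the target
    noise = est - proj
    sisnr = 10 * torch.log10((proj ** 2).sum(-1) / ((noise ** 2).sum(-1) + eps))
    return -sisnr.mean()

def total_loss(intermediate_ests, final_est, target, w=0.707):
    loss_i = sum(sisnr_loss(e, target) for e in intermediate_ests)
    return w * loss_i + sisnr_loss(final_est, target)
```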
3. Estimation of clean near-end speech using a multi-feature stream structure Conv-TasNet model
When the Conv-TasNet model with the multi-feature stream structure is used, it only needs to receive the noisy near-end speech and the adaptive filter output signal as inputs, and the clean near-end speech estimate is obtained at the model output. The basic block diagram for estimating clean near-end speech with this model is shown in FIG. 1: the far-end signal is first used to adaptively filter the signal captured by the microphone, and the noisy signal after adaptive filtering together with the adaptive filter signal are then input into the model, yielding a signal in which the residual echo is suppressed and the near-end speech is enhanced. A usage sketch follows.
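Enhancement-stage usage, assuming the MaskingModel sketch above and an already trained suppressor; the variable names are illustrative.

```python
import torch

model = MaskingModel(suppressor=trained_suppressor).eval()
with torch.no_grad():
    # noisy: microphone signal after adaptive filtering (residual echo remains)
    # ref: adaptive filter output signal used as the model reference
    enhanced = model(noisy.view(1, 1, -1), ref.view(1, 1, -1))
```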
Therefore, residual echo is suppressed, and a near-end speech enhancement result is obtained.
A simulation case is given below.
1. Training and testing samples and objective evaluation indices
When constructing the simulated training data, this embodiment considers the actual use scenario of a smart speaker: the TIMIT corpus is used for the near-end speech, and the music library of the MUSAN corpus together with the TIMIT corpus are used as far-end signals. Speech data of 400 speakers are randomly selected from the TIMIT corpus as training set data, and speech data of 40 speakers are randomly selected from the remaining speakers as test set data. For each speaker, 10 segments of 16 kHz sampled speech data are selected. 400 segments of speech data are drawn from the training set as a validation set. The MUSAN music library is divided into 4 s audio segments; 19269 segments are selected as training set data, 400 segments as validation set data, and 400 segments from songs not in the training or validation sets as the test set.
To construct the echo data, a clipping transform is first applied to the far-end signal, defined as

$$x_{clip}(n) = \begin{cases} -x_{max}, & x(n) < -x_{max} \\ x(n), & |x(n)| \le x_{max} \\ x_{max}, & x(n) > x_{max} \end{cases}$$

where x_max is 80% of the maximum value of x(n). A sigmoidal function is then applied to the clipped signal to simulate the nonlinear distortion of the loudspeaker, defined as

$$x_{NL}(n) = 4\left(\frac{2}{1+\exp\left(-a\cdot b(n)\right)} - 1\right)$$

$$b(n) = 1.5\,x_{clip}(n) - 0.3\,x_{clip}^2(n)$$

$$a = \begin{cases} 4, & b(n) > 0 \\ 0.5, & b(n) \le 0 \end{cases}$$
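In code, the loudspeaker-distortion simulation per the equations above reads as follows; the clipping form and the sigmoid constants follow Zhang and Wang (2018), which this reconstruction assumes.

```python
import numpy as np

def simulate_loudspeaker(x):
    """Clipping at 80% of the peak followed by a memoryless sigmoid."""
    x_max = 0.8 * np.max(np.abs(x))
    x_clip = np.clip(x, -x_max, x_max)
    b = 1.5 * x_clip - 0.3 * x_clip ** 2
    a = np.where(b > 0, 4.0, 0.5)
    return 4.0 * (2.0 / (1.0 + np.exp(-a * b)) - 1.0)
```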
To simulate a real room reverberation environment, 50 virtual small offices with length, width, and height between 2 m and 5 m and reverberation time T60 between 150 ms and 450 ms are constructed at random; virtual loudspeaker units and virtual microphone units are placed in each room, and a total of 500 room impulse responses are computed with the image source (virtual source) method. Of these, 400 are used to construct the training and validation sets and the remaining 100 to construct the test set. The signal after the sigmoidal transform is convolved with a room impulse response to obtain the simulated echo signal. Finally, a linear adaptive filter based on the Kalman algorithm adaptively filters the constructed echo signal using the corresponding far-end signal, producing the residual echo signal and the adaptive filter output signal. In this embodiment, the linear adaptive filter suppresses the echo signal energy by about 14.0 dB.
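Assembling one training example can then be sketched as below, reusing simulate_loudspeaker from the previous snippet; `adaptive_filter` stands in for the Kalman-based LAEC, and the helper name and the choice of the echo estimate as the reference signal are assumptions.

```python
from scipy.signal import fftconvolve

def make_training_example(near, far, rir, noise, adaptive_filter):
    echo = fftconvolve(simulate_loudspeaker(far), rir)[:len(near)]
    mic = near + echo + noise                    # simulated microphone signal
    y_hat, residual = adaptive_filter(far, mic)  # LAEC estimate and output
    noisy_near = residual                        # noisy near-end speech (input)
    return noisy_near, y_hat                     # y_hat: model reference signal
```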
In the training set, 36000 speech echo signals and 38578 music echo signals are constructed. In the validation set and the test set, 400 speech echo signals and 400 music echo signals each are constructed.
This embodiment employs the Perceptual Evaluation of Speech Quality (PESQ) index as the objective evaluation index of residual echo suppression performance.
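For reference, PESQ can be computed with the open-source `pesq` package (an implementation of ITU-T P.862); the wideband mode shown is a choice, not something the text specifies.

```python
from pesq import pesq  # pip install pesq

score = pesq(16000, clean_near_end, enhanced, 'wb')  # wideband PESQ at 16 kHz
```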
2. Parameter setting
In this embodiment, the encoder module parameter N is set to 512, L to 40, and S to 10. The suppressor module parameter R is set to 4 and M to 8. The 1-D Conv module parameter B is set to 256, H to 512, and P to 3. All 1 × 1 Conv layer output dimensions in the MI Conv module are set to 256, and the D-Conv layer kernel length is set to 128. The exponential moving average normalization parameters are α = 0.989 and N_α = 640. The weight parameter w of the overall model loss function is set to 0.707. During training the model is optimized with the Adam optimizer at a learning rate of 0.001. Training data are fed to the network in batches of two 4 s signals at a time. One traversal of the near-end speech training set constitutes a training epoch, and the model is trained for 120 epochs. After each epoch the model is evaluated on the validation set, and the optimizer learning rate is halved whenever the performance does not improve for 4 epochs. To improve training robustness, gradient clipping based on the two-norm is used, with the maximum two-norm set to 5.
3. Concrete implementation process of the method
Referring to fig. 1, the method is mainly divided into a training phase and an enhancement phase.
In the training stage, echo signals are recorded or constructed artificially; the corresponding far-end signals are used to construct residual echo signals through adaptive filtering, and these are superimposed with near-end speech and background noise to construct the noisy near-end speech. The adaptive filter signal, the far-end signal, or both are used as the reference signal of the model. The noisy near-end speech and the model reference signal serve as model inputs, and the clean near-end speech serves as the target output for training. In the enhancement stage, the signal obtained by adaptively filtering the signal acquired by the microphone, together with the model reference signal composed of the adaptive filter output signal and the far-end signal, are input into the model, yielding an estimate of the clean near-end speech.
To demonstrate the performance improvement of the present invention over existing methods, this embodiment compares the residual echo suppression method based on the bidirectional long short-term memory network (BLSTM) in the literature (Zhang H., Wang D. Deep Learning for Acoustic Echo Cancellation in Noisy and Double-Talk Scenarios [C]. Interspeech, 2018: 3239-3243.) and the residual echo suppression method based on the fully connected network (FCN) in the literature (Carbajal G., Serizel R., Vincent E., et al. Multiple-Input Neural Network-Based Residual Echo Suppression [C]. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: 231-235.). FIGS. 4(a) and 4(b) show histograms of PESQ values of the enhanced speech for the different methods at SERs of -14.2 dB and -18.2 dB, with the far-end signal being speech and music respectively. In the figure, the black bars represent the PESQ scores at an SER of -14.2 dB and the light gray bars the PESQ scores at an SER of -18.2 dB. Compared with existing deep-neural-network-based residual echo suppression methods, the near-end speech enhancement performance of the method of the present invention is significantly improved under all echo conditions.

Claims (3)

1. A residual echo suppression method based on a deep neural network with a multi-feature stream structure, characterized by comprising the following steps:
Step 1: using a clean speech signal, background noise, an echo signal, and the far-end signal corresponding to the echo signal, construct, through an adaptive filtering algorithm, noisy near-end speech containing residual echo and background noise together with the output signal of the adaptive filter;
Step 2: take the adaptive filter output signal, the far-end signal, or both as the reference signal; use the reference signal and the noisy near-end speech signal constructed in Step 1 as input features of a neural network model with a multi-feature stream structure, in which the model jointly processes feature stream A, containing the reference signal, and feature stream B, containing a shallow estimate of the near-end speech; train the model with clean near-end speech as the training target;
Step 3: use the trained multi-feature stream neural network model as a post-processing filter to suppress the residual echo and background noise in the signal processed by the adaptive filter and enhance the audio signal of the near-end speaker.
2. The method according to claim 1, characterized in that in Step 2, feature stream A containing the reference signal and feature stream B containing the shallow estimate of the near-end speech are jointly processed in one of the following ways: (1) subtract feature stream B from the feature stream containing the noisy near-end speech information to obtain a new feature stream C, pass streams A and C through a normalization layer and a convolution layer respectively, and then process them comprehensively; (2) pass streams A and B through a normalization layer and a convolution layer respectively and then process them comprehensively; where comprehensive processing refers to concatenation along the feature dimension, point-wise addition, subtraction, multiplication, and division, and operations equivalent to these.
3. The method according to claim 1, characterized in that the neural network model with the multi-feature stream structure extracts features from the time-domain waveform through an encoder module, processes them through several one-dimensional convolution modules with different dilation rates and a multi-feature stream convolution module in a suppressor module to obtain a mask for the estimated clean-speech feature spectrum, applies the mask to the noisy-speech feature spectrum output by the encoder to obtain an estimate of the clean-speech feature spectrum, and finally restores this estimate to a time-domain waveform through a decoder module.
CN202010940284.8A 2020-09-09 Residual echo suppression method based on multi-feature stream structure deep neural network (Pending) CN112037809A (en)

Priority Applications (1)

    • CN202010940284.8A (priority date and filing date 2020-09-09): Residual echo suppression method based on multi-feature stream structure deep neural network

Publications (1)

    • CN112037809A, published 2020-12-04

Family ID: 73585130

Country Status (1)

    • CN: CN112037809A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180040333A1 (en) * 2016-08-03 2018-02-08 Apple Inc. System and method for performing speech enhancement using a deep neural network-based signal
US20200201435A1 (en) * 2018-12-20 2020-06-25 Massachusetts Institute Of Technology End-To-End Deep Neural Network For Auditory Attention Decoding
CN111081268A (en) * 2019-12-18 2020-04-28 浙江大学 Phase-correlated shared deep convolutional neural network speech enhancement method
CN111341341A (en) * 2020-02-11 2020-06-26 腾讯科技(深圳)有限公司 Training method of audio separation network, audio separation method, device and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hongsheng Chen et al.: "Nonlinear Residual Echo Suppression Based on Multi-stream Conv-TasNet", arXiv:2005.07631v1 [eess.AS], pages 1-2 *
李曾玺 (Li Zengxi): "Research on Single-Channel Speech Separation Methods Based on Autoregressive Deep Neural Networks", China Doctoral Dissertations Full-text Database, no. 08 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022213825A1 (en) * 2021-04-06 2022-10-13 京东科技控股股份有限公司 Neural network-based end-to-end speech enhancement method and apparatus
CN113286047A (en) * 2021-04-22 2021-08-20 维沃移动通信(杭州)有限公司 Voice signal processing method and device and electronic equipment
CN113286047B (en) * 2021-04-22 2023-02-21 维沃移动通信(杭州)有限公司 Voice signal processing method and device and electronic equipment
CN113436636A (en) * 2021-06-11 2021-09-24 深圳波洛斯科技有限公司 Acoustic echo cancellation method and system based on adaptive filter and neural network
CN113362843A (en) * 2021-06-30 2021-09-07 北京小米移动软件有限公司 Audio signal processing method and device
CN113362843B (en) * 2021-06-30 2023-02-17 北京小米移动软件有限公司 Audio signal processing method and device
CN113707167A (en) * 2021-08-31 2021-11-26 北京地平线信息技术有限公司 Training method and training device for residual echo suppression model
CN114242100A (en) * 2021-12-16 2022-03-25 北京百度网讯科技有限公司 Audio signal processing method, training method and device, equipment and storage medium thereof


Legal Events

    • PB01: Publication
    • SE01: Entry into force of request for substantive examination