CN113707167A - Training method and training device for residual echo suppression model

Info

Publication number: CN113707167A
Application number: CN202111017286.0A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: feature, signal, residual echo, mixed audio, echo suppression
Inventors: 陈宏圣, 乐笑怀, 卢晶
Applicant/Assignee: Beijing Horizon Information Technology Co Ltd
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M9/00 Arrangements for interconnection not involving centralised switching
    • H04M9/08 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • H04M9/082 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic using echo cancellers

Abstract

The embodiment of the disclosure discloses a training method and a training device for a residual echo suppression model, wherein the training method comprises the following steps: generating a plurality of mixed audio signals based on the plurality of residual echo signals, the plurality of background noise signals and the plurality of near-end speech signals; determining a plurality of auxiliary signals corresponding to the plurality of mixed audio signals based on the plurality of mixed audio signals; training a residual echo suppression model based on the plurality of mixed audio signals and the plurality of auxiliary signals. The embodiment of the disclosure can effectively suppress the nonlinear residual echo signal, thereby improving the communication quality and enhancing the user experience.

Description

Training method and training device for residual echo suppression model
Technical Field
The present disclosure relates to the field of echo suppression technologies, and in particular, to a training method and a training device for a residual echo suppression model.
Background
In a communication system, a far-end signal is converted into an acoustic signal by a loudspeaker system, and the acoustic signal is collected by a microphone system through an echo acoustic path to generate an echo signal. The echo signal will severely interfere with the quality of the voice communication and degrade the accuracy of the voice recognition system. The technique of suppressing echo signals and extracting the speech signals of the near-end speaker is called echo suppression.
In the related art, linear echo suppression methods perform echo suppression by fitting a transfer function corresponding to the echo transfer path. However, when the echo path exhibits a non-negligible nonlinear effect, the performance of such echo suppression methods degrades considerably, and the residual echo therefore needs to be suppressed.
In the related art, the amplitude of the residual echo is estimated by using the far-end signal and the adaptive filter coefficient, and the residual echo signal is suppressed accordingly, but it is difficult to balance the suppression of the residual echo and the distortion of the near-end speech.
Disclosure of Invention
The present disclosure is proposed to solve the above technical problems. The embodiment of the disclosure provides a training method and a training device for a residual echo suppression model.
According to a first aspect of the embodiments of the present disclosure, there is provided a training method of a residual echo suppression model, including:
generating a plurality of mixed audio signals based on the plurality of residual echo signals, the plurality of background noise signals and the plurality of near-end voice signals, wherein each mixed audio signal comprises one residual echo signal, one background noise signal and one near-end voice signal;
determining a plurality of auxiliary signals corresponding to the plurality of mixed audio signals based on the plurality of mixed audio signals, wherein the auxiliary signal corresponding to each mixed audio signal is determined based on the far-end signal corresponding to the residual echo signal in each mixed audio signal;
training a residual echo suppression model based on the plurality of mixed audio signals and the plurality of auxiliary signals.
According to a second aspect of the embodiments of the present disclosure, there is provided a training apparatus for a residual echo suppression model, including:
a mixed audio signal generating module, configured to generate a plurality of mixed audio signals based on a plurality of residual echo signals, a plurality of background noise signals, and a plurality of clean near-end speech signals, wherein each mixed audio signal includes a residual echo signal, a background noise signal, and a clean near-end speech signal;
an auxiliary signal determination module configured to determine a plurality of auxiliary signals corresponding to the plurality of mixed audio signals based on the plurality of mixed audio signals, wherein the auxiliary signal corresponding to each mixed audio signal is determined based on the far-end signal corresponding to the residual echo signal in each mixed audio signal;
a model training module to train a residual echo suppression model based on the plurality of mixed audio signals and a plurality of auxiliary signals.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the training method of the residual echo suppression model according to the first aspect.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instruction from the memory, and execute the instruction to implement the training method of the residual echo suppression model according to the first aspect.
With the training method and training device for a residual echo suppression model provided by the embodiments of the present disclosure, a residual echo suppression model trained on mixed audio signals, each comprising a residual echo signal, a background noise signal and a clean near-end speech signal, together with the corresponding auxiliary signals, can effectively suppress nonlinear residual echo signals, thereby improving call quality and enhancing user experience.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a flowchart illustrating a training method of a residual echo suppression model according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a training method of a residual echo suppression model in one example of the present disclosure.
Fig. 3 is a block diagram of a residual echo suppression model 220 in an example of the present disclosure.
Fig. 4 is a schematic diagram of the operation of an encoder in a residual echo suppression model in one example of the present disclosure.
Fig. 5 is a schematic diagram of the operation of the decoder in the residual echo suppression model in the example corresponding to fig. 4.
Fig. 6 is a schematic diagram of the operation of an encoder in a residual echo suppression model in another example of the present disclosure.
Fig. 7 is a schematic diagram of the operation of a decoder in the residual echo suppression model in the example corresponding to fig. 6.
FIG. 8 is a schematic diagram of the operation of one network layer in a two-way recurrent neural network in one example of the present disclosure.
Fig. 9 is a histogram of PESQ values of enhanced speech when the far-end signal is speech and music for different methods in a simulated echo scenario in one example of the present disclosure.
Fig. 10 is a histogram of PESQ values of enhanced speech when the far-end signal is speech and music for different methods in a recorded echo scenario in one example of the present disclosure.
FIG. 11 is a graph of parameter quantities for different models in one example of the present disclosure.
Fig. 12 is a histogram of PESQ values of enhanced speech when the far-end signal is speech and music for the pre-trained model and the fine-tuned model in a simulated echo scenario in one example of the present disclosure.
Fig. 13 is a histogram of PESQ values of enhanced speech when the far-end signal is speech and music for the pre-trained model and the fine-tuned model in a recorded echo scenario in one example of the present disclosure.
Fig. 14 is a block diagram of a training apparatus for a residual echo suppression model according to an embodiment of the present disclosure.
FIG. 15 is a block diagram of the structure of the model training module 143 in one embodiment of the present disclosure.
Fig. 16 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and are not intended to imply any particular technical meaning, nor is the necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Exemplary method
Fig. 1 is a schematic flowchart of a training method of a residual echo suppression model according to an embodiment of the present disclosure. The embodiment can be applied to an electronic device, as shown in fig. 1, and includes the following steps:
s1: a plurality of mixed audio signals are generated based on the plurality of residual echo signals, the plurality of background noise signals, and the plurality of near-end speech signals. Wherein each mixed audio signal comprises a residual echo signal, a background noise signal and a near-end speech signal.
Fig. 2 is a schematic diagram of a training method of a residual echo suppression model in one example of the present disclosure. As shown in fig. 2, after each far-end signal and the corresponding echo signal are filtered by the adaptive filter 210, a linear echo signal and a residual echo signal can be obtained. The linear echo signal is an echo signal linearly related to the far-end signal, and the residual echo signal is an echo signal having no linear characteristic except the linear echo signal.
In the embodiment of the present disclosure, the adaptive filter 210 refers to a filter that changes its parameters and structure using an adaptive algorithm according to changes in the environment. The coefficients of the adaptive filter are time-varying coefficients updated by the adaptive algorithm; that is, the coefficients continuously and automatically adapt to a given signal to obtain the desired response. Specifically, the embodiments of the present disclosure use the adaptive filter to separate, based on the far-end signal and the echo signal, the linear echo signal having a linear characteristic from the residual echo signal having no linear characteristic in the echo signal.
After separating the linear echo signal and the residual echo signal in an echo signal, mixing the residual echo signal with a preset background noise signal and a preset near-end voice signal to generate a mixed audio signal. The near-end speech signal corresponds to the far-end signal, and the near-end speech signal is a pure speech signal. For example, the near-end speech signal is a speech signal that is spoken by the speaker through the near-end without noise.
In the same manner as described above for generating the mixed audio signal, a plurality of mixed audio signals may be generated for a plurality of residual echo signals, a plurality of background noise signals, and a plurality of near-end speech signals.
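For illustration, the following Python sketch mixes one residual echo signal, one background noise signal and one near-end speech signal into a mixed audio signal. The `mix_at_ratio` helper is an assumption for this example; note that in the embodiment below the SER is defined before adaptive filter processing, whereas here the ratio is applied directly to the residual echo as a simplification.

```python
import numpy as np

def mix_at_ratio(target: np.ndarray, interference: np.ndarray,
                 ratio_db: float) -> np.ndarray:
    """Scale `interference` so that the target/interference energy ratio is ratio_db."""
    p_t = np.mean(target ** 2)
    p_i = np.mean(interference ** 2) + 1e-12
    gain = np.sqrt(p_t / (p_i * 10.0 ** (ratio_db / 10.0)))
    return interference * gain

def make_mixed_audio(near_end: np.ndarray, residual_echo: np.ndarray,
                     noise: np.ndarray, ser_db: float, snr_db: float) -> np.ndarray:
    """One mixed audio signal = near-end speech + scaled residual echo + scaled noise."""
    echo = mix_at_ratio(near_end, residual_echo, ser_db)
    bg = mix_at_ratio(near_end, noise, snr_db)
    return near_end + echo + bg
```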
S2: a plurality of auxiliary signals corresponding to the plurality of mixed audio signals are determined based on the plurality of mixed audio signals, wherein the auxiliary signal corresponding to each mixed audio signal is determined based on the far-end signal corresponding to the residual echo signal in each mixed audio signal.
Specifically, a far-end signal corresponding to the residual echo signal in each mixed audio signal may be used as an auxiliary signal corresponding to each mixed audio signal, or a linear echo signal obtained by passing the far-end signal corresponding to the residual echo signal and the echo signal in each mixed audio signal through an adaptive filter may be used as an auxiliary signal corresponding to each mixed audio signal.
S3: a residual echo suppression model is trained based on the plurality of mixed audio signals and the plurality of auxiliary signals.
Specifically, each mixed audio signal and the corresponding auxiliary signal are input into the initial residual echo suppression model 220 for training, and when a preset condition is satisfied, the final residual echo suppression model can be obtained. The preset condition is that the number of model iterations reaches the preset number of iterations, or the accuracy of the model output result reaches the preset accuracy.
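As a minimal illustration of the preset condition, the stopping rule might be sketched as follows; the function name and the default thresholds are hypothetical, not part of the disclosure.

```python
def should_stop(iteration: int, accuracy: float,
                max_iters: int = 100_000, target_acc: float = 0.95) -> bool:
    # stop when the iteration count reaches the preset number of iterations,
    # or the accuracy of the model output reaches the preset accuracy
    return iteration >= max_iters or accuracy >= target_acc
```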
In the embodiments of the disclosure, a residual echo suppression model trained on mixed audio signals, each including a residual echo signal, a background noise signal and a clean near-end speech signal, together with the auxiliary signals, can effectively suppress nonlinear residual echo signals, thereby improving communication quality and enhancing user experience.
Fig. 3 is a block diagram of a residual echo suppression model 220 in an example of the present disclosure. As shown in fig. 3, step S3 includes:
s3-1: one mixed audio signal at a time is obtained from the plurality of mixed audio signals, and one auxiliary signal is obtained from the plurality of auxiliary signals. Wherein each acquired mixed audio signal corresponds to each acquired auxiliary signal.
S3-2: feature extraction is respectively performed on the currently acquired mixed audio signal and the currently acquired auxiliary signal through an encoder 2201/an encoder 2202 of the residual echo suppression model, so that a first feature tensor and a second feature tensor are obtained. The first feature tensor is a feature tensor of the currently acquired mixed audio signal, and the second feature tensor is a feature tensor of the currently acquired auxiliary signal.
In an embodiment, the residual echo suppression model 220 employs two encoders, one encoder 2201 for extracting a first feature tensor for the mixed audio signal and the other encoder 2202 for extracting a second feature tensor for the auxiliary signal.
S3-3: the first feature tensor and the second feature tensor are processed through a two-way recurrent neural network 2203 of the residual echo suppression model, and a third feature tensor is obtained.
In this embodiment, the two-way recurrent neural network 2203 is an end-to-end time domain speech separation network, and the two-way recurrent neural network 2203 has better performance for speech separation tasks. In this embodiment, the two-way recurrent neural network 2203 receives the first feature tensor obtained by processing the mixed audio signal on the one hand and the second feature tensor obtained by processing the auxiliary signal on the other hand, and finally the third feature tensor can be output by performing model processing on the first feature tensor and the second feature tensor.
S3-4: the first feature tensor and the third feature tensor are subjected to feature spectrum estimation of the clean speech signal by the decoder 2204 of the residual echo suppression model, and parameters of the residual echo suppression model are adjusted based on the feature spectrum estimation result to train the residual echo network model 220.
Specifically, the third feature tensor is decoded and transformed by the decoder 2204 of the residual echo suppression model, and then multiplied by the feature spectrum of the mixed audio signal point by point, so as to obtain the feature spectrum estimation result of the pure speech signal. The feature spectrum estimation result is compared with the feature spectrum of the near-end speech signal, and the parameters of the residual echo suppression model 220 are updated in a back propagation manner based on the comparison result, so as to train the residual echo network model 220.
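The data flow of steps S3-1 to S3-4 could be sketched in PyTorch as follows; `ResidualEchoSuppressor` and its constructor arguments are illustrative stand-ins for encoders 2201/2202, the two-way recurrent neural network 2203 and the decoder 2204, not the disclosure's exact implementation.

```python
import torch.nn as nn

class ResidualEchoSuppressor(nn.Module):
    def __init__(self, encoder_mix, encoder_aux, dprnn, decoder):
        super().__init__()
        self.encoder_mix = encoder_mix   # encoder 2201: extracts the first feature tensor
        self.encoder_aux = encoder_aux   # encoder 2202: extracts the second feature tensor
        self.dprnn = dprnn               # two-way recurrent neural network 2203
        self.decoder = decoder           # decoder 2204

    def forward(self, mixed, aux):
        feat_mix = self.encoder_mix(mixed)          # first feature tensor (S3-2)
        feat_aux = self.encoder_aux(aux)            # second feature tensor (S3-2)
        feat_out = self.dprnn(feat_mix, feat_aux)   # third feature tensor (S3-3)
        # the decoder combines the first and third tensors into the
        # clean-speech feature spectrum estimate (S3-4)
        return self.decoder(feat_out, feat_mix)
```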
In this embodiment, the performance of the voice separation task can be improved based on the two-way recurrent neural network, and then the separation effect of the residual echo signal in the mixed audio signal is improved, so that the suppression effect of the residual echo suppression model on the residual echo signal can be improved.
In one embodiment of the present disclosure, step S3-2 includes:
S3-2-A-1: the encoder 2201 of the residual echo suppression model performs one-dimensional convolution on the currently acquired mixed audio signal to obtain a two-dimensional characteristic spectrum of the currently acquired mixed audio signal. And the encoder 2202 of the residual echo suppression model performs one-dimensional convolution on the currently acquired auxiliary signal to obtain a two-dimensional characteristic spectrum of the currently acquired auxiliary signal, wherein two dimensions of the two-dimensional characteristic spectrum are a time dimension and a characteristic dimension respectively. The one-dimensional convolution has a preset overlap ratio.
Fig. 4 is a schematic diagram of the operation of an encoder in a residual echo suppression model in one example of the present disclosure. As shown in fig. 4, a signal (for example, a currently acquired mixed audio signal or a currently acquired auxiliary signal) u is subjected to one-dimensional convolution Conv1d to obtain a two-dimensional characteristic spectrum.
S3-2-A-2: after the dimension reduction of the two-dimensional characteristic spectrum of the currently acquired mixed audio signal is carried out through a full connection layer, the two-dimensional characteristic spectrum of the currently acquired mixed audio signal is divided into a plurality of first characteristic blocks with preset overlapping rates, and after the dimension reduction of the two-dimensional characteristic spectrum of the currently acquired auxiliary signal is carried out through the full connection layer, the two-dimensional characteristic spectrum of the currently acquired auxiliary signal is divided into a plurality of second characteristic blocks with preset overlapping rates.
Referring to fig. 4, after the two-dimensional feature spectrum is activated by a linear rectification function (ReLU), its dimension is reduced to C by a fully connected layer, and it is then divided into T sub-blocks with length K and a preset overlap ratio, wherein FC0 denotes the fully connected layer. In this example, the output dimension of the one-dimensional convolution layer is N, the convolution kernel size is L, and the overlap ratio is 50%.
S3-2-A-3: and performing feature splicing on the plurality of first feature blocks with the preset overlapping rate to obtain a first feature tensor, and performing feature splicing on the plurality of second feature blocks with the preset overlapping rate to obtain a second feature tensor.
Referring to fig. 4 again, the T sub-blocks with length K and the preset overlap ratio are spliced together to obtain a three-dimensional feature tensor of shape $T \times K \times C$.
In this embodiment, a three-dimensional feature tensor can be extracted from a currently acquired mixed audio signal by a time domain processing method, a corresponding three-dimensional feature tensor can be extracted from a currently acquired auxiliary signal by the same time domain processing method, and a decoding method and a two-way recurrent neural network corresponding to the time domain processing method of an encoder are used in cooperation with a decoder to realize time domain waveform processing and waveform restoration of the audio signal, thereby ensuring the accuracy and pertinence of a residual echo suppression model.
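A hedged PyTorch sketch of this time-domain encoder follows; the default values of N, L, K and C match the parameter settings given later in this disclosure, while the use of `unfold` for the 50% block overlap and the returned pair of tensors are implementation assumptions.

```python
import torch
import torch.nn as nn

class TimeDomainEncoder(nn.Module):
    def __init__(self, n_filters=256, kernel=8, block_len=100, feat_dim=128):
        super().__init__()
        # one-dimensional convolution with 50% overlap (stride = kernel // 2)
        self.conv = nn.Conv1d(1, n_filters, kernel, stride=kernel // 2)
        self.fc0 = nn.Linear(n_filters, feat_dim)      # dimension reduction to C
        self.block_len = block_len

    def forward(self, u):                               # u: (batch, samples)
        spec = torch.relu(self.conv(u.unsqueeze(1)))    # 2-D feature spectrum (batch, N, frames)
        feat = self.fc0(spec.transpose(1, 2))           # (batch, frames, C)
        hop = self.block_len // 2                       # 50% block overlap
        blocks = feat.unfold(1, self.block_len, hop)    # (batch, T, C, K)
        # stacked blocks form the three-dimensional feature tensor T x K x C;
        # spec is kept for the point-by-point masking in the decoder
        return blocks.permute(0, 1, 3, 2), spec
```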
In one embodiment of the present disclosure, step S3-4 includes: performing an overlap-add operation on the third feature tensor according to the preset overlap ratio to obtain a target two-dimensional feature; and performing feature-dimension raising on the target two-dimensional feature through a fully connected layer with an activation function to obtain a mask estimation feature of the time-domain feature, and taking the point-by-point multiplication result of the mask estimation feature and the two-dimensional feature spectrum of the mixed audio signal as the feature spectrum estimation result.
Fig. 5 is a schematic diagram of the operation of the decoder in the residual echo suppression model in the example corresponding to fig. 4. As shown in fig. 5, the three-dimensional feature tensor of shape $T \times K \times C$ output by the last layer of the two-way recurrent neural network is transformed into a two-dimensional feature Q by applying an overlap-add operation with an overlap ratio of 50%. A fully connected layer with a ReLU activation function then raises the feature dimension of Q to N, yielding the mask estimation feature of the time-domain feature, and the point-by-point multiplication result of the mask estimation feature and the two-dimensional feature spectrum of the mixed audio signal is taken as the feature spectrum estimation result.
In this embodiment, the decoding portion of the time domain processing method of the decoder, the encoding portion of the encoder, and the two-way recurrent neural network are used in cooperation, so that the time domain waveform processing and waveform restoration of the audio signal can be realized, and the accuracy of the residual echo suppression model is ensured.
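Correspondingly, a hedged sketch of this time-domain decoding step: `torch.nn.functional.fold` performs the 50% overlap-add, and a fully connected layer with ReLU raises the feature dimension back to N before the point-by-point masking; the exact layer arrangement is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeDomainDecoder(nn.Module):
    def __init__(self, feat_dim=128, n_filters=256):
        super().__init__()
        self.fc = nn.Linear(feat_dim, n_filters)   # dimension raising C -> N

    def forward(self, blocks, mix_spec):
        # blocks: (batch, T, K, C) third feature tensor; mix_spec: (batch, N, frames)
        b, t, k, c = blocks.shape
        hop = k // 2                                # 50% overlap
        frames = (t - 1) * hop + k
        # overlap-add the T blocks back into a two-dimensional feature Q
        x = blocks.permute(0, 3, 2, 1).reshape(b, c * k, t)
        q = F.fold(x, output_size=(frames, 1), kernel_size=(k, 1), stride=(hop, 1))
        q = q.squeeze(-1).transpose(1, 2)               # (batch, frames, C)
        mask = torch.relu(self.fc(q)).transpose(1, 2)   # mask estimation feature (batch, N, frames)
        return mask * mix_spec   # point-by-point feature spectrum estimate
```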
In another embodiment of the present disclosure, step S3-2 includes:
S3-2-B-1: the encoder 2201 of the residual echo suppression model 220 performs short-time fourier transform on the one-dimensional waveform of the currently acquired mixed audio signal to obtain a first complex time-frequency spectrum characteristic, and the encoder 2202 of the residual echo suppression model 220 performs short-time fourier transform on the one-dimensional waveform of the auxiliary signal to obtain a second complex time-frequency spectrum characteristic.
Fig. 6 is a schematic diagram of the operation of an encoder in a residual echo suppression model in another example of the present disclosure. As shown in fig. 6, a short-time Fourier transform (STFT) is performed on a signal u (for example, the currently acquired mixed audio signal or the currently acquired auxiliary signal) to obtain a first complex time-frequency spectral feature $z \in \mathbb{C}^{T' \times F}$, wherein the window function of the short-time Fourier transform is a Hamming window of Q points with an overlap ratio of 50%, $T'$ is the number of frames, and $F = Q/2 + 1$ is the effective frequency-domain dimension.
S3-2-B-2: and performing feature splicing on the real number part and the imaginary number part of the first complex time-frequency spectrum feature to obtain a first feature tensor, and performing feature splicing on the real number part and the imaginary number part of the second complex time-frequency spectrum feature to obtain a second feature tensor.
Feature splicing is performed on the real part and the imaginary part of a complex time-frequency spectrum feature (for example, the first or the second complex time-frequency spectrum feature) to obtain a three-dimensional feature tensor $z' \in \mathbb{R}^{T' \times F \times 2}$. The three-dimensional feature tensor is then processed by a two-dimensional convolution with a $5 \times 5$ kernel and a $1 \times 2$ stride to obtain a three-dimensional feature tensor of shape $T' \times K' \times C'$, where $K'$ is the frequency dimension after convolution and $C'$ is the number of output channels.
In the embodiment of the present disclosure, T ', K ', and C ' of the output tensor dimension of the encoder in the time-frequency domain processing method correspond to T, K, and C of the output tensor dimension of the encoder in the time-domain processing method, respectively.
In this embodiment, a three-dimensional feature tensor can be extracted from a currently acquired mixed audio signal by a time-frequency domain processing method, a corresponding three-dimensional feature tensor can be extracted from a currently acquired auxiliary signal by the same time-frequency domain processing method, and time-domain waveform processing and waveform restoration of the audio signal can be realized by using a decoding method corresponding to the time-frequency domain processing method of the encoder and a two-way recurrent neural network in cooperation with a decoder, so that accuracy and pertinence of a residual echo suppression model are ensured.
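A sketch of this time-frequency-domain encoder under the parameter settings stated later (400-point Hamming window, 50% overlap, so F = 201): the time padding that preserves the frame dimension is an assumption made so that the output matches the $T' \times K' \times C'$ shape described above.

```python
import torch
import torch.nn as nn

class TFDomainEncoder(nn.Module):
    def __init__(self, n_fft=400, out_ch=128):
        super().__init__()
        self.n_fft, self.hop = n_fft, n_fft // 2        # 50% overlap
        self.register_buffer("window", torch.hamming_window(n_fft))
        # 5x5 kernel, 1x2 stride over (time, frequency); time padding of 2
        # is assumed so that the frame dimension T' is preserved
        self.conv = nn.Conv2d(2, out_ch, kernel_size=5, stride=(1, 2), padding=(2, 0))

    def forward(self, u):                               # u: (batch, samples)
        z = torch.stft(u, self.n_fft, self.hop, window=self.window,
                       return_complex=True)             # (batch, F, T'), F = n_fft//2 + 1
        z = z.transpose(1, 2)                           # (batch, T', F)
        feats = torch.stack([z.real, z.imag], dim=1)    # splice real/imag: (batch, 2, T', F)
        return self.conv(feats), z                      # (batch, C', T', K'), complex spectrum
```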
In one embodiment of the present disclosure, step S3-4 includes: after the third feature tensor is processed by two full connection layers, the deconvolution is performed by two deconvolution layers corresponding to the convolution layers of the encoder 2201/the encoder 2202 of the residual echo suppression model 220, so as to obtain a first output feature and a second output feature, wherein the first output feature is a one-dimensional feature, and the second output feature is a two-dimensional feature. And multiplying the first output characteristic, the second output characteristic and the time-frequency spectrum amplitude value of the first complex time-frequency spectrum characteristic to obtain a characteristic spectrum estimation result of the pure voice signal.
Fig. 7 is a schematic diagram of the operation of a decoder in the residual echo suppression model in the example corresponding to fig. 6. As shown in fig. 7, the three-dimensional feature tensor output by the last layer of the two-way recurrent neural network is first processed by two fully connected layers, and then deconvolved by two deconvolution layers TransConv_A and TransConv_P with a kernel size of $1 \times 5$ and a stride of $1 \times 2$. The output dimension of TransConv_A is 1 with a ReLU activation function, used for estimating the amplitude mask; the output dimension of TransConv_P is 2, used for estimating the phase information. The output of TransConv_P is normalized so that the real and imaginary parts of the phase information satisfy the unit-magnitude constraint. Finally, the amplitude mask and the phase information are multiplied with the time-frequency spectrum amplitude of the residual signal to obtain the feature spectrum estimation result of the clean speech signal, which is restored to a time-domain waveform by an inverse short-time Fourier transform operation.
In this embodiment, the time-domain waveform processing and waveform restoration of the audio signal can be realized by the decoding part in the time-frequency domain processing method of the decoder in cooperation with the encoding part of the encoder and the two-way recurrent neural network, thereby ensuring the accuracy of the residual echo suppression model.
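A hedged sketch of this decoder: two fully connected layers, then TransConv_A producing the amplitude mask and TransConv_P producing normalized phase information; the internal dimensions and the epsilon in the normalization are assumptions.

```python
import torch
import torch.nn as nn

class TFDomainDecoder(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        # two fully connected layers applied along the channel dimension
        self.fc = nn.Sequential(nn.Linear(ch, ch), nn.ReLU(), nn.Linear(ch, ch))
        self.trans_a = nn.ConvTranspose2d(ch, 1, kernel_size=(1, 5), stride=(1, 2))
        self.trans_p = nn.ConvTranspose2d(ch, 2, kernel_size=(1, 5), stride=(1, 2))

    def forward(self, feats, z_mix):
        # feats: (batch, C', T', K') third feature tensor
        # z_mix: complex spectrum of the mixed audio signal, (batch, T', F)
        h = self.fc(feats.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        amp = torch.relu(self.trans_a(h)).squeeze(1)    # amplitude mask (batch, T', F)
        phase = self.trans_p(h)                         # (batch, 2, T', F)
        # normalize so that real/imag parts of the phase have magnitude 1
        phase = phase / (phase.norm(dim=1, keepdim=True) + 1e-8)
        est = amp * z_mix.abs() * torch.complex(phase[:, 0], phase[:, 1])
        return est   # feature spectrum estimate; an iSTFT restores the waveform
```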
In the present embodiment, the two-way recurrent neural network employs 6 network layers. FIG. 8 is a schematic diagram of the operation of one network layer in a two-way recurrent neural network in one example of the present disclosure. As shown in fig. 8, the processing manner of each network layer is exactly the same in the time-domain processing method and the time-frequency-domain processing method, so no distinction is made in describing the processing of one network layer. Each network layer in the two-way recurrent neural network comprises two groups of RNN layers, a fully connected layer and a normalization layer, and is used for processing the two feature tensors Stream A and Stream B, where Stream A is the feature tensor of the mixed audio signal and Stream B is the feature tensor of the auxiliary signal. Denote the input features of Stream A and Stream B as $\mathbf{U}^A$ and $\mathbf{U}^B$.

The RNN in the intra-block module is a bi-directional RNN applied along the intra-block dimension (the second dimension), where the output dimension of the RNN for each direction is half the input dimension; the RNN in the inter-block module is a unidirectional RNN applied along the time dimension (the first dimension), with the output dimension the same as the input dimension:

$$\hat{\mathbf{U}}^A = f_R(\mathbf{U}^A), \qquad \hat{\mathbf{U}}^B = f_R(\mathbf{U}^B),$$

where $\hat{\mathbf{U}}^A$ and $\hat{\mathbf{U}}^B$ are the outputs of the RNN layers and $f_R$ represents an RNN layer. The outputs of the two groups of RNNs are then mixed in a weighted manner in order to efficiently integrate the information in the different feature streams:

$$\mathbf{M}^A = \mathbf{w}^{AA} \odot \hat{\mathbf{U}}^A + \mathbf{w}^{AB} \odot \hat{\mathbf{U}}^B, \qquad \mathbf{M}^B = \mathbf{w}^{BA} \odot \hat{\mathbf{U}}^A + \mathbf{w}^{BB} \odot \hat{\mathbf{U}}^B,$$

where $\odot$ denotes element-by-element multiplication and each $\mathbf{w}$ is a learnable element-by-element mixing parameter. After the mixed features are respectively spliced with the original inputs of the corresponding feature streams, the spliced dimensionality is mapped back to the original dimensionality through the fully connected layer and added to the original input features of the corresponding feature stream:

$$\mathbf{O}^A = \mathbf{U}^A + f_{FC}\big([\mathbf{M}^A, \mathbf{U}^A]\big), \qquad \mathbf{O}^B = \mathbf{U}^B + f_{FC}\big([\mathbf{M}^B, \mathbf{U}^B]\big),$$

where $f_{FC}$ represents a fully connected layer and $[\cdot,\cdot]$ represents a dimension splicing operation. The splicing dimension and the affine transformation dimension in the intra-block module are the intra-block dimension; the splicing dimension and the affine transformation dimension in the inter-block module are the channel dimension (the third dimension). A normalization layer is applied to $\mathbf{O}^A$ and $\mathbf{O}^B$ to obtain the outputs:

$$\mathbf{Y}^A = f_n(\mathbf{O}^A), \qquad \mathbf{Y}^B = f_n(\mathbf{O}^B),$$

where $f_n$ represents the normalization layer. The normalization used is Group Normalization with a group number of 2. In the normalization function, the input is first divided into two groups along the channel dimension, and the normalization operation is then carried out frame by frame within each group:

$$\mu_l = \operatorname{mean}(\mathbf{z}_l), \qquad \sigma_l^2 = \operatorname{var}(\mathbf{z}_l), \qquad \hat{\mathbf{z}}_l = \boldsymbol{\gamma} \odot \frac{\mathbf{z}_l - \mu_l}{\sqrt{\sigma_l^2 + \epsilon}} + \boldsymbol{\beta},$$

where $\mathbf{z}_l$ represents the features of the $l$-th frame within one group, and $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are learnable parameters. In general, the information of the two feature streams Stream A and Stream B is processed by the intra-block module and the inter-block module in sequence within a single network layer to obtain the output of that network layer. Except for the last network layer, the output serves as the input of the next network layer.
In this embodiment, the two feature tensors produced by the encoders can be processed layer by layer, in the order of the network layers, through the multiple network layers of the two-way recurrent neural network, finally generating a feature tensor that integrates the features of the mixed audio signal and the auxiliary signal.
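The following is a hedged PyTorch sketch of one intra-block module of such a dual-stream network layer; the shapes of the element-wise mixing weights and the use of `nn.GroupNorm` as an approximation of the frame-by-frame group normalization are illustrative assumptions. An inter-block module would analogously use unidirectional GRUs applied along the time dimension with full output dimension.

```python
import torch
import torch.nn as nn

class DualStreamIntraBlock(nn.Module):
    def __init__(self, dim=128, block_len=100):
        super().__init__()
        # bi-directional GRU: per-direction output dim is half the input dim
        self.rnn_a = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.rnn_b = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        # learnable element-by-element mixing parameters
        self.w_aa = nn.Parameter(torch.ones(block_len, dim))
        self.w_ab = nn.Parameter(torch.zeros(block_len, dim))
        self.w_ba = nn.Parameter(torch.zeros(block_len, dim))
        self.w_bb = nn.Parameter(torch.ones(block_len, dim))
        self.fc_a = nn.Linear(2 * dim, dim)   # map spliced features back to dim
        self.fc_b = nn.Linear(2 * dim, dim)
        self.norm_a = nn.GroupNorm(2, dim)    # Group Normalization, 2 groups
        self.norm_b = nn.GroupNorm(2, dim)

    def forward(self, a, b):                  # a, b: (batch*T, K, C)
        ha, _ = self.rnn_a(a)
        hb, _ = self.rnn_b(b)
        ma = self.w_aa * ha + self.w_ab * hb  # weighted element-wise mixing
        mb = self.w_ba * ha + self.w_bb * hb
        oa = a + self.fc_a(torch.cat([ma, a], dim=-1))   # splice, map back, add
        ob = b + self.fc_b(torch.cat([mb, b], dim=-1))
        ya = self.norm_a(oa.transpose(1, 2)).transpose(1, 2)
        yb = self.norm_b(ob.transpose(1, 2)).transpose(1, 2)
        return ya, yb
```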
In an embodiment of the present disclosure, the training method of the residual echo suppression model further includes: when training the residual echo network model, using a target signal-to-noise ratio as the training target of the residual echo network model, wherein the target signal-to-noise ratio is a scale-invariant signal-to-noise ratio.

Specifically, maximizing the scale-invariant signal-to-noise ratio (SISNR) is used as the training target of the model. SISNR is defined as follows:

$$\mathrm{SISNR} = 10 \log_{10} \frac{\|s_{\text{target}}\|^2}{\|e_{\text{noise}}\|^2}, \qquad s_{\text{target}} = \frac{\langle \hat{s}, s \rangle\, s}{\|s\|^2}, \qquad e_{\text{noise}} = \hat{s} - s_{\text{target}},$$

where $\hat{s}$ and $s$ are the estimated and the original clean speech, respectively, and $\|s\|$ represents the 2-norm of $s$.

In this embodiment, using the scale-invariant signal-to-noise ratio as the target of model training ensures model precision while accelerating model training.
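A direct implementation of this training objective in PyTorch might look as follows (maximizing SISNR by minimizing its negative); the zero-mean step is a common implementation detail assumed here, not stated in the disclosure.

```python
import torch

def sisnr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # est, ref: (batch, samples); zero-mean for scale invariance (assumed detail)
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # s_target = <est, ref> ref / ||ref||^2 ; e_noise = est - s_target
    s_target = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    sisnr = 10.0 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps))
    return -sisnr.mean()   # maximize SISNR by minimizing its negative
```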
The technical effects of the present disclosure are illustrated through the following simulation cases.
Training and testing samples and objective evaluation indexes. When constructing the simulation training data, considering the actual use scenario of a smart speaker, the LibriSpeech corpus is used as the near-end speech, and the music corpus in MUSAN together with LibriSpeech is used as the far-end signal. For the speech data, 225, 25 and 40 different speakers were randomly selected for the training set, validation set and test set, respectively, and segmented to obtain 26556, 1083 and 920 segments of 4 s audio sampled at 16 kHz. For the music data, 497, 48 and 115 pieces of music were randomly selected for the training set, validation set and test set and divided into 101956, 1083 and 920 segments of 4 s audio sampled at 16 kHz, respectively.
To construct the echo data, a soft or hard clipping transform is first applied to the far-end signal, defined as follows:

$$x_{\text{hard}}(n) = \begin{cases} -x_{\max}, & x(n) < -x_{\max} \\ x(n), & |x(n)| \le x_{\max} \\ x_{\max}, & x(n) > x_{\max} \end{cases} \qquad x_{\text{soft}}(n) = \frac{x_{\max}\, x(n)}{\sqrt{x_{\max}^2 + x^2(n)}}$$

where the clipping threshold is determined by $x_{\max} = \Theta \cdot \max(\mathrm{abs}(x(n)))$, and the parameter $\Theta$ of the soft and hard clipping functions is randomly chosen from the set {0.6, 0.8, 0.9}. The clipped signal is then subjected to a sigmoidal function to simulate the nonlinear distortion of the loudspeaker, defined as follows:

$$x_{\text{NL}}(n) = \frac{2}{1 + \exp(-a \cdot b(n))} - 1, \qquad b(n) = 1.5\, x_{\text{clip}}(n) - 0.3\, x_{\text{clip}}^2(n), \qquad a = \begin{cases} a_p, & b(n) > 0 \\ a_n, & b(n) \le 0 \end{cases}$$

where $x_{\text{clip}}$ is the clipped signal and the parameters $(a_p, a_n)$ are randomly selected from the set {(4,3), (4,1), (2,3), (1,3), (3,3), (1,1)}.
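A NumPy sketch of this far-end distortion pipeline, following the definitions reconstructed above; the function names are illustrative.

```python
import numpy as np

def hard_clip(x: np.ndarray, theta: float) -> np.ndarray:
    x_max = theta * np.max(np.abs(x))
    return np.clip(x, -x_max, x_max)

def soft_clip(x: np.ndarray, theta: float) -> np.ndarray:
    x_max = theta * np.max(np.abs(x))
    return x_max * x / np.sqrt(x_max ** 2 + x ** 2)

def sigmoidal(x_clip: np.ndarray, a_p: float, a_n: float) -> np.ndarray:
    # memoryless sigmoidal nonlinearity with separate positive/negative slopes
    b = 1.5 * x_clip - 0.3 * x_clip ** 2
    a = np.where(b > 0, a_p, a_n)
    return 2.0 / (1.0 + np.exp(-a * b)) - 1.0
```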
In order to simulate a real room reverberation environment, 50 virtual rooms with length and width randomly distributed between 3 m and 8 m, height between 2.5 m and 4.5 m, and reverberation time T60 randomly distributed between 200 ms and 400 ms are randomly constructed, and a virtual loudspeaker and microphone unit is arranged in each room. 10 segments of room impulse responses are constructed in each room using the image-source method. The training set, validation set and test set have 400, 30 and 70 segments of different room impulse responses, respectively. The signal after the sigmoidal transformation is convolved with the room impulse responses to obtain the simulated echo signals.
A frequency-domain linear adaptive filter based on the Kalman algorithm is used, and the constructed echo signals are adaptively filtered with the corresponding far-end signals to obtain the residual echo signals and the output signals of the adaptive filter. In this embodiment, the suppression of the artificial echo signal energy by the linear adaptive filter is about 17.0 dB. To construct the mixed audio signal, the residual echo and colored noise are mixed with the clean near-end speech. In the training set and the validation set, the signal-to-echo ratio (SER) before adaptive filter processing is randomly selected from the set {-14.2, -16.2, -18.2, -20.2} dB, and the energy ratio of the signal to the colored noise is randomly selected from the set {30, 20, 10} dB. In the test set, the signal-to-echo ratio is -18.2 dB and the energy ratio of the signal to the colored noise is 20 dB.
To evaluate the performance of the model in real-world scenarios, echo audio was recorded using loudspeakers. 920 segments of 4 s speech echo and 920 segments of 4 s music echo were recorded using the loudspeaker for testing, and 6 hours of speech echo and 6 hours of music echo were recorded using a different loudspeaker of the same model for the fine-tuning phase of training. The suppression of the actually recorded echo energy by the Kalman-based frequency-domain linear adaptive filter is about 24 dB. When constructing the recorded echo test set, the signal-to-echo ratio is -22.2 dB and the energy ratio of the signal to the colored noise is 20 dB.
The present embodiment employs a Perceptual Evaluation of Speech Quality (PESQ) index as an objective evaluation index of the residual echo suppression performance.
Parameter settings. In this embodiment, the output dimension N of the one-dimensional convolution layer of the time-domain network encoder is 256, the kernel size L is 8, the sub-block length K is 100, and the output dimension C of the fully connected layer is 128. The frame length of the short-time Fourier transform of the time-frequency-domain encoder is set to 400, the frame shift is 200, and the convolution layer output dimension C' is set to 128. Accordingly, the output dimensions of the encoders in the time domain and the time-frequency domain are T × 100 × 128 and T' × 99 × 128, respectively. A Gated Recurrent Unit (GRU) layer is selected as each recurrent layer in the two-way recurrent neural network.
During training, each batch consists of 8 segments of 4 s data sampled at 16 kHz, and training runs for 120 epochs. Model training uses the Adam optimizer with an initial learning rate of 1e-3, and the training process is stabilized by clipping gradients to a 2-norm of 5. If the validation set loss does not decrease for 2 consecutive training epochs, the learning rate is halved.
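Putting the stated hyperparameters together, a hedged training-loop sketch, reusing `sisnr_loss` and the model sketch from above; `train_loader`, `val_loader` and `validate` are hypothetical helpers, not part of the disclosure.

```python
import torch

def train_full(model, train_loader, val_loader, epochs=120):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # halve the learning rate when validation loss stalls for 2 epochs
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=2)
    for _ in range(epochs):
        for mixed, aux, near_end in train_loader:   # batches of 8 x 4 s @ 16 kHz
            loss = sisnr_loss(model(mixed, aux), near_end)
            opt.zero_grad()
            loss.backward()
            # stabilize training: clip gradients to a 2-norm of 5
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
            opt.step()
        sched.step(validate(model, val_loader))     # `validate` is a hypothetical helper
    return model
```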
In the training stage, echo signals are recorded or artificially constructed, the corresponding far-end signals are used to construct residual echo signals through adaptive filtering, and the residual echo signals are superposed with near-end speech and background noise to construct mixed audio signals. The linear echo signal output by the adaptive filter is used as the auxiliary signal. The mixed audio signal and the auxiliary signal are used as model inputs, and the clean near-end speech is used as the target output of the model for training.
In the enhancement stage, the signal obtained by adaptive filtering of the signal collected by the microphone, together with an auxiliary signal consisting of the output signal of the adaptive filter or the far-end signal, is input into the model to obtain an estimate of the clean near-end speech.
Fig. 9 and fig. 10 show histograms of the perceptual evaluation of speech quality (PESQ) values of the enhanced speech for different methods when the far-end signal is speech and music, in a simulated echo scenario and a recorded echo scenario, respectively. The dark blocks in the figures represent the PESQ scores when the far-end signal is speech, and the light blocks represent the PESQ scores when the far-end signal is music.
FIG. 11 is a graph of parameter quantities for different models in one example of the present disclosure. The number of parameters of each compared model is given in fig. 11. It can be seen that the method of the embodiments of the present disclosure achieves a significant improvement in near-end speech enhancement performance under multiple echo conditions over related residual echo suppression methods based on deep neural networks, and also has an advantage in the number of parameters. Comparing the performance of the algorithms under different conditions, the time-frequency-domain algorithm using the output signal of the adaptive filter as the auxiliary signal best balances the simulated and the real recorded echo scenarios.
Considering that an artificially constructed nonlinear echo signal differs greatly from the nonlinear echo signal of an actual loudspeaker, the embodiments of the present disclosure adopt a fine-tuning strategy to improve the performance of the model in real echo scenarios. The model trained on the artificial data set is regarded as a pre-trained model, and the pre-trained model is retrained on the real echo data set for fine-tuning. In the fine-tuning stage, only the parameters of the decoder and of the last network layer of the model are trained. The embodiments of the present disclosure test the case of using the output signal of the adaptive filter as the auxiliary signal, and use Time and TF to represent the time-domain algorithm and the time-frequency-domain algorithm, respectively, with the suffixes n and r representing the pre-trained model and the fine-tuned model, respectively.
Fig. 12 and fig. 13 are histograms of PESQ values of the enhanced speech when the far-end signal is speech and music for the pre-trained model and the fine-tuned model in the simulated echo scenario and the recorded echo scenario, respectively. It can be seen that the performance of the fine-tuned model on the artificial echo data set is somewhat reduced, but its performance on the recorded echo test set is clearly improved, which demonstrates the effectiveness of the fine-tuning strategy. Further, considering that only a small amount of training data is needed in the fine-tuning stage, the fine-tuning strategy can be effectively applied in practical scenarios.
Any of the training methods for the residual echo suppression model provided by the embodiments of the present disclosure may be performed by any suitable device with data processing capability, including but not limited to: terminal equipment, a server and the like. Alternatively, the training method of any residual echo suppression model provided by the embodiments of the present disclosure may be executed by a processor, for example, the processor may execute the training method of any residual echo suppression model mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in a memory. Details are not repeated below.
Exemplary devices
Fig. 14 is a block diagram of a training apparatus for a residual echo suppression model according to an embodiment of the present disclosure. As shown in fig. 14, the training apparatus for a residual echo suppression model includes: a mixed audio signal generation module 141, an auxiliary signal determination module 142 and a model training module 143.
The mixed audio signal generating module 141 is configured to generate a plurality of mixed audio signals based on the plurality of residual echo signals, the plurality of background noise signals, and the plurality of clean near-end speech signals. Wherein each mixed audio signal comprises a residual echo signal, a background noise signal and a clean near-end speech signal. The auxiliary signal determining module 142 is configured to determine a plurality of auxiliary signals corresponding to the plurality of mixed audio signals based on the plurality of mixed audio signals, wherein the auxiliary signal corresponding to each mixed audio signal is determined based on the far-end signal corresponding to the residual echo signal in each mixed audio signal. The model training module 143 is configured to train a residual echo suppression model based on the plurality of mixed audio signals and the plurality of auxiliary signals.
FIG. 15 is a block diagram of the structure of the model training module 143 in one example of the present disclosure. In one embodiment of the present disclosure, the model training module 143 includes:
a signal obtaining unit 1431 configured to obtain one mixed audio signal at a time from the plurality of mixed audio signals, and obtain one auxiliary signal from the plurality of auxiliary signals;
an encoder unit 1432, configured to perform feature extraction on the currently acquired mixed audio signal and the currently acquired auxiliary signal respectively through an encoder of the residual echo suppression model to obtain a first feature tensor and a second feature tensor, where the first feature tensor is a feature tensor of the currently acquired mixed audio signal, and the second feature tensor is a feature tensor of the currently acquired auxiliary signal;
a two-way recurrent neural network unit 1433, configured to process the first feature tensor and the second feature tensor through the two-way recurrent neural network of the residual echo suppression model, so as to obtain a third feature tensor;
an estimate parameter adjusting unit 1434, configured to perform, by a decoder of the residual echo suppression model, eigenspectrum estimation on the first eigentensor and the third eigentensor for a clean speech signal, and adjust parameters of the residual echo suppression model based on an eigenspectrum estimation result to train the residual echo network model.
In an embodiment of the present disclosure, the encoder unit 1432 is configured to perform a one-dimensional convolution on the currently-obtained mixed audio signal through an encoder of the residual echo suppression model to obtain a two-dimensional feature spectrum of the currently-obtained mixed audio signal, and perform a one-dimensional convolution on the currently-obtained auxiliary signal through an encoder of the residual echo suppression model to obtain a two-dimensional feature spectrum of the currently-obtained auxiliary signal. Wherein the one-dimensional convolution has a preset overlap ratio. The encoder unit 1432 is further configured to divide the two-dimensional feature spectrum of the currently acquired mixed audio signal into a plurality of first feature blocks with the preset overlap ratio after performing dimension reduction on the fully connected layer, and divide the two-dimensional feature spectrum of the currently acquired auxiliary signal into a plurality of second feature blocks with the preset overlap ratio after performing dimension reduction on the fully connected layer. The encoder unit 1432 is further configured to perform feature splicing on the first feature blocks with the preset overlap ratio to obtain the first feature tensor, and perform feature splicing on the second feature blocks with the preset overlap ratio to obtain the second feature tensor.
In an embodiment of the present disclosure, the decoder unit 1434 is configured to perform overlap-add operation on the third feature tensor according to the preset overlap ratio to obtain a target two-dimensional feature; and performing feature dimension dimensionality lifting on the target two-dimensional feature through a full connection layer with an activation function to obtain a mask estimation feature of a time domain feature, and taking a point-by-point multiplication result of the mask estimation feature and a two-dimensional feature spectrum of the mixed audio signal as a feature spectrum estimation result.
In another embodiment of the present disclosure, the encoder unit 1432 is configured to perform a short-time fourier transform on the one-dimensional waveform of the currently obtained mixed audio signal by using an encoder of the residual echo suppression model to obtain a first complex time-frequency spectrum feature, and perform a short-time fourier transform on the one-dimensional waveform of the auxiliary signal by using an encoder of the residual echo suppression model to obtain a second complex time-frequency spectrum feature. The encoder unit 1432 is further configured to perform feature splicing on a real part and an imaginary part of the first complex time-frequency spectrum feature to obtain the first feature tensor, and perform feature splicing on a real part and an imaginary part of the second complex time-frequency spectrum feature to obtain the second feature tensor.
In another embodiment of the present disclosure, the decoder unit 1434 is configured to perform deconvolution on the two deconvolution layers corresponding to the convolution layers of the encoder of the residual echo suppression model to obtain a first output feature and a second output feature after the third feature tensor is processed by using two fully connected layers, respectively. Wherein the first output feature is a one-dimensional feature and the second output feature is a two-dimensional feature. The decoder unit 1434 is further configured to multiply the first output characteristic, the second output characteristic, and a time-frequency spectrum amplitude of the first complex time-frequency spectrum characteristic to obtain a characteristic spectrum estimation result of the clean speech signal.
Referring to fig. 15, in an embodiment of the present disclosure, the model training module 143 further includes a training target determining unit 1435, configured to use a target signal-to-noise ratio as the training target of the residual echo network model when training the residual echo network model, wherein the target signal-to-noise ratio is a scale-invariant signal-to-noise ratio.
It should be noted that, a specific implementation of the training apparatus for a residual echo suppression model in the embodiment of the present disclosure is similar to a specific implementation of the residual echo suppression method in the embodiment of the present disclosure, and for details, reference is made to a part of the residual echo suppression method, and details are not described here in order to reduce redundancy.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 16. As shown in fig. 16, the electronic device includes one or more processors 161 and memory 162.
The processor 161 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
Memory 162 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 161 to implement the training method of the residual echo suppression model of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device may further include an input device 163 and an output device 164, interconnected by a bus system and/or another form of connection mechanism (not shown). The input device 163 may include, for example, a keyboard and a mouse. The output device 164 may include, for example, a display, a speaker, a printer, and a communication network with remote output devices connected thereto.
Of course, for simplicity, only some of the components of the electronic device relevant to the present disclosure are shown in fig. 16, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device may include any other suitable components, depending on the particular application.
Exemplary computer-readable storage medium
A computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments. It should be noted, however, that the advantages and effects mentioned in the present disclosure are merely examples rather than limitations, and should not be considered essential to the various embodiments. The specific details disclosed above are provided for illustration and ease of understanding only, and the disclosure is not limited to them.
In this specification, the embodiments are described progressively: each embodiment focuses on its differences from the other embodiments, and the embodiments may be consulted for one another's same or similar parts. The apparatus embodiments are described relatively briefly because they substantially correspond to the method embodiments; for relevant details, refer to the description of the method embodiments.
The methods and apparatus of the present disclosure may be implemented in many ways, for example by software, hardware, firmware, or any combination thereof. The order described above for the steps of the methods is for illustration only; unless specifically stated otherwise, the steps of the methods of the present disclosure are not limited to that order. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the methods according to the present disclosure.
It is also noted that in the apparatus and methods of the present disclosure, the components or steps may be broken down and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A training method of a residual echo suppression model, comprising:
generating a plurality of mixed audio signals based on a plurality of residual echo signals, a plurality of background noise signals, and a plurality of near-end speech signals, wherein each mixed audio signal comprises one residual echo signal, one background noise signal, and one near-end speech signal;
determining a plurality of auxiliary signals corresponding to the plurality of mixed audio signals based on the plurality of mixed audio signals, wherein the auxiliary signal corresponding to each mixed audio signal is determined based on the far-end signal corresponding to the residual echo signal in each mixed audio signal;
training a residual echo suppression model based on the plurality of mixed audio signals and the plurality of auxiliary signals.
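To make the generation step of claim 1 concrete, a minimal sketch follows; the mixing gains are placeholders, since the claim fixes only the composition of each mixture, not the signal levels:

```python
import numpy as np

def make_mixed_audio(residual_echo: np.ndarray,
                     background_noise: np.ndarray,
                     near_end_speech: np.ndarray) -> np.ndarray:
    """One training mixture: the sum of one residual echo signal, one
    background noise signal, and one near-end speech signal."""
    n = min(len(residual_echo), len(background_noise), len(near_end_speech))
    return (near_end_speech[:n]
            + 0.5 * residual_echo[:n]       # assumed echo level
            + 0.1 * background_noise[:n])   # assumed noise level
```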
2. The training method of a residual echo suppression model according to claim 1, wherein the training a residual echo suppression model based on the plurality of mixed audio signals and the plurality of auxiliary signals comprises:
acquiring, at each iteration, one mixed audio signal from the plurality of mixed audio signals and the corresponding auxiliary signal from the plurality of auxiliary signals;
respectively performing feature extraction on a currently acquired mixed audio signal and a currently acquired auxiliary signal through an encoder of the residual echo suppression model to obtain a first feature tensor and a second feature tensor, wherein the first feature tensor is a feature tensor of the currently acquired mixed audio signal, and the second feature tensor is a feature tensor of the currently acquired auxiliary signal;
processing the first feature tensor and the second feature tensor through a dual-path recurrent neural network of the residual echo suppression model to obtain a third feature tensor;
and performing, by a decoder of the residual echo suppression model, feature spectrum estimation of the clean speech signal on the first feature tensor and the third feature tensor, and adjusting parameters of the residual echo suppression model based on the feature spectrum estimation result, so as to train the residual echo suppression model.
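The recurrent stage of this claim, taken together with the chunked features of claim 3 below, suggests a dual-path arrangement; the following is one possible sketch, in which the layer sizes, the use of bidirectional LSTMs, and the residual connections are assumptions:

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """One dual-path block: an intra-chunk RNN models structure within each
    chunk, and an inter-chunk RNN models dependencies across chunks."""
    def __init__(self, feat: int = 64, hidden: int = 128):
        super().__init__()
        self.intra = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.inter = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.proj_intra = nn.Linear(2 * hidden, feat)
        self.proj_inter = nn.Linear(2 * hidden, feat)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, chunks, chunk_len, feat), e.g. the spliced feature tensor
        b, c, l, f = x.shape
        y, _ = self.intra(x.reshape(b * c, l, f))          # within each chunk
        x = x + self.proj_intra(y).reshape(b, c, l, f)     # residual connection
        z, _ = self.inter(x.transpose(1, 2).reshape(b * l, c, f))  # across chunks
        z = self.proj_inter(z).reshape(b, l, c, f).transpose(1, 2)
        return x + z                                       # third-feature-tensor candidate
```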
3. The training method of a residual echo suppression model according to claim 2, wherein the performing, by an encoder of the residual echo suppression model, feature extraction on the currently acquired mixed audio signal and the currently acquired auxiliary signal respectively to obtain a first feature tensor and a second feature tensor comprises:
performing one-dimensional convolution with a preset overlap rate on the currently acquired mixed audio signal through the encoder of the residual echo suppression model to obtain a two-dimensional feature spectrum of the currently acquired mixed audio signal, and performing the one-dimensional convolution on the currently acquired auxiliary signal through the encoder to obtain a two-dimensional feature spectrum of the currently acquired auxiliary signal;
reducing the dimension of the two-dimensional feature spectrum of the currently acquired mixed audio signal through a fully connected layer and then dividing it into a plurality of first feature blocks at the preset overlap rate, and reducing the dimension of the two-dimensional feature spectrum of the currently acquired auxiliary signal through the fully connected layer and then dividing it into a plurality of second feature blocks at the preset overlap rate;
and performing feature splicing on the plurality of first feature blocks to obtain the first feature tensor, and performing feature splicing on the plurality of second feature blocks to obtain the second feature tensor.
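A sketch of the segmentation step of this claim, assuming a 50% overlap rate and a fixed chunk length (neither value is specified by the claim):

```python
import torch

def segment(features: torch.Tensor, chunk_len: int = 100, overlap: float = 0.5) -> torch.Tensor:
    """Divide a two-dimensional feature spectrum (feat, time) into
    overlapping feature blocks, then splice (stack) the blocks into a
    three-dimensional feature tensor."""
    hop = int(chunk_len * (1 - overlap))
    _, time = features.shape
    blocks = [features[:, s:s + chunk_len]
              for s in range(0, time - chunk_len + 1, hop)]
    return torch.stack(blocks, dim=0)   # (n_blocks, feat, chunk_len)
```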
4. The training method of a residual echo suppression model according to claim 3, wherein the performing, by a decoder of the residual echo suppression model, feature spectrum estimation of the clean speech signal on the first feature tensor and the third feature tensor comprises:
performing an overlap-add operation on the third feature tensor according to the preset overlap rate to obtain a target two-dimensional feature;
and raising the feature dimension of the target two-dimensional feature through a fully connected layer with an activation function to obtain a mask estimate of the time-domain features, and taking the point-by-point product of the mask estimate and the two-dimensional feature spectrum of the mixed audio signal as the feature spectrum estimation result.
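A corresponding decoder-side sketch: overlap-add reverses the segmentation of claim 3, and a fully connected layer with a sigmoid activation (one possible choice; the claim names only an activation function) raises the feature dimension to produce the mask:

```python
import torch
import torch.nn as nn

def overlap_add(blocks: torch.Tensor, overlap: float = 0.5) -> torch.Tensor:
    """Overlap-add the blocks of the third feature tensor back into a
    two-dimensional feature at the preset overlap rate (assumed 50%)."""
    n_blocks, feat, chunk_len = blocks.shape
    hop = int(chunk_len * (1 - overlap))
    out = torch.zeros(feat, hop * (n_blocks - 1) + chunk_len)
    for i in range(n_blocks):
        out[:, i * hop:i * hop + chunk_len] += blocks[i]
    return out

target_2d = overlap_add(torch.randn(10, 64, 100))        # (64, 550)
fc = nn.Sequential(nn.Linear(64, 256), nn.Sigmoid())     # dimensions assumed
mask = fc(target_2d.transpose(0, 1)).transpose(0, 1)     # mask over time-domain features
mixed_spec = torch.randn(256, 550)   # 2-D feature spectrum of the mixed signal
estimate = mask * mixed_spec         # point-by-point product = spectrum estimate
```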
5. The training method of a residual echo suppression model according to claim 2, wherein the performing, by an encoder of the residual echo suppression model, feature extraction on the currently acquired mixed audio signal and the currently acquired auxiliary signal respectively to obtain a first feature tensor and a second feature tensor comprises:
performing a short-time Fourier transform on the one-dimensional waveform of the currently acquired mixed audio signal through the encoder of the residual echo suppression model to obtain a first complex time-frequency spectrum feature, and performing a short-time Fourier transform on the one-dimensional waveform of the currently acquired auxiliary signal through the encoder to obtain a second complex time-frequency spectrum feature;
and performing feature splicing on the real part and the imaginary part of the first complex time-frequency spectrum feature to obtain the first feature tensor, and performing feature splicing on the real part and the imaginary part of the second complex time-frequency spectrum feature to obtain the second feature tensor.
6. The training method of a residual echo suppression model according to claim 5, wherein the performing, by a decoder of the residual echo suppression model, feature spectrum estimation of the clean speech signal on the first feature tensor and the third feature tensor comprises:
processing the third feature tensor with two fully connected layers respectively, and then performing deconvolution with two deconvolution layers corresponding to convolution layers of the encoder of the residual echo suppression model to obtain a first output feature and a second output feature, wherein the first output feature is a one-dimensional feature and the second output feature is a two-dimensional feature;
and multiplying the first output feature, the second output feature, and the time-frequency spectrum magnitude of the first complex time-frequency spectrum feature to obtain the feature spectrum estimation result of the clean speech signal.
7. The training method of a residual echo suppression model according to claim 2, further comprising:
when training the residual echo suppression model, using a target signal-to-noise ratio as the training target of the residual echo suppression model, wherein the target signal-to-noise ratio is a scale-invariant signal-to-noise ratio.
8. A training apparatus for a residual echo suppression model, comprising:
a mixed audio signal generating module, configured to generate a plurality of mixed audio signals based on a plurality of residual echo signals, a plurality of background noise signals, and a plurality of near-end speech signals, wherein each mixed audio signal includes one residual echo signal, one background noise signal, and one near-end speech signal;
an auxiliary signal determination module configured to determine a plurality of auxiliary signals corresponding to the plurality of mixed audio signals based on the plurality of mixed audio signals, wherein the auxiliary signal corresponding to each mixed audio signal is determined based on the far-end signal corresponding to the residual echo signal in each mixed audio signal;
a model training module, configured to train a residual echo suppression model based on the plurality of mixed audio signals and the plurality of auxiliary signals.
9. A computer-readable storage medium storing a computer program for executing the training method of a residual echo suppression model according to any one of claims 1 to 7.
10. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the training method of the residual echo suppression model according to any one of claims 1 to 7.
CN202111017286.0A 2021-08-31 2021-08-31 Training method and training device for residual echo suppression model Pending CN113707167A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111017286.0A CN113707167A (en) 2021-08-31 2021-08-31 Training method and training device for residual echo suppression model

Publications (1)

Publication Number Publication Date
CN113707167A true CN113707167A (en) 2021-11-26

Family

ID=78658406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111017286.0A Pending CN113707167A (en) 2021-08-31 2021-08-31 Training method and training device for residual echo suppression model

Country Status (1)

Country Link
CN (1) CN113707167A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5839101A (en) * 1995-12-12 1998-11-17 Nokia Mobile Phones Ltd. Noise suppressor and method for suppressing background noise in noisy speech, and a mobile station
US20070160154A1 (en) * 2005-03-28 2007-07-12 Sukkar Rafid A Method and apparatus for injecting comfort noise in a communications signal
CN105635500A (en) * 2014-10-29 2016-06-01 联芯科技有限公司 System and method for inhibiting echo and noise of double microphones
CN107483761A (en) * 2016-06-07 2017-12-15 电信科学技术研究院 A kind of echo suppressing method and device
CN108696648A (en) * 2018-05-16 2018-10-23 北京小鱼在家科技有限公司 A kind of method, apparatus, equipment and the storage medium of Short Time Speech signal processing
US20210256993A1 (en) * 2020-02-18 2021-08-19 Facebook, Inc. Voice Separation with An Unknown Number of Multiple Speakers
CN111653288A (en) * 2020-06-18 2020-09-11 南京大学 Target person voice enhancement method based on conditional variation self-encoder
CN112037809A (en) * 2020-09-09 2020-12-04 南京大学 Residual echo suppression method based on multi-feature flow structure deep neural network
CN112071329A (en) * 2020-09-16 2020-12-11 腾讯科技(深圳)有限公司 Multi-person voice separation method and device, electronic equipment and storage medium
CN112786068A (en) * 2021-01-12 2021-05-11 普联国际有限公司 Audio source separation method and device and storage medium
CN112966090A (en) * 2021-03-30 2021-06-15 思必驰科技股份有限公司 Dialogue audio data processing method, electronic device, and computer-readable storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114242100A (en) * 2021-12-16 2022-03-25 北京百度网讯科技有限公司 Audio signal processing method, training method and device, equipment and storage medium thereof
WO2023226234A1 (en) * 2022-05-23 2023-11-30 神盾股份有限公司 Model training method and apparatus, and computer-readable non-transitory storage medium
CN115565543A (en) * 2022-11-24 2023-01-03 全时云商务服务股份有限公司 Single-channel voice echo cancellation method and device based on deep neural network

Similar Documents

Publication Publication Date Title
Luo et al. Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
CN113707167A (en) Training method and training device for residual echo suppression model
JP5666444B2 (en) Apparatus and method for processing an audio signal for speech enhancement using feature extraction
Zhao et al. Late reverberation suppression using recurrent neural networks with long short-term memory
CN112037809A (en) Residual echo suppression method based on multi-feature flow structure deep neural network
US20230162758A1 (en) Systems and methods for speech enhancement using attention masking and end to end neural networks
CN113345460B (en) Audio signal processing method, device, equipment and storage medium
CN112949708A (en) Emotion recognition method and device, computer equipment and storage medium
CN110797033A (en) Artificial intelligence-based voice recognition method and related equipment thereof
CN111883154B (en) Echo cancellation method and device, computer-readable storage medium, and electronic device
JP2020071482A (en) Word sound separation method, word sound separation model training method and computer readable medium
Takeuchi et al. Invertible DNN-based nonlinear time-frequency transform for speech enhancement
Astudillo et al. Uncertainty propagation
CN115938346A (en) Intonation evaluation method, system, equipment and storage medium
Saleem et al. Variance based time-frequency mask estimation for unsupervised speech enhancement
CN116013344A (en) Speech enhancement method under multiple noise environments
Hammam et al. Blind signal separation with noise reduction for efficient speaker identification
CN114049882A (en) Noise reduction model training method and device and storage medium
CN115881157A (en) Audio signal processing method and related equipment
Razani et al. A reduced complexity MFCC-based deep neural network approach for speech enhancement
Kim et al. HD-DEMUCS: General Speech Restoration with Heterogeneous Decoders
KR102505653B1 (en) Method and apparatus for integrated echo and noise removal using deep neural network
Popović et al. Speech Enhancement Using Augmented SSL CycleGAN
CN113744754B (en) Enhancement processing method and device for voice signal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination