CN115083431A - Echo cancellation method and device, electronic equipment and computer readable medium - Google Patents

Echo cancellation method and device, electronic equipment and computer readable medium Download PDF

Info

Publication number
CN115083431A
CN115083431A (application CN202210701124.7A)
Authority
CN
China
Prior art keywords
signal
training data
echo
residual
audio signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210701124.7A
Other languages
Chinese (zh)
Inventor
Li Nan (李楠)
Zhang Chen (张晨)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202210701124.7A
Publication of CN115083431A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/26 Pre-filtering or post-filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure relates to an echo cancellation method and device, an electronic device, and a computer readable medium, and belongs to the technical field of signal processing. The method comprises the following steps: acquiring an original audio signal and a reference signal required by a filter for echo cancellation; inputting the original audio signal and the reference signal into the filter for linear echo filtering to obtain a residual signal; and performing interference signal cancellation and effective signal enhancement on the residual signal based on a target model to obtain a target audio signal, wherein the interference signal comprises a nonlinear echo signal and a noise signal. The method first eliminates the linear echo signal in the original audio signal using a signal processing method, and then eliminates the nonlinear echo signal and the noise signal in the residual signal using a deep learning model, so that the echo cancellation effect can be improved at relatively low model complexity while ensuring high audio quality.

Description

Echo cancellation method and device, electronic equipment and computer readable medium
Technical Field
The present disclosure relates to the field of signal processing technologies, and in particular, to an echo cancellation method, an echo cancellation device, an electronic device, and a computer readable medium.
Background
In recent years, with the rapid development of communication technology, audio and video products such as online conference systems and karaoke systems have continuously enriched everyday life and social interaction. In these applications, the fluency, integrity, and intelligibility of the transmitted audio content directly determine the communication quality between users, so the optimization and innovation of echo cancellation techniques is indispensable.
The echo phenomenon is caused by the sound of a loudspeaker being fed back into a microphone. If the echo cannot be effectively suppressed, the user hears a delayed copy of the sound, which directly degrades speech intelligibility and leads to a poor user experience. Therefore, AEC (Acoustic Echo Cancellation) techniques play a crucial role in improving audio quality.
At present, echo cancellation mainly relies on adaptive filtering algorithms, but such methods do not cancel echo thoroughly enough in scenarios with nonlinear echo.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to providing an echo cancellation method, an echo cancellation device, an electronic device, and a computer readable medium, so as to improve the echo cancellation effect at least to a certain extent and ensure high audio quality.
According to a first aspect of the present disclosure, there is provided a method for canceling echo, including:
acquiring an original audio signal and a reference signal required by a filter for echo cancellation;
inputting the original audio signal and the reference signal into the filter to perform linear echo filtering processing to obtain a residual signal;
and carrying out interference signal elimination and effective signal enhancement on the residual signal based on a target model to obtain a target audio signal, wherein the interference signal comprises a nonlinear echo signal and a noise signal.
In an exemplary embodiment of the present disclosure, the inputting the original audio signal and the reference signal into the filter for linear echo filtering processing to obtain a residual signal includes:
convolving the reference signal with the filter weight of the filter to obtain an estimated value of a linear echo signal;
and obtaining residual signals remained in the original audio signal according to the difference value between the original audio signal and the estimated value of the linear echo signal.
In an exemplary embodiment of the present disclosure, the performing interference signal cancellation and effective signal enhancement on the residual signal based on a target model to obtain a target audio signal includes:
obtaining a frequency domain characteristic corresponding to the original audio signal according to the residual signal and the reference signal;
inputting the frequency domain characteristics into a pre-trained target model to obtain an audio signal time-frequency mask corresponding to a target audio signal;
obtaining the frequency spectrum of the target audio signal according to the frequency spectrum of the residual signal and the audio signal time-frequency mask;
and converting the frequency spectrum of the target audio signal into a time domain signal to obtain an enhanced target audio signal.
In an exemplary embodiment of the present disclosure, the training method of the filter includes:
acquiring a plurality of groups of training data, wherein each group of training data comprises original audio training data, reference training data corresponding to the original audio training data, and residual training data in the original audio training data;
convolving the reference training data with the initial filter weight of the filter to obtain linear echo prediction data;
obtaining residual prediction data according to a difference value between the original audio training data and the linear echo prediction data;
iterating initial filter weights for the filter according to a difference between the residual prediction data and the residual training data to train the filter.
In an exemplary embodiment of the present disclosure, the training method of the target model includes:
taking the residual training data and the reference training data in each group of training data as first training data, and training an initial neural network model according to the first training data to obtain a first neural network model;
taking the original audio training data and the reference training data in each group of training data as second training data, and obtaining mixed training data according to a preset data proportion and the first training data and the second training data;
and training the first neural network model according to the mixed training data to obtain a target model.
In an exemplary embodiment of the present disclosure, the method further comprises:
and obtaining test data according to the reference training data and the original audio training data in each group of training data, and testing the target model according to the test data.
In an exemplary embodiment of the disclosure, the training the initial neural network model according to the first training data to obtain a first neural network model includes:
obtaining corresponding frequency domain characteristic training data according to the first training data, and inputting the frequency domain characteristic training data into an initial neural network model to obtain an audio sample time-frequency mask and an interference sample time-frequency mask;
obtaining a sample frequency spectrum of the enhanced target audio signal according to the sample frequency spectrum of the residual training data and the audio sample time-frequency mask, and obtaining a sample frequency spectrum of an interference signal according to the sample frequency spectrum of the residual training data and the interference sample time-frequency mask;
determining spectral distance loss according to the sample frequency spectrum of the enhanced target audio signal and the sample frequency spectrum of the interference signal, and determining signal-to-noise ratio loss according to the signal-to-noise ratio parameter of the enhanced target audio signal and the signal-to-noise ratio parameter of the interference signal;
and obtaining the overall loss according to the spectral distance loss and the signal-to-noise ratio loss, and iterating the neural network parameters in the initial neural network model according to the overall loss to obtain a trained first neural network model.
In an exemplary embodiment of the present disclosure, the determining a spectral distance loss according to the sample spectrum of the enhanced target audio signal and the sample spectrum of the interference signal includes:
determining audio frequency spectrum distance loss according to the sample frequency spectrum of the residual training data and the sample frequency spectrum of the enhanced target audio signal;
acquiring interference training data in training data, and determining interference spectrum distance loss according to a sample spectrum of the interference training data and a sample spectrum of the interference signal;
and obtaining the spectral distance loss according to the audio spectral distance loss and the interference spectral distance loss.
In an exemplary embodiment of the present disclosure, the method for generating residual training data includes:
acquiring the original audio training data collected by a microphone of a user terminal, wherein the original audio training data comprises original voice training data and original music training data;
carrying out variable speed processing, reverberation processing and delay processing on the original audio training data to obtain simulated echo data corresponding to the original audio training data;
and obtaining residual training data corresponding to the original audio training data according to the difference value between the original audio training data and the corresponding simulated echo data.
According to a second aspect of the present disclosure, there is provided an echo cancellation device, including:
the audio signal acquisition module is configured to acquire an original audio signal and a reference signal required by echo cancellation of the filter;
a linear echo cancellation module configured to perform linear echo filtering processing on the original audio signal and the reference signal input to the filter, so as to obtain a residual signal;
and the interference signal eliminating module is configured to perform interference signal elimination and effective signal enhancement on the residual signal based on a target model to obtain a target audio signal, wherein the interference signal comprises a nonlinear echo signal and a noise signal.
In an exemplary embodiment of the present disclosure, the linear echo cancellation module includes:
a linear echo estimation unit configured to perform convolution of the reference signal with the filter weights of the filter to obtain an estimated value of a linear echo signal;
a linear echo cancellation unit configured to perform deriving a residual signal remaining in the original audio signal according to a difference between the original audio signal and an estimated value of the linear echo signal.
In an exemplary embodiment of the present disclosure, the interference signal cancellation module includes:
a frequency domain characteristic determining unit configured to perform obtaining a frequency domain characteristic corresponding to the original audio signal according to the residual signal and the reference signal;
the time-frequency mask determining unit is configured to input the frequency domain characteristics into a pre-trained target model to obtain an audio signal time-frequency mask corresponding to a target audio signal;
a signal spectrum determination unit configured to perform deriving a spectrum of a target audio signal from a spectrum of the residual signal and the audio signal time-frequency mask;
a signal spectrum conversion unit configured to perform a spectrum conversion of the target audio signal into a time domain signal, resulting in an enhanced target audio signal.
In an exemplary embodiment of the present disclosure, the echo cancellation apparatus further includes a filter training module, and the filter training module includes:
a training data obtaining unit configured to perform obtaining a plurality of sets of training data, wherein each set of training data includes original audio training data, reference training data corresponding to the original audio training data, and residual training data in the original audio training data;
a linear echo prediction data determination unit configured to perform a convolution of the reference training data with an initial filter weight of the filter to obtain linear echo prediction data;
a residual prediction data determination unit configured to perform deriving residual prediction data from a difference between the original audio training data and the linear echo prediction data;
a filter weight iteration unit configured to perform an iteration of initial filter weights of the filter according to a difference between the residual prediction data and the residual training data to train the filter.
In an exemplary embodiment of the present disclosure, the echo cancellation apparatus further includes a target model training module, and the target model training module includes:
an initial neural network model training unit configured to perform training on an initial neural network model according to first training data, wherein the first training data is the residual training data and the reference training data in each set of training data, and obtain a first neural network model;
a mixed training data determining unit configured to perform, as second training data, the original audio training data and the reference training data in each set of the training data, and obtain mixed training data according to a preset data ratio according to the first training data and the second training data;
and the first neural network model training unit is configured to train the first neural network model according to the mixed training data to obtain a target model.
In an exemplary embodiment of the present disclosure, the target model training module further includes:
and the target model testing unit is configured to execute test data obtained according to the reference training data and the original audio training data in each set of training data, and test the target model according to the test data.
In an exemplary embodiment of the present disclosure, the initial neural network model training unit includes:
a sample time-frequency mask determining unit, configured to perform obtaining of corresponding frequency-domain feature training data according to the first training data, and input the frequency-domain feature training data into an initial neural network model, so as to obtain an audio sample time-frequency mask and an interference sample time-frequency mask;
a sample spectrum determination unit configured to perform deriving a sample spectrum of the enhanced target audio signal according to the sample spectrum of the residual training data and the audio sample time-frequency mask, and deriving a sample spectrum of an interference signal according to the sample spectrum of the residual training data and the interference sample time-frequency mask;
a network loss determination unit configured to perform a spectral distance loss determination according to the sample spectrum of the enhanced target audio signal and the sample spectrum of the interference signal, and determine a signal-to-noise ratio loss according to a signal-to-noise ratio parameter of the enhanced target audio signal and a signal-to-noise ratio parameter of the interference signal;
and the neural network parameter iteration unit is configured to execute the operation of obtaining the overall loss according to the spectrum distance loss and the signal-to-noise ratio loss, and iterate the neural network parameters in the initial neural network model according to the overall loss to obtain the trained first neural network model.
In an exemplary embodiment of the present disclosure, the network loss determination unit includes:
an audio spectral distance loss determination unit configured to perform determining an audio spectral distance loss from a sample spectrum of the residual training data and a sample spectrum of the enhanced target audio signal;
an interference spectrum distance loss determination unit configured to perform acquisition of interference training data in training data and determine an interference spectrum distance loss according to a sample spectrum of the interference training data and a sample spectrum of the interference signal;
a spectral distance loss determination unit configured to perform deriving a spectral distance loss from the audio spectral distance loss and the interference spectral distance loss.
In an exemplary embodiment of the present disclosure, the training data obtaining unit includes:
the original audio training data acquisition unit is configured to acquire the original audio training data acquired by a microphone of a user terminal, wherein the original audio training data comprises original voice training data and original music training data;
the analog echo data generating unit is configured to execute variable speed processing, reverberation processing and delay processing on the original audio training data to obtain analog echo data corresponding to the original audio training data;
a residual training data generating unit configured to perform obtaining residual training data corresponding to the original audio training data according to a difference between the original audio training data and the corresponding simulated echo data.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the echo cancellation method according to any one of the above.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the echo cancellation method of any one of the above.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of echo cancellation of any of the above.
The exemplary embodiments of the present disclosure may have the following advantageous effects:
in the echo cancellation method according to the exemplary embodiments of the present disclosure, the linear echo signal in the original audio signal is cancelled by a pre-trained filter, and then the nonlinear echo signal and the noise signal in the residual signal are cancelled by a pre-trained target model, so as to obtain an enhanced target audio signal. On the one hand, by combining conventional signal processing with a deep learning method, the echo cancellation effect can be improved at relatively low system complexity while ensuring high audio quality; on the other hand, not only the echo but also the noise interference can be eliminated, which preserves audio intelligibility to the greatest extent and further improves the user experience. The echo cancellation method in the exemplary embodiments of the present disclosure is suitable not only for karaoke scenarios but also for voice echo cancellation in conventional scenarios, and can achieve better sound quality.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 is a flow chart illustrating an echo cancellation method according to an example embodiment of the present disclosure;
fig. 2 shows a schematic diagram of an echo cancellation system in accordance with one embodiment of the present disclosure.
FIG. 3 shows a flow diagram of eliminating a linear echo signal in an original audio signal according to an example embodiment of the present disclosure;
fig. 4 shows a flow diagram of canceling an interference signal in a residual signal according to an example embodiment of the present disclosure;
FIG. 5 shows a flow chart diagram of a method of training a filter of an example embodiment of the present disclosure;
FIG. 6 shows a flow diagram of a method of training a target model of an example embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of a neural network model in accordance with one embodiment of the present disclosure.
FIG. 8 illustrates a flow diagram of training an initial neural network model with first training data in accordance with an exemplary embodiment of the present disclosure;
fig. 9 shows a block diagram of an echo cancellation device of an example embodiment of the present disclosure;
FIG. 10 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in other sequences than those illustrated or described herein.
The following example embodiments may be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In some related embodiments, an adaptive filtering algorithm may be used to cancel the echo in the audio signal, such as an LMS (Least Mean Square), NLMS (Normalized Least Mean Square) algorithm, or the like. The adaptive filtering algorithm estimates an approximate echo path to approximate a real echo path by adjusting the weight vector of the filter, so as to obtain an estimated value of an echo signal, and subtracts the estimated echo signal from an original signal at a receiving end, so as to obtain an ideal pure signal. However, the adaptive filtering algorithm is not thorough enough for echo cancellation in a scene with nonlinear echo.
In other related embodiments, echo cancellation of the audio signal may also be performed based on a neural network. A neural network has nonlinear fitting capability and can directly cancel both linear and nonlinear echo: effective features are extracted from the signal collected by the near-end microphone and from the far-end reference signal, concatenated, and fed into the neural network, from which the clean signal can be recovered. However, this approach places high demands on the modeling capability of the network and usually requires a more complex or refined model to achieve good results, which makes it difficult to apply in practical scenarios.
The present exemplary embodiment first provides a method of canceling echo. Referring to fig. 1, the echo cancellation method may include the following steps:
and S110, acquiring an original audio signal and a reference signal required by a filter for echo cancellation.
And S120, inputting the original audio signal and the reference signal into a filter to perform linear echo filtering processing to obtain a residual signal.
And S130, eliminating interference signals and enhancing effective signals of the residual signals based on the target model to obtain target audio signals, wherein the interference signals comprise nonlinear echo signals and noise signals.
In the echo cancellation method according to the exemplary embodiment of the present disclosure, a linear echo signal in an original audio signal is cancelled by using a pre-trained filter, and then a non-linear echo signal and a noise signal in a residual signal are cancelled by using a pre-trained target model, so as to obtain an enhanced target audio signal. In the echo cancellation method in the exemplary embodiment of the present disclosure, on one hand, by combining the conventional signal processing and the deep learning method, the echo cancellation effect can be improved on the premise of a smaller system complexity, and a higher audio quality is ensured; on the other hand, not only can the echo be eliminated, but also the noise interference can be eliminated, the audio intelligibility can be ensured to the maximum extent, and the user experience is further improved. The echo cancellation method in the exemplary embodiment of the present disclosure is not only suitable for the karaoke scenes, but also supports the voice echo cancellation in the conventional scenes, and can achieve better tone quality.
Next, the above steps of the present exemplary embodiment will be described in more detail with reference to fig. 2 to 8.
In step S110, an original audio signal and a reference signal required for echo cancellation by the filter are acquired.
Fig. 2 is a schematic diagram of an echo cancellation system in accordance with an embodiment of the present disclosure. In the echo cancellation system, r (t) is a far-end reference signal at time t, s (t) is a near-end target audio signal at time t, n (t) is noise, and h (t) is an impulse response corresponding to an echo path of the far-end reference signal.
For example, during real-time communication, the signal transmitted from the opposite end is the far-end signal and also serves as the reference signal for echo cancellation. In a karaoke system, the far-end reference signal may be the audio data transmitted from the opposite end during real-time communication, the background music played by the system itself while singing, or a mixture of the two.
The echo signals mainly include linear echo signals and nonlinear echo signals, where the linear echo signals y (t) are echo signals directly received by the near-end microphone, and are defined as:
y(t)=r(t)*h(t)
where * denotes the convolution operation. The nonlinear echo signal, denoted v(t), is the echo received by the near-end microphone after the far-end reference signal has propagated along multiple paths. The original audio signal actually captured by the near-end microphone is d(t), which is defined as:
d(t)=s(t)+n(t)+r(t)*h(t)+v(t)
in order to receive a pure target audio signal s (t), in this exemplary embodiment, a conventional adaptive filtering method may be used to eliminate a linear Echo signal y (t) in d (t) to obtain a Residual signal e (t), and then a neural network in an RES (Residual Echo Suppression) module is used to eliminate a Residual nonlinear Echo v (t) and a noise n (t), so that an audio signal received at a far end is closer to a target audio signal s (t) at a near end.
In step S120, the original audio signal and the reference signal are input into a filter to perform linear echo filtering, so as to obtain a residual signal.
In this exemplary embodiment, the filter is an adaptive filter, and linear echo cancellation aims to remove the r(t)*h(t) component of d(t) described in the echo cancellation system, so as to obtain the residual signal e(t). As shown in fig. 3, inputting the original audio signal and the reference signal into the filter for linear echo filtering to obtain a residual signal may specifically include the following steps:
and S310, convolving the reference signal with the filter weight of the filter to obtain an estimated value of the linear echo signal.
The input far-end reference signal r(t) is convolved with the filter weights h(t) of the adaptive filter to obtain the estimate r(t)*h(t) of the linear echo signal.
And S320, obtaining residual signals remained in the original audio signal according to the difference value between the original audio signal and the estimated value of the linear echo signal.
From the difference between the original audio signal d(t) and the estimate r(t)*h(t) of the linear echo signal, the residual signal e(t) remaining in the original audio signal is obtained. After linear AEC, the components of the residual signal e(t) are as follows:
e(t)=s(t)+n(t)+v(t)
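A minimal sketch of steps S310 and S320, assuming the adaptive filter weights h_hat have already been estimated (per-block and frequency-domain variants are omitted); the function name is illustrative.

```python
import numpy as np

def linear_echo_filter(d, r, h_hat):
    """Subtract the estimated linear echo from the microphone signal."""
    y_hat = np.convolve(r, h_hat)[: len(d)]   # estimated linear echo r(t) * h_hat(t)
    return d - y_hat                          # residual e(t) = s(t) + n(t) + v(t)
```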
in step S130, interference signal cancellation and effective signal enhancement are performed on the residual signal based on the target model, so as to obtain a target audio signal, where the interference signal includes a nonlinear echo signal and a noise signal.
In this example embodiment, the target model may be a neural network model, such as a cross network (CrossNet), which can be used to eliminate the interference signal in the residual signal, where the interference signal includes the nonlinear echo signal and the noise signal. As shown in fig. 4, performing interference signal cancellation and effective signal enhancement on the residual signal based on the target model to obtain the target audio signal may specifically include the following steps:
and S410, obtaining frequency domain characteristics corresponding to the original audio signal according to the residual signal and the reference signal.
First, frequency-domain features are extracted from the outputs of the linear AEC, namely the residual signal e(t) and the far-end reference signal r(t), and spliced into a feature X ∈ R^(T×F×C) corresponding to the original audio signal, expressed as follows:
X = concat(FEAT[e(t)], FEAT[r(t)])
where FEAT[·] denotes the log power spectrum of the corresponding audio signal, T is the frame index, F is the frequency index, C denotes the two spliced channels, and concat is the splicing (concatenation) function.
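A sketch of this feature extraction step under stated assumptions: an STFT front end with illustrative frame and hop sizes (the patent does not specify them), numpy as the only dependency, and e(t) and r(t) assumed to have equal length.

```python
import numpy as np

def log_power_spectrum(x, n_fft=512, hop=256):
    """FEAT[x]: log power spectrum, shape (T, F)."""
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    spec = np.fft.rfft(frames * np.hanning(n_fft), axis=-1)
    return np.log(np.abs(spec) ** 2 + 1e-8)

def make_features(e, r):
    """X = concat(FEAT[e(t)], FEAT[r(t)]), shape (T, F, C=2)."""
    return np.stack([log_power_spectrum(e), log_power_spectrum(r)], axis=-1)
```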
And S420, inputting the frequency domain characteristics into a pre-trained target model to obtain an audio signal time-frequency mask corresponding to the target audio signal.
The frequency-domain features extracted in the above step are input into the pre-trained target model, which predicts the audio-signal time-frequency mask mask_speech corresponding to the target audio signal.
And S430, obtaining the frequency spectrum of the target audio signal according to the frequency spectrum of the residual signal and the audio signal time-frequency mask.
The spectrum of the enhanced target audio signal is computed from the spectrum of the residual signal and the audio-signal time-frequency mask mask_speech, with the specific formula:
enhanced audio spectrum = mask_speech · residual signal spectrum
And step S440, converting the frequency spectrum of the target audio signal into a time domain signal to obtain an enhanced target audio signal.
Finally, the phase of the target audio signal is reused, and the spectrum of the enhanced target audio signal is converted back to the time domain using an inverse short-time Fourier transform (ISTFT), yielding the enhanced target audio signal.
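A brief sketch of steps S430 and S440, assuming a real-valued time-frequency mask and a caller-supplied ISTFT routine matching the analysis front end; both assumptions are mine rather than details given in the patent.

```python
import numpy as np

def reconstruct(residual_spec, mask_speech, istft):
    """Apply the predicted time-frequency mask and return a time-domain signal."""
    enhanced_spec = mask_speech * residual_spec   # element-wise masking of the residual spectrum
    return istft(enhanced_spec)                   # inverse STFT back to the time domain
```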
In this exemplary embodiment, by combining conventional signal processing with a deep learning method, the echo cancellation effect can be improved and higher audio quality can be ensured at relatively low system complexity. In this way, not only the echo but also the noise interference can be eliminated, preserving audio intelligibility to the greatest extent and further improving the user experience.
In addition, the echo cancellation method provided in this exemplary embodiment may further include a training process of a filter and a target model.
In this exemplary embodiment, as shown in fig. 5, the filter training method may specifically include the following steps:
step S510, obtaining a plurality of groups of training data, wherein each group of training data comprises original audio training data, reference training data corresponding to the original audio training data, and residual training data in the original audio training data.
First, a plurality of sets of training data required for training the adaptive filter and the neural network model need to be prepared as required, where each set of training data includes original audio training data, reference training data corresponding to the original audio training data, and residual training data included in the original audio training data.
The method for generating the residual training data comprises the following steps: acquiring original audio training data acquired by a microphone of a user terminal, wherein the original audio training data comprises original voice training data and original music training data; carrying out variable speed processing, reverberation processing and delay processing on the original audio training data to obtain simulated echo data corresponding to the original audio training data; and obtaining residual training data corresponding to the original audio training data according to the difference value between the original audio training data and the corresponding simulated echo data.
The far-end echo data in the training data may include background music and voice echo signals, while the original audio training data is mainly voice and singing data and consists of two parts, namely original voice training data and original music training data. To obtain better audio quality, data processing is indispensable: the original audio training data is augmented through various processing methods so that the training data better matches the actual distribution, laying the groundwork for higher sound quality.
For example, the following data augmentation methods may mainly be used in the present exemplary embodiment (a code sketch follows this list):
(1) A large amount of karaoke music is added to the training data set, keeping the ratio of original voice training data to original music training data at about 1:1.
(2) Variable-speed and reverberated versions of the original speech training data in the original audio training data are added to simulate the corresponding echo data.
(3) Delay is added to the original audio training data.
(4) The signal-to-noise ratio of the noise and the echo is adjusted within a reasonable range: the signal-to-noise ratio is generally set between 0 dB and 20 dB, and the signal-to-echo ratio between -20 dB and 20 dB.
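A hedged sketch of the augmentation steps above; librosa is assumed for time-stretching, and the stretch rate, delay, and room impulse response are illustrative parameters rather than values from the patent.

```python
import numpy as np
import librosa

def augment_echo(r, sr, rate=1.05, delay_ms=80, rir=None):
    """Variable-speed + (optional) reverberation + delay applied to a far-end signal r."""
    x = librosa.effects.time_stretch(r, rate=rate)    # variable-speed processing
    if rir is not None:
        x = np.convolve(x, rir)[: len(x)]             # reverberation via a room impulse response
    pad = int(sr * delay_ms / 1000)
    return np.concatenate([np.zeros(pad), x])         # delay processing

def mix_at_snr(target, interference, snr_db):
    """Scale `interference` so the mix has the requested signal-to-noise (or -echo) ratio."""
    p_t = np.mean(target ** 2)
    p_i = np.mean(interference[: len(target)] ** 2) + 1e-12
    g = np.sqrt(p_t / (p_i * 10 ** (snr_db / 10)))
    return target + g * interference[: len(target)]
```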
And S520, convolving the reference training data with the initial filter weight of the filter to obtain linear echo prediction data.
In this exemplary embodiment, an adaptive filter algorithm based on NLMS may be used to convolve the reference training data with the initial filter weight of the adaptive filter to obtain linear echo prediction data.
And S530, obtaining residual prediction data according to the difference value between the original audio training data and the linear echo prediction data.
And subtracting the linear echo prediction data from the original audio training data to obtain corresponding residual prediction data.
And S540, iterating the initial filter weight of the filter according to the difference between the residual prediction data and the residual training data so as to train the filter.
The minimum mean-square-error principle is applied to the difference between the residual prediction data and the residual training data; through continuous iteration, the initial filter weights of the adaptive filter eventually converge to the path closest to the linear echo, completing the training of the adaptive filter.
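A compact NLMS sketch corresponding to steps S520 through S540; the filter length, step size mu, and regularizer eps are illustrative assumptions rather than values taken from the patent, and the reference r is assumed to be at least as long as the microphone signal d.

```python
import numpy as np

def nlms(d, r, taps=256, mu=0.5, eps=1e-6):
    """Return the residual e and learned weights h_hat for mic signal d and reference r."""
    h_hat = np.zeros(taps)
    e = np.zeros(len(d))
    for t in range(taps, len(d)):
        x = r[t - taps:t][::-1]                    # most recent reference samples, newest first
        y_hat = h_hat @ x                          # estimated linear echo sample
        e[t] = d[t] - y_hat                        # residual (prediction error)
        h_hat += mu * e[t] * x / (x @ x + eps)     # normalized LMS weight update
    return e, h_hat
```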
In this exemplary embodiment, as shown in fig. 6, the training method of the target model may specifically include the following steps:
and S610, taking residual training data and reference training data in each group of training data as first training data, and training the initial neural network model according to the first training data to obtain a first neural network model.
Fig. 7 is a schematic diagram of a neural network model according to one embodiment of the present disclosure. The neural network model uses a cross network and can be used for nonlinear echo cancellation and noise suppression. The cross network mainly comprises several convolution modules and GRU (Gated Recurrent Unit) modules; the modules are cross-connected, and the outputs of the previous layer are cross-spliced as the input of the next layer, so that the latent relationship between the two tasks can be exploited simultaneously. One task estimates the speech or singing signal that the near end wants to retain, and its output is the audio-signal time-frequency mask mask_speech; the other task estimates the echo and noise signals, and its output is the interference-signal time-frequency mask mask_interference.
As shown in fig. 7, each branch of the cross network comprises 4 convolution modules (Conv blocks) and 3 GRU modules (GRU blocks), and each convolution module consists of a two-dimensional convolution (Conv2D), a batch normalization (BN) module, and a ReLU activation function. The output of the GRU layer of the last branch is taken as the input of the Dense layer, which finally outputs the time-frequency mask values.
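The following PyTorch sketch is one possible reading of the cross-network architecture above: two parallel branches of 4 convolution blocks and 3 GRU blocks whose outputs are cross-concatenated between layers, each ending in a Dense layer that emits a time-frequency mask. All channel and hidden sizes, and the exact placement of the final Dense layer, are assumptions, since FIG. 7 is not reproduced here.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # Conv2D + BatchNorm + ReLU, as described for each Conv block
    return nn.Sequential(nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                         nn.BatchNorm2d(cout),
                         nn.ReLU())

class CrossNetSketch(nn.Module):
    """Two cross-connected branches producing mask_speech and mask_interference."""

    def __init__(self, freq_bins=257, ch=16, hidden=128):
        super().__init__()
        cins = [2, 2 * ch, 2 * ch, 2 * ch]          # cross-concatenation doubles the input channels
        self.conv_a = nn.ModuleList([conv_block(c, ch) for c in cins])
        self.conv_b = nn.ModuleList([conv_block(c, ch) for c in cins])
        gru_in = [2 * ch * freq_bins, 2 * hidden, 2 * hidden]
        self.gru_a = nn.ModuleList([nn.GRU(i, hidden, batch_first=True) for i in gru_in])
        self.gru_b = nn.ModuleList([nn.GRU(i, hidden, batch_first=True) for i in gru_in])
        self.dense_a = nn.Linear(2 * hidden, freq_bins)   # -> mask_speech
        self.dense_b = nn.Linear(2 * hidden, freq_bins)   # -> mask_interference

    def forward(self, x):                           # x: (batch, 2, frames, freq_bins)
        a, b = self.conv_a[0](x), self.conv_b[0](x)
        for la, lb in zip(self.conv_a[1:], self.conv_b[1:]):
            cross = torch.cat([a, b], dim=1)        # cross-splice the two branches
            a, b = la(cross), lb(cross)
        n, c, t, f = a.shape                        # flatten (channels, freq) for the GRUs
        a = a.permute(0, 2, 1, 3).reshape(n, t, c * f)
        b = b.permute(0, 2, 1, 3).reshape(n, t, c * f)
        for ga, gb in zip(self.gru_a, self.gru_b):
            cross = torch.cat([a, b], dim=-1)
            a, _ = ga(cross)
            b, _ = gb(cross)
        cross = torch.cat([a, b], dim=-1)
        return torch.sigmoid(self.dense_a(cross)), torch.sigmoid(self.dense_b(cross))
```

Each returned mask has shape (batch, frames, freq_bins) and would multiply the residual spectrum element-wise, as in step S430.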
In this exemplary embodiment, the linear echo cancellation part may be combined with the training process of the neural network model, and the training process of the neural network model may be divided into two stages.
In the first stage, linear echo cancellation may be added to train the neural network model, so that the model converges to obtain a first neural network model. Thus, the training data of the first stage are the residual training data and the reference training data in each set of training data.
In this example embodiment, as shown in fig. 8, training the initial neural network model according to the first training data to obtain the first neural network model may specifically include the following steps:
and step S810, obtaining corresponding frequency domain characteristic training data according to the first training data, and inputting the frequency domain characteristic training data into the initial neural network model to obtain an audio sample time-frequency mask and an interference sample time-frequency mask.
First, frequency-domain features are extracted from the residual training data (which contains no linear echo) and the reference training data in the first training data to obtain frequency-domain feature training data, which is then input into the initial neural network model to obtain two time-frequency mask samples, namely the audio-sample time-frequency mask mask_speech and the interference-sample time-frequency mask mask_interference.
And S820, obtaining a sample frequency spectrum of the enhanced target audio signal according to the sample frequency spectrum of the residual training data and the audio sample time-frequency mask, and obtaining a sample frequency spectrum of the interference signal according to the sample frequency spectrum of the residual training data and the interference sample time-frequency mask.
The sample spectrum of the residual training data is multiplied element-wise by the audio-sample time-frequency mask to obtain the sample spectrum of the enhanced target audio signal, and by the interference-sample time-frequency mask to obtain the sample spectrum of the interference signal. The specific formulas are as follows:
sample spectrum of the target audio signal = mask_speech · sample spectrum of the residual training data
sample spectrum of the interference signal = mask_interference · sample spectrum of the residual training data
And S830, determining spectrum distance loss according to the sample spectrum of the enhanced target audio signal and the sample spectrum of the interference signal, and determining signal-to-noise ratio loss according to the signal-to-noise ratio parameter of the enhanced target audio signal and the signal-to-noise ratio parameter of the interference signal.
In this example embodiment, the audio spectrum distance loss may be determined according to a sample spectrum of residual training data and a sample spectrum of an enhanced target audio signal, then interference training data in the training data is obtained, the interference spectrum distance loss is determined according to the sample spectrum of the interference training data and the sample spectrum of the interference signal, and finally the spectrum distance loss is obtained according to the audio spectrum distance loss and the interference spectrum distance loss.
In the present exemplary embodiment, two loss functions are selected, namely the OSISNR (optimal scale-invariant signal-to-noise ratio) function and the CSD (compressed spectral distance) function. The OSISNR function measures the distortion between the original clean audio and the enhanced audio, and the signal-to-noise ratio loss is derived from it; the CSD function measures the spectral mismatch between the original clean audio and the enhanced audio, and the spectral distance loss is derived from it. With these two loss functions, both acoustic echo cancellation and noise suppression can be achieved. The specific formulas are as follows:
[The four loss-function formulas are given as images in the original publication.]
where L_OSISNR is the signal-to-noise ratio loss, L_CSD is the spectral distance loss, L_OSISNR^speech and L_CSD^speech denote the OSISNR loss and spectral distance loss of the audio portion, and L_OSISNR^interference and L_CSD^interference denote the OSISNR loss and spectral distance loss of the interference portion. S(t, f) denotes the audio magnitude spectrum, t is the time index, f is the frequency index, and the exponential factor α is 0.3.
And S840, obtaining overall loss according to the spectral distance loss and the signal-to-noise ratio loss, and iterating the neural network parameters in the initial neural network model according to the overall loss to obtain the trained first neural network model.
The overall loss L = L_OSISNR + γ·L_CSD is obtained from the signal-to-noise ratio loss and the spectral distance loss, where the weighting factor γ is 10. The neural network parameters in the initial neural network model are then iterated according to the overall loss L to obtain the first neural network model.
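Because the patent's exact OSISNR and CSD formulas appear only as images, the sketch below uses the standard scale-invariant SNR and a power-law-compressed (α = 0.3) spectral distance as stand-in definitions, and shows only one branch of the speech/interference pair; it is an assumption-laden illustration, not the patent's exact formulation.

```python
import torch

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between estimated and reference waveforms."""
    ref = ref - ref.mean(dim=-1, keepdim=True)
    est = est - est.mean(dim=-1, keepdim=True)
    proj = (torch.sum(est * ref, dim=-1, keepdim=True) /
            (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)) * ref
    noise = est - proj
    snr = 10 * torch.log10((proj.pow(2).sum(-1) + eps) / (noise.pow(2).sum(-1) + eps))
    return -snr.mean()

def csd_loss(est_mag, ref_mag, alpha=0.3):
    """Compressed spectral distance between magnitude spectra |S(t, f)|."""
    return torch.mean((est_mag.clamp_min(1e-8) ** alpha - ref_mag.clamp_min(1e-8) ** alpha) ** 2)

def overall_loss(est_wav, ref_wav, est_mag, ref_mag, gamma=10.0):
    """L = L_OSISNR + gamma * L_CSD for one branch (speech or interference)."""
    return si_snr_loss(est_wav, ref_wav) + gamma * csd_loss(est_mag, ref_mag)
```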
And S620, taking the original audio training data and the reference training data in each group of training data as second training data, and obtaining mixed training data according to the first training data and the second training data according to a preset data proportion.
For the second stage of the neural network model training, the original audio training data and the reference training data in each set of training data can be used as second training data, and then the first training data and the second training data are mixed according to a preset data proportion to obtain mixed training data required by the second stage training. For example, the residual training data not containing linear echo may be used with a probability of 10%, the original audio training data containing linear echo may be used with a probability of 90%, and the reference training data corresponding to the two data may be added to form the mixed training data required for the second stage training.
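As a small illustration of the preset data proportion above, the sampler below draws from the first-stage (residual) pool with probability 0.1 and from the second-stage (original audio) pool otherwise; the data structures are assumed, not specified in the patent.

```python
import random

def sample_mixed(first_stage_pool, second_stage_pool, p_first=0.1):
    """Pick one (input, reference) training pair according to the preset proportion."""
    pool = first_stage_pool if random.random() < p_first else second_stage_pool
    return random.choice(pool)
```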
And S630, training the first neural network model according to the mixed training data to obtain a target model.
And continuously training the first neural network model obtained by the training in the first stage based on the mixed training data until the model converges again to obtain the final target model. The specific training process is similar to that in steps S810 to S840, and the difference is only that the used training data is different, which is not described herein again.
After the neural network model training is completed, test data can be obtained according to reference training data and original audio training data in each group of training data, and the target model is tested according to the test data.
In the testing stage of the neural network model, the original audio training data containing the linear echo can be used as testing data to test the neural network model, verify the effect of the neural network model on eliminating the interference signal and optimize the model.
In the present exemplary embodiment, the first stage of neural network model training uses residual training data that contains no linear echo, so that the linear AEC clarifies what the neural network model should eliminate. However, linear AEC is not used at the testing stage, because it impairs the sound quality of music and degrades the user experience. To keep the model matched to this condition, the second-stage training must therefore continue from the converged first-stage model until the model converges again, giving the final neural network model. By using the linear AEC to guide the training of the neural network model in stages, the convergence rate and the complexity of the model can be better balanced, eliminating echo to the greatest extent without damaging the sound quality of the music.
It should be noted that although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order or that all of the depicted steps must be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Further, the present disclosure also provides an echo cancellation device. Referring to fig. 9, the echo cancellation apparatus may include an audio signal obtaining module 910, a linear echo cancellation module 920, and an interference signal cancellation module 930. Wherein:
the audio signal obtaining module 910 is configured to obtain an original audio signal and a reference signal required by a filter for echo cancellation;
the linear echo cancellation module 920 is configured to input the original audio signal and the reference signal into the filter for linear echo filtering processing, so as to obtain a residual signal;
the interference signal elimination module 930 is configured to perform interference signal elimination and effective signal enhancement on the residual signal based on the target model, resulting in a target audio signal, wherein the interference signal includes a non-linear echo signal and a noise signal.
In some exemplary embodiments of the present disclosure, the linear echo cancellation module 920 may include a linear echo estimation unit and a linear echo cancellation unit. Wherein:
the linear echo prediction unit is configured to convolve the reference signal with the filter weight of the filter to obtain an estimated value of the linear echo signal;
the linear echo cancellation unit is configured to derive a residual signal remaining in the original audio signal from a difference between the original audio signal and an estimated value of the linear echo signal.
In some exemplary embodiments of the present disclosure, the interference signal cancellation module 930 may include a frequency domain characteristic determination unit, a time-frequency mask determination unit, a signal spectrum determination unit, and a signal spectrum conversion unit. Wherein:
the frequency domain characteristic determining unit is configured to obtain a frequency domain characteristic corresponding to the original audio signal according to the residual signal and the reference signal;
the time-frequency mask determining unit is configured to input the frequency domain characteristics into a pre-trained target model to obtain an audio signal time-frequency mask corresponding to a target audio signal;
the signal spectrum determination unit is configured to obtain a frequency spectrum of the target audio signal according to the frequency spectrum of the residual signal and the audio signal time-frequency mask;
the signal spectrum conversion unit is configured to convert the spectrum of the target audio signal into a time domain signal, resulting in an enhanced target audio signal.
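By way of illustration only, the processing chain implemented by these four units may be sketched as follows in Python (NumPy/SciPy). The magnitude features, the concatenation of residual and reference magnitudes, the callable target_model and the STFT parameters are assumptions of this sketch; the embodiment only requires that the frequency-domain characteristics be derived from the residual signal and the reference signal and fed to the pre-trained target model.

    import numpy as np
    from scipy.signal import stft, istft

    def enhance_residual(residual, reference, target_model, fs=16000, nperseg=512):
        # Frequency-domain characteristics derived from the residual and reference signals.
        _, _, R = stft(residual, fs=fs, nperseg=nperseg)   # spectrum of the residual signal
        _, _, X = stft(reference, fs=fs, nperseg=nperseg)  # spectrum of the reference signal
        features = np.concatenate([np.abs(R), np.abs(X)], axis=0)

        # The pre-trained target model outputs the audio signal time-frequency mask.
        mask = target_model(features)                      # assumed to match the shape of R

        # Apply the mask to the residual spectrum and convert back to the time domain.
        target_spectrum = mask * R
        _, enhanced = istft(target_spectrum, fs=fs, nperseg=nperseg)
        return enhanced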
In some exemplary embodiments of the present disclosure, an echo cancellation device provided by the present disclosure may further include a filter training module, which may include a training data acquisition unit, a linear echo prediction data determination unit, a residual prediction data determination unit, and a filter weight iteration unit. Wherein:
the training data acquisition unit is configured to acquire a plurality of groups of training data, wherein each group of training data comprises original audio training data, reference training data corresponding to the original audio training data, and residual training data in the original audio training data;
the linear echo prediction data determination unit is configured to convolve the reference training data with the initial filter weight of the filter to obtain linear echo prediction data;
the residual prediction data determination unit is configured to obtain residual prediction data according to a difference value between the original audio training data and the linear echo prediction data;
the filter weight iteration unit is configured to iterate initial filter weights of the filter according to a difference between the residual prediction data and the residual training data to train the filter.
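By way of illustration only, one possible realization of this weight iteration is an NLMS-style update driven by the difference between the residual prediction data and the residual training data, as sketched below in Python (NumPy). The step size mu, the sample-wise update and the batch-style error computation are assumptions of this sketch; the embodiment does not prescribe a particular adaptation algorithm.

    import numpy as np

    def filter_training_step(weights, reference, original, residual_target, mu=0.5, eps=1e-8):
        M = len(weights)
        # Linear echo prediction data: reference training data convolved with the filter weights.
        echo_pred = np.convolve(reference, weights, mode="full")[: len(reference)]
        # Residual prediction data: original audio training data minus the linear echo prediction.
        residual_pred = original - echo_pred
        # Difference between the residual prediction data and the residual training data.
        error = residual_pred - residual_target
        # NLMS-style update of the filter weights driven by that difference.
        for n in range(M - 1, len(reference)):
            x = reference[n - M + 1 : n + 1][::-1]   # reference[n], reference[n-1], ..., reference[n-M+1]
            weights = weights + mu * error[n] * x / (np.dot(x, x) + eps)
        return weights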
In some exemplary embodiments of the present disclosure, an echo cancellation device provided by the present disclosure may further include a target model training module, which may include an initial neural network model training unit, a hybrid training data determination unit, and a first neural network model training unit. Wherein:
the initial neural network model training unit is configured to take residual training data and reference training data in each group of training data as first training data, and train the initial neural network model according to the first training data to obtain a first neural network model;
the mixed training data determining unit is configured to take the original audio training data and the reference training data in each set of training data as second training data, and to combine the first training data and the second training data according to a preset data proportion to obtain mixed training data;
the first neural network model training unit is configured to train the first neural network model according to the mixed training data, and obtain a target model.
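By way of illustration only, the construction of the mixed training data may be sketched as follows in Python; the preset data proportion is written as the parameter proportion, and both first_data and second_data are assumed to be lists of sample pairs. Stage one would train the initial neural network model on first_data alone until convergence, and stage two would continue training the converged first neural network model on the output of this function until it converges again.

    import random

    def build_mixed_training_data(first_data, second_data, proportion=0.5):
        # 'proportion' is the preset share of first training data (residual + reference)
        # in the mix; the remainder comes from the second training data (original audio
        # + reference), which still contains linear echo.
        total = min(len(first_data), len(second_data))
        n_first = int(total * proportion)
        n_second = total - n_first
        mixed = random.sample(first_data, n_first) + random.sample(second_data, n_second)
        random.shuffle(mixed)
        return mixed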
In some exemplary embodiments of the present disclosure, the target model training module may further include a target model testing unit configured to obtain test data according to reference training data and original audio training data in each set of training data, and test the target model according to the test data.
In some exemplary embodiments of the present disclosure, the initial neural network model training unit may include a sample time-frequency mask determining unit, a sample spectrum determining unit, a network loss determining unit, and a neural network parameter iterating unit. Wherein:
the sample time-frequency mask determining unit is configured to obtain corresponding frequency-domain feature training data according to the first training data, and input the frequency-domain feature training data into the initial neural network model to obtain an audio sample time-frequency mask and an interference sample time-frequency mask;
the sample spectrum determination unit is configured to obtain a sample spectrum of the enhanced target audio signal according to the sample spectrum of the residual training data and the audio sample time-frequency mask, and obtain a sample spectrum of the interference signal according to the sample spectrum of the residual training data and the interference sample time-frequency mask;
the network loss determining unit is configured to determine a spectral distance loss according to the sample spectrum of the enhanced target audio signal and the sample spectrum of the interference signal, and determine a signal-to-noise ratio loss according to a signal-to-noise ratio parameter of the enhanced target audio signal and a signal-to-noise ratio parameter of the interference signal;
the neural network parameter iteration unit is configured to obtain an overall loss according to the spectral distance loss and the signal-to-noise ratio loss, and iterate the neural network parameters in the initial neural network model according to the overall loss to obtain a trained first neural network model.
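By way of illustration only, the combination of the two loss terms may be sketched as follows in Python (NumPy). Using the scale-invariant SNR as the signal-to-noise ratio parameter and summing the terms with a weighting factor alpha are assumptions of this sketch, not details fixed by the embodiment.

    import numpy as np

    def si_snr(estimate, target, eps=1e-8):
        # One possible signal-to-noise ratio parameter: the scale-invariant SNR of an
        # estimated signal with respect to its target signal, in decibels.
        energy = np.dot(target, target) + eps
        projection = np.dot(estimate, target) / energy * target
        noise = estimate - projection
        return 10.0 * np.log10((np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps))

    def overall_loss(spectral_distance_loss, snr_audio, snr_interference, alpha=1.0):
        # Signal-to-noise ratio loss: negated so that higher SNR values of the enhanced
        # target audio and of the estimated interference signal reduce the loss.
        snr_loss = -(snr_audio + snr_interference)
        # The overall loss combines the spectral distance loss and the SNR loss.
        return spectral_distance_loss + alpha * snr_loss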
In some exemplary embodiments of the present disclosure, the network loss determination unit may include an audio spectrum distance loss determination unit, an interference spectrum distance loss determination unit, and a spectrum distance loss determination unit. Wherein:
the audio spectral distance loss determination unit is configured to determine an audio spectral distance loss according to a sample spectrum of the residual training data and a sample spectrum of the enhanced target audio signal;
the interference spectrum distance loss determining unit is configured to acquire interference training data in the training data and determine interference spectrum distance loss according to a sample spectrum of the interference training data and a sample spectrum of the interference signal;
the spectral distance loss determination unit is configured to derive a spectral distance loss from the audio spectral distance loss and the interference spectral distance loss.
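By way of illustration only, the composition of the spectral distance loss may be sketched as follows in Python (NumPy); the mean absolute difference of magnitude spectra is only one possible distance measure and is an assumption of this sketch.

    import numpy as np

    def spectral_distance_loss(residual_spec, enhanced_spec, interference_train_spec, interference_spec):
        # Audio spectral distance loss: between the sample spectrum of the residual
        # training data and that of the enhanced target audio signal.
        audio_loss = np.mean(np.abs(np.abs(residual_spec) - np.abs(enhanced_spec)))
        # Interference spectral distance loss: between the sample spectrum of the
        # interference training data and that of the estimated interference signal.
        interference_loss = np.mean(np.abs(np.abs(interference_train_spec) - np.abs(interference_spec)))
        # The spectral distance loss is obtained from both terms (here, their sum).
        return audio_loss + interference_loss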
In some exemplary embodiments of the present disclosure, the training data acquisition unit may include an original audio training data acquisition unit, a simulated echo data generation unit, and a residual training data generation unit. Wherein:
the original audio training data acquisition unit is configured to acquire original audio training data collected by a microphone of a user terminal, wherein the original audio training data comprises original voice training data and original music training data;
the simulated echo data generation unit is configured to perform variable speed processing, reverberation processing and delay processing on the original audio training data to obtain simulated echo data corresponding to the original audio training data;
the residual training data generating unit is configured to obtain residual training data corresponding to the original audio training data according to a difference value between the original audio training data and the corresponding simulated echo data.
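By way of illustration only, the generation of simulated echo data and residual training data may be sketched as follows in Python (NumPy/SciPy). The resampling-based speed change, the room impulse response rir, and the fixed delay in samples are assumptions of this sketch used to stand in for the variable speed, reverberation and delay processing described above.

    import numpy as np
    from scipy.signal import resample, fftconvolve

    def simulate_echo(original, rir, speed=1.02, delay_samples=160):
        # Variable speed processing: slight time-scaling via resampling.
        sped = resample(original, int(len(original) / speed))
        # Reverberation processing: convolution with a room impulse response.
        reverberant = fftconvolve(sped, rir, mode="full")[: len(sped)]
        # Delay processing: shift the signal by a fixed number of samples.
        delayed = np.concatenate([np.zeros(delay_samples), reverberant])
        # Trim or pad to the length of the original audio training data.
        out = np.zeros(len(original))
        n = min(len(original), len(delayed))
        out[:n] = delayed[:n]
        return out

    def make_residual_training_data(original, simulated_echo):
        # Residual training data: original audio training data minus its simulated echo.
        return original - simulated_echo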
The details of each module/unit in the echo cancellation device are already described in detail in the corresponding method embodiment section, and are not described herein again.
FIG. 10 illustrates a schematic structural diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present disclosure.
It should be noted that the computer system 1000 of the electronic device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 10, the computer system 1000 includes a Central Processing Unit (CPU) 1001 that can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for system operation are also stored. The CPU 1001, the ROM 1002, and the RAM 1003 are connected to each other via a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is installed into the storage section 1008 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1009 and/or installed from the removable medium 1011. When the computer program is executed by the Central Processing Unit (CPU) 1001, various functions defined in the system of the present application are executed.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method as described in the embodiments above.
It should be noted that although several modules of the device for action execution are mentioned in the above detailed description, this division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more of the modules described above may be embodied in one module. Conversely, the features and functions of one module described above may be further divided and embodied by a plurality of modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for canceling echo, comprising:
acquiring an original audio signal and a reference signal required by a filter for echo cancellation;
inputting the original audio signal and the reference signal into the filter to perform linear echo filtering processing to obtain a residual signal;
and carrying out interference signal elimination and effective signal enhancement on the residual signal based on a target model to obtain a target audio signal, wherein the interference signal comprises a nonlinear echo signal and a noise signal.
2. The method of claim 1, wherein the inputting the original audio signal and the reference signal into the filter for linear echo filtering to obtain a residual signal comprises:
convolving the reference signal with the filter weight of the filter to obtain an estimated value of a linear echo signal;
and obtaining residual signals remained in the original audio signal according to the difference value between the original audio signal and the estimated value of the linear echo signal.
3. The method of canceling echo according to claim 1, wherein the performing interference signal cancellation and effective signal enhancement on the residual signal based on the target model to obtain a target audio signal comprises:
obtaining a frequency domain characteristic corresponding to the original audio signal according to the residual signal and the reference signal;
inputting the frequency domain characteristics into a pre-trained target model to obtain an audio signal time-frequency mask corresponding to a target audio signal;
obtaining the frequency spectrum of the target audio signal according to the frequency spectrum of the residual signal and the audio signal time-frequency mask;
and converting the frequency spectrum of the target audio signal into a time domain signal to obtain an enhanced target audio signal.
4. The method of canceling echo according to claim 1, wherein the method of training the filter comprises:
acquiring a plurality of groups of training data, wherein each group of training data comprises original audio training data, reference training data corresponding to the original audio training data, and residual training data in the original audio training data;
convolving the reference training data with the initial filter weight of the filter to obtain linear echo prediction data;
obtaining residual prediction data according to a difference value between the original audio training data and the linear echo prediction data;
iterating initial filter weights for the filter according to a difference between the residual prediction data and the residual training data to train the filter.
5. The method for canceling echo according to claim 4, wherein the training method of the target model comprises:
taking the residual training data and the reference training data in each group of training data as first training data, and training an initial neural network model according to the first training data to obtain a first neural network model;
taking the original audio training data and the reference training data in each group of training data as second training data, and obtaining mixed training data according to a preset data proportion and the first training data and the second training data;
and training the first neural network model according to the mixed training data to obtain a target model.
6. The method of canceling echo according to claim 5, wherein said training an initial neural network model according to the first training data to obtain a first neural network model comprises:
obtaining corresponding frequency domain characteristic training data according to the first training data, and inputting the frequency domain characteristic training data into an initial neural network model to obtain an audio sample time-frequency mask and an interference sample time-frequency mask;
obtaining a sample frequency spectrum of the enhanced target audio signal according to the sample frequency spectrum of the residual training data and the audio sample time-frequency mask, and obtaining a sample frequency spectrum of an interference signal according to the sample frequency spectrum of the residual training data and the interference sample time-frequency mask;
determining spectral distance loss according to the sample frequency spectrum of the enhanced target audio signal and the sample frequency spectrum of the interference signal, and determining signal-to-noise ratio loss according to the signal-to-noise ratio parameter of the enhanced target audio signal and the signal-to-noise ratio parameter of the interference signal;
and obtaining the overall loss according to the spectral distance loss and the signal-to-noise ratio loss, and iterating the neural network parameters in the initial neural network model according to the overall loss to obtain a trained first neural network model.
7. An echo cancellation device, comprising:
the audio signal acquisition module is configured to acquire an original audio signal and a reference signal required by echo cancellation of the filter;
a linear echo cancellation module configured to input the original audio signal and the reference signal into the filter for linear echo filtering processing to obtain a residual signal;
and the interference signal eliminating module is configured to perform interference signal elimination and effective signal enhancement on the residual signal based on a target model to obtain a target audio signal, wherein the interference signal comprises a nonlinear echo signal and a noise signal.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of canceling echo according to any one of claims 1 to 6.
9. A computer-readable storage medium having instructions stored thereon which, when executed by a processor of an electronic device, enable the electronic device to perform the method of canceling echo according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the method of cancellation of echo according to any one of claims 1 to 6.
CN202210701124.7A 2022-06-20 2022-06-20 Echo cancellation method and device, electronic equipment and computer readable medium Pending CN115083431A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210701124.7A CN115083431A (en) 2022-06-20 2022-06-20 Echo cancellation method and device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210701124.7A CN115083431A (en) 2022-06-20 2022-06-20 Echo cancellation method and device, electronic equipment and computer readable medium

Publications (1)

Publication Number Publication Date
CN115083431A true CN115083431A (en) 2022-09-20

Family

ID=83254402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210701124.7A Pending CN115083431A (en) 2022-06-20 2022-06-20 Echo cancellation method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN115083431A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116013337A (en) * 2023-01-10 2023-04-25 北京百度网讯科技有限公司 Audio signal processing method, training method, device, equipment and medium for model
CN116013337B (en) * 2023-01-10 2023-12-29 北京百度网讯科技有限公司 Audio signal processing method, training method, device, equipment and medium for model

Similar Documents

Publication Publication Date Title
RU2495506C2 (en) Apparatus and method of calculating control parameters of echo suppression filter and apparatus and method of calculating delay value
KR20200115107A (en) System and method for acoustic echo cancelation using deep multitask recurrent neural networks
CN107123430A (en) Echo cancel method, device, meeting flat board and computer-readable storage medium
Lee et al. DNN-based residual echo suppression.
CN111292759A (en) Stereo echo cancellation method and system based on neural network
CN112634923B (en) Audio echo cancellation method, device and storage medium based on command scheduling system
CN111031448B (en) Echo cancellation method, echo cancellation device, electronic equipment and storage medium
Zhang et al. Multi-task deep residual echo suppression with echo-aware loss
CN112820315A (en) Audio signal processing method, audio signal processing device, computer equipment and storage medium
CN101820302B (en) Device and method for canceling echo
CN114792524B (en) Audio data processing method, apparatus, program product, computer device and medium
CN112750444A (en) Sound mixing method and device and electronic equipment
CN115132215A (en) Single-channel speech enhancement method
CN106161820B (en) A kind of interchannel decorrelation method for stereo acoustic echo canceler
CN115083431A (en) Echo cancellation method and device, electronic equipment and computer readable medium
JP5762479B2 (en) Voice switch device, voice switch method, and program thereof
CN111883154A (en) Echo cancellation method and apparatus, computer-readable storage medium, and electronic apparatus
CN114333893A (en) Voice processing method and device, electronic equipment and readable medium
CN112634927B (en) Short wave channel voice enhancement method
CN111028854B (en) Audio data processing method and device, electronic equipment and storage medium
CN115620737A (en) Voice signal processing device, method, electronic equipment and sound amplification system
Wang et al. A frequency-domain nonlinear echo processing algorithm for high quality hands-free voice communication devices
CN114333892A (en) Voice processing method and device, electronic equipment and readable medium
CN113113038A (en) Echo cancellation method and device and electronic equipment
JP5774062B2 (en) Echo canceling apparatus, echo canceling method, and program thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination