CN111883154A

CN111883154A - Echo cancellation method and apparatus, computer-readable storage medium, and electronic apparatus

Info

Publication number: CN111883154A
Application number: CN202010693855.2A
Authority: CN
Inventors: 马路; 赵培; 苏腾荣
Original assignee: Haier Uplus Intelligent Technology Beijing Co Ltd
Current assignee: Haier Uplus Intelligent Technology Beijing Co Ltd
Priority date: 2020-07-17
Filing date: 2020-07-17
Publication date: 2020-11-03
Anticipated expiration: 2040-07-17
Also published as: CN111883154B

Abstract

The invention provides an echo cancellation method and device, a computer-readable storage medium, and an electronic device. The echo cancellation method comprises the following steps: estimating an echo signal in a sound source signal according to a reference signal and echo detection information to obtain echo estimation information, wherein the sound source signal is an audio signal received by an audio input channel of a terminal, the reference signal is an audio signal in an audio output channel of the terminal, and the echo detection information indicates the probability that an echo signal exists in the sound source signal; obtaining output information according to the sound source signal, the echo estimation information, and a preset first neural network model; and eliminating the echo signal in the sound source signal according to the output information. The invention addresses the problem in the related art that some residual echo remains after echo cancellation and degrades the performance of voice signal processing, thereby improving echo cancellation and, in turn, the performance of voice signal processing.

Description

Echo cancellation method and apparatus, computer-readable storage medium, and electronic apparatus

Technical Field

The present invention relates to the field of audio signal processing, and in particular, to an echo cancellation method and apparatus, a computer-readable storage medium, and an electronic apparatus.

Background

Voice signal processing is currently a key technology in the field of human-computer interaction. Within voice signal processing, the echo cancellation algorithm cancels the device's own played-back voice signal as picked up by the device microphone; it is a key part of the overall voice signal processing and voice enhancement chain and plays an extremely important role in back-end voice recognition.

Fig. 1 is a schematic diagram of an echo cancellation method provided according to the related art. As shown in Fig. 1, the related art mainly adopts the echo cancellation method of the open-source tool Web Real-Time Communication (WebRTC): an adaptive filter estimates the echo so as to cancel the linear echo, and nonlinear processing suppresses the residual nonlinear echo. This method cancels linear echo well. However, nonlinear echo and delay estimation errors introduce residual echo, and although the nonlinear processing suppresses this residual echo to some extent, the degree of suppression is limited, so some residual echo remains. For echo introduced by complex environments and nonlinear devices in particular, the suppression of residual echo is extremely limited, which degrades the final echo cancellation effect and reduces the performance of voice signal processing.
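For orientation, the adaptive-filter stage of the related art can be sketched with a normalized least-mean-squares (NLMS) filter. This is a generic textbook illustration and not the WebRTC implementation; the function name `nlms_echo_canceller`, the tap count, and the step size are assumptions chosen for the example.

```python
def nlms_echo_canceller(mic, ref, num_taps=8, step=0.5, eps=1e-8):
    """Cancel the linear echo of `ref` contained in `mic` with an NLMS filter.

    Returns the error signal (mic minus estimated echo) and the final taps.
    All parameter values are illustrative assumptions.
    """
    taps = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent reference samples, newest first
    out = []
    for d, x in zip(mic, ref):
        history = [x] + history[:-1]
        echo_est = sum(w * h for w, h in zip(taps, history))
        err = d - echo_est  # what remains after removing the estimated echo
        norm = sum(h * h for h in history) + eps
        # Normalized gradient step toward the taps that explain the echo.
        taps = [w + step * err * h / norm for w, h in zip(taps, history)]
        out.append(err)
    return out, taps
```

Such a filter can only model a linear echo path, which is why a residual remains when the loudspeaker or the environment behaves nonlinearly.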

In view of the above-mentioned problem that a certain residual echo still exists in the echo cancellation process in the related art, and further affects the performance of speech signal processing, an effective solution has not been proposed in the related art.

Disclosure of Invention

The embodiment of the invention provides an echo cancellation method and device, a computer readable storage medium and an electronic device, which are used for at least solving the problem that certain residual echo still exists in the echo cancellation process in the related technology so as to influence the performance of voice signal processing.

According to an embodiment of the present invention, there is provided an echo cancellation method including:

estimating an echo signal in the sound source signal according to the reference signal and the echo detection information to obtain echo estimation information; the sound source signal is an audio signal received by an audio input channel of a terminal, the reference signal is an audio signal in an audio output channel of the terminal, and the echo detection information is used for indicating the probability of the echo signal existing in the sound source signal;

obtaining output information according to the sound source signal, the echo estimation information and a preset first neural network model, and eliminating an echo signal in the sound source signal according to the output information; the first neural network model is obtained by training according to a sample sound source signal, a sample echo signal and sample output information.

According to another embodiment of the present invention, there is also provided an echo canceling device including:

the estimation module is used for estimating an echo signal in the sound source signal according to the reference signal and the echo detection information to obtain echo estimation information; the sound source signal is an audio signal received by an audio input channel of a terminal, the reference signal is an audio signal in an audio output channel of the terminal, and the echo detection information is used for indicating the probability of the echo signal existing in the sound source signal;

the elimination module is used for obtaining output information according to the sound source signal, the echo estimation information and a preset first neural network model, and eliminating the echo signal in the sound source signal according to the output information; the first neural network model is obtained by training according to a sample sound source signal, a sample echo signal and sample output information.

According to another embodiment of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above-described method embodiments when executed.

According to another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.

According to the invention, the echo signal in the sound source signal can be estimated according to the reference signal and the echo detection information to obtain the echo estimation information, the output information is further obtained according to the sound source signal, the echo estimation information and the preset first neural network model, and the echo signal in the sound source signal is eliminated according to the output information; the first neural network model is obtained by training according to a sample sound source signal, a sample echo signal and sample output information. Therefore, the invention can solve the problem that certain residual echo still exists in the echo cancellation process in the related technology so as to influence the performance of voice signal processing, thereby achieving the effect of improving the echo cancellation and further improving the performance of voice signal processing.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

fig. 1 is a schematic diagram of an echo cancellation method provided according to the related art;

fig. 2 is a functional diagram (one) of an echo cancellation system according to an embodiment of the present invention;

fig. 3 is a functional diagram (two) of an echo cancellation system according to an embodiment of the present invention;

fig. 4 is a schematic structural diagram of an echo cancellation system provided according to an embodiment of the present invention;

fig. 5 is a schematic structural diagram of a room impulse response generating unit provided according to an embodiment of the present invention;

fig. 6 is a flowchart illustrating the operation of an echo cancellation system according to an embodiment of the present invention;

fig. 7 is a flowchart of an echo cancellation method provided according to an embodiment of the present invention;

FIG. 8 is a flow chart of a method of training a neural network model provided in accordance with an embodiment of the present invention;

FIG. 9 is a training diagram of a training method of a neural network model provided in accordance with an embodiment of the present invention;

FIG. 10 is a flow chart of a method of training a neural network model provided in accordance with an embodiment of the present invention;

FIG. 11 is a schematic training diagram of a training method of a neural network model provided in accordance with an embodiment of the present invention;

fig. 12 is a block diagram of an echo cancellation device according to an embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Example 1

Fig. 2 is a functional schematic diagram (one) of an echo cancellation system according to an embodiment of the present invention. As shown in Fig. 2, the echo cancellation system in this embodiment includes:

a cancellation unit 102, including a preset first neural network model, configured to obtain output information according to the sound source signal, the echo estimation information, and the first neural network model; the cancellation unit 102 is further configured to cancel an echo signal in the sound source signal according to the output information;

wherein the sound source signal is an audio signal received by an audio input channel of a terminal, and the echo estimation information indicates an estimated value of the echo signal in the sound source signal;

the first neural network model is obtained by training according to the sample sound source signal, the sample echo signal and the sample output information.

It should be further described that the echo cancellation system in this embodiment is applied to a terminal with a voice signal processing function, where the terminal in the above embodiment may be a mobile phone, a tablet computer, a PC, a sound box, a vehicle-mounted system with a voice interaction function, and the like, and the present invention is not limited thereto; in the above embodiment, the sound source signal is a signal received by an audio input channel of the terminal, and the signal may include an echo signal to be cancelled; the audio input channel of the terminal is an input channel for the terminal to receive audio, for example, a microphone in a mobile phone.

It should be further noted that, in the above embodiment, since the first neural network model is obtained by training the sample sound source signal, the sample echo signal and the sample output information, the first neural network model may establish a relationship between the sample sound source signal, the sample echo signal and the sample output information; the sample echo signal corresponds to the echo estimation information in the above embodiment. Therefore, after the current sound source signal and the echo estimation information of the terminal are input into the first neural network model, corresponding output information can be obtained.

With the echo cancellation system in this embodiment, output information can be obtained by the cancellation unit according to a sound source signal, echo estimation information, and a preset first neural network model, so as to cancel the echo signal in the sound source signal by using the output information; the sound source signal is an audio signal received by an audio input channel of a terminal, and the echo estimation information is used for indicating an estimation value for estimating an echo signal in the sound source signal; the first neural network model is obtained by training according to a sample sound source signal, a sample echo signal and sample output information. Therefore, the echo cancellation system in this embodiment can solve the problem that a certain residual echo still exists in the echo cancellation process in the related art, and further affects the performance of speech signal processing, so as to achieve the effect of improving echo cancellation, and further improve the performance of speech signal processing.

In an optional embodiment, the echo cancellation system in this embodiment further includes:

the estimation unit 104 comprises a preset second neural network model, and is configured to obtain echo estimation information according to the sound source signal, the reference signal, the echo detection information and the second neural network model;

wherein, the reference signal is an audio signal in an audio output channel of the terminal, such as an output channel of a loudspeaker, and the echo detection information is used for indicating the probability of existence of an echo signal in a sound source signal;

the second neural network model is obtained by training according to the sample sound source signal, the sample reference signal, the sample echo detection information and the sample echo signal.

the detection unit 106 comprises a preset third neural network model, and is configured to obtain echo detection information according to the sound source signal, the reference signal and the third neural network model;

and the third neural network model is obtained by training according to the sample sound source signal, the sample reference signal and the sample echo detection information.

In the above alternative embodiment, the detection unit and the estimation unit may cooperate with the cancellation unit to form the echo cancellation system in this embodiment. Fig. 3 is a functional schematic diagram (two) of an echo cancellation system according to an embodiment of the present invention, and shows the functional connections among the detection unit, the estimation unit, and the cancellation unit. Fig. 4 is a schematic structural diagram of an echo cancellation system according to an embodiment of the present invention, and shows the connection structure of the detection unit, the estimation unit, and the cancellation unit.

It should be further noted that the reference signal is used to indicate an audio signal in an audio output channel of the terminal, where the audio output channel of the terminal is used for the terminal to play audio, for example, an output channel of a speaker in a mobile phone, and the reference signal is specifically an audio signal that the terminal prepares to play through an audio device in the audio output channel, such as an audio signal before being played by the speaker.

It should be further noted that, in the above optional embodiment, since the second neural network model is obtained by training the sample sound source signal, the sample reference signal, the sample echo detection information and the sample echo signal, the second neural network model can establish a relationship between the sample sound source signal, the sample reference signal, the sample echo detection information and the sample echo signal. Therefore, after the current sound source signal, the reference signal and the echo detection information of the terminal are input to the second neural network model, the corresponding echo signal can be obtained, and the echo signal is an estimated value, so the echo signal is the echo estimation information in the embodiment. Similarly, since the third neural network model can be obtained by training the sample sound source signal, the sample reference signal and the sample echo detection information, the third neural network model can establish a relationship between the sample sound source signal, the sample reference signal and the sample echo detection information. Therefore, after the current sound source signal and the reference signal of the terminal are input into the third neural network model, corresponding echo detection information can be obtained.
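The data flow among the three units described above can be sketched as follows. This is only an illustration of which inputs feed which model: the function name `cancel_echo_frame` and the three callables are hypothetical stand-ins for the trained third, second, and first neural network models.

```python
from typing import Callable, Sequence, Tuple

Features = Sequence[float]


def cancel_echo_frame(
    mic_feat: Features,
    ref_feat: Features,
    detect: Callable[[Features, Features], float],
    estimate: Callable[[Features, Features, float], Features],
    cancel: Callable[[Features, Features], Features],
) -> Tuple[float, Features, Features]:
    """Run one frame through detection -> estimation -> cancellation."""
    # Third model: sound source + reference -> echo detection information.
    echo_prob = detect(mic_feat, ref_feat)
    # Second model: sound source + reference + detection -> echo estimate.
    echo_est = estimate(mic_feat, ref_feat, echo_prob)
    # First model: sound source + echo estimate -> output information.
    output_info = cancel(mic_feat, echo_est)
    return echo_prob, echo_est, output_info
```

Passing the models in as callables keeps the sketch independent of any particular network framework.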

In an optional embodiment, the first neural network model is a Recurrent Neural Network (RNN) model, the second neural network model is an RNN model, and the third neural network model is an RNN model.

It should be further noted that, in the above alternative embodiment, the first neural network model, the second neural network model, and the third neural network model all use RNN models formed by Gated Recurrent Units (GRUs).

In the above optional embodiment, since a recurrent neural network is used to implement the signal processing of each unit, the nonlinear characteristic of the recurrent neural network itself can be used to cancel the nonlinear echo in the echo signal. Furthermore, because the first neural network model adopts a recurrent neural network with a time-sequence memory function, it can realize more complex nonlinear operations than the adaptive filtering method in the related art and can exploit the time-sequence characteristics of voice to cancel the echo more effectively. Similarly, since the second neural network model adopts a recurrent neural network, its time-sequence memory function can adapt to the delay of the echo; meanwhile, owing to the nonlinear characteristic of the recurrent neural network, nonlinear echo can be estimated correctly, so that a more accurate echo estimate is obtained.

Based on this, the echo cancellation system formed by the recurrent neural network can improve the robustness to the echo time delay estimation error, thereby improving the performance of echo cancellation.
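To make the time-sequence memory and nonlinearity of a GRU concrete, a single GRU time step can be sketched in plain Python. This follows the commonly used GRU gate equations and is not taken from the source; the parameter layout (`params` mapping a gate name to `(W, U, b)`) and the state-update convention are assumptions for illustration.

```python
import math


def _sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def _affine(W, x, U, h, b):
    """Compute W @ x + U @ h + b using plain lists."""
    return [
        sum(wi * xi for wi, xi in zip(w_row, x))
        + sum(ui * hi for ui, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(W, U, b)
    ]


def gru_step(x, h, params):
    """One GRU step: returns the new hidden state given input x and state h."""
    Wz, Uz, bz = params["z"]
    Wr, Ur, br = params["r"]
    Wh, Uh, bh = params["h"]
    z = [_sigmoid(v) for v in _affine(Wz, x, Uz, h, bz)]  # update gate
    r = [_sigmoid(v) for v in _affine(Wr, x, Ur, h, br)]  # reset gate
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in _affine(Wh, x, Uh, gated_h, bh)]
    # Convention assumed here: z selects how much of the candidate replaces h.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
```

The hidden state `h` is what carries information from earlier frames forward, which is the property the text relies on for tolerating echo delay.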

It should be further noted that, in the above-mentioned embodiment, the output information is used for indicating information or parameters that can cancel the echo signal in the sound source signal, for example, gain information of the sound source signal. The following further illustrates, by way of an alternative embodiment, a process of the above-mentioned cancellation unit canceling an echo signal in a sound source signal according to the output information:

in an alternative embodiment, the cancellation unit 102 is further configured to,

dividing the sound source signal into a plurality of frequency bands according to a preset frequency band dividing mode, and determining a frequency band gain coefficient corresponding to each frequency band in the sound source signal according to echo estimation information and a first neural network model; wherein, the frequency band gain coefficient is output information;

and performing echo cancellation processing on the sound source signal corresponding to each frequency band according to the frequency band gain coefficient to obtain a sound source signal with the echo signals being cancelled.

It should be further noted that, in the above optional embodiment, the frequency band dividing mode may be Bark frequency bands, that is, the sound source signal is divided into 22 Bark frequency bands, yielding the sound source signal corresponding to each of the 22 Bark frequency bands; the frequency band gain coefficient corresponding to each Bark frequency band of the sound source signal can therefore be determined through the first neural network model. In this optional embodiment, the frequency band gain coefficients corresponding to the 22 Bark frequency bands may be used as the output information in the above embodiment.

It should be further noted that the frequency band dividing mode in the above optional embodiment may also be another frequency band dividing mode, which is not limited in the present invention.

In the optional embodiment, the echo cancellation processing of the sound source signal corresponding to each frequency band according to the frequency band gain coefficient may specifically be performed as follows: each frame of audio in the sound source signal is transformed to the frequency domain through a short-time Fourier transform, each frequency band is multiplied by its corresponding frequency band gain coefficient, and the result is transformed back to the time domain through an inverse short-time Fourier transform, thereby completing the echo cancellation processing of the sound source signal.
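The per-band multiplication in the frequency domain can be sketched as below for a single frame. The forward and inverse short-time Fourier transforms are omitted; `apply_band_gains` and the `band_edges` layout are hypothetical names introduced for illustration, and real Bark band edges depend on the sample rate and transform size.

```python
def apply_band_gains(spectrum, band_edges, band_gains):
    """Scale each frequency bin of one frame by the gain of its band.

    spectrum   : list of complex STFT bins for a single frame
    band_edges : bin indices [e0, e1, ..., eK]; band k covers e_k <= bin < e_(k+1)
    band_gains : K gains in [0, 1], one per band
    """
    if len(band_edges) != len(band_gains) + 1:
        raise ValueError("need exactly one more band edge than band gains")
    out = list(spectrum)
    for k, gain in enumerate(band_gains):
        for i in range(band_edges[k], band_edges[k + 1]):
            out[i] = spectrum[i] * gain
    return out
```

For example, with two bands covering bins 0-1 and 2-3 and gains 0.5 and 0.0, the first band is halved and the second is removed entirely.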

In another alternative embodiment, the cancellation unit 102 is further configured to,

dividing the sound source signal into a plurality of frequency bands according to a preset frequency band dividing mode, and determining a frequency band gain coefficient corresponding to each frequency band in the sound source signal according to echo estimation information and a first neural network model;

determining a frequency point gain coefficient corresponding to each frequency point in each frequency band of the sound source signal according to the frequency band gain coefficient; wherein, the frequency point gain coefficient is output information;

and multiplying the sound source signal corresponding to each frequency point in each frequency band by the gain coefficient corresponding to the frequency point to perform echo cancellation processing so as to obtain the sound source signal with the echo signals cancelled.

It should be further noted that, in the above optional embodiment, the frequency band dividing mode may be Bark frequency bands, that is, the sound source signal is divided into 22 Bark frequency bands, yielding the sound source signal corresponding to each of the 22 Bark frequency bands; the frequency band gain coefficient corresponding to each Bark frequency band of the sound source signal can therefore be determined through the first neural network model. Once the frequency band gain coefficient corresponding to each Bark frequency band of the sound source signal has been determined, the frequency point gain coefficient corresponding to each frequency point of the sound source signal in each Bark frequency band can be further determined; in this optional embodiment, the frequency point gain coefficient corresponding to each frequency point in each of the 22 Bark frequency bands may be used as the output information in the above embodiment.

The frequency point gain coefficient corresponding to each frequency point of the sound source signal in each Bark frequency band can be determined by the following formula:

In the above formula, g_k(m) denotes the gain coefficient of the m-th frequency point of the k-th frequency band, g_k and g_{k+1} denote the frequency band gain coefficients of the k-th and (k+1)-th frequency bands respectively, m is the index of the frequency point within the k-th frequency band, and M denotes the length of the k-th frequency band.
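The formula itself is not reproduced in this text. One form that is consistent with the symbol definitions above is linear interpolation between the gains of adjacent frequency bands; it is given here only as an assumption for illustration and is not confirmed by the source:

```latex
g_k(m) = \left(1 - \frac{m}{M}\right) g_k + \frac{m}{M}\, g_{k+1}, \qquad 0 \le m < M
```

Under this form, the gain equals g_k at the start of the k-th frequency band (m = 0) and approaches g_{k+1} as m approaches M, so the gain varies smoothly across band boundaries.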

In the above optional embodiment, multiplying the sound source signal corresponding to each frequency point in each frequency band by the gain coefficient corresponding to that frequency point may specifically proceed as follows: each frame of audio in the sound source signal is transformed to the frequency domain through a short-time Fourier transform, and each frequency point is multiplied by its corresponding gain coefficient, so that the level of each frequency band changes quickly, attenuating the far-end signal (i.e., the echo signal) in the sound source signal while allowing the near-end signal to pass. The processed sound source signal is then transformed back to the time domain through an inverse short-time Fourier transform, thereby completing the echo cancellation processing of the sound source signal.
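A minimal sketch of expanding band gains into frequency point gains and applying them to one frame is given below. It assumes linear interpolation between neighbouring band gains (an assumption, since the formula is not reproduced in this text) and, for simplicity, bands of equal width, which real Bark bands are not. The function names are hypothetical.

```python
def interpolate_bin_gains(band_gains, band_len):
    """Expand K band gains into K * band_len per-bin gains.

    Bin m (0 <= m < band_len) of band k receives
    (1 - m / band_len) * g_k + (m / band_len) * g_(k+1).
    The last band has no successor, so its own gain is reused as g_(k+1).
    """
    bin_gains = []
    for k, g_k in enumerate(band_gains):
        g_next = band_gains[k + 1] if k + 1 < len(band_gains) else g_k
        for m in range(band_len):
            frac = m / band_len
            bin_gains.append((1.0 - frac) * g_k + frac * g_next)
    return bin_gains


def apply_bin_gains(spectrum, bin_gains):
    """Multiply each frequency bin of one frame by its own gain."""
    return [x * g for x, g in zip(spectrum, bin_gains)]
```

Compared with one constant gain per band, the interpolated gains avoid abrupt steps at band boundaries.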

In this optional embodiment, the frequency point gain coefficient corresponding to each frequency point in each frequency band is determined and used in place of the frequency band gain coefficient of the preceding optional embodiment, so that each frame of audio in the sound source signal is processed in a more targeted way; this can further improve the echo cancellation effect and the fidelity with which the processed sound source signal is restored.

In the two optional embodiments above, whether the frequency band gain or the frequency point gain is adopted as the output information, cancelling the echo signal in the sound source signal by applying gains requires markedly less computation than the filtering processing in the related art. Meanwhile, since the gain coefficients lie between 0 and 1, a sigmoid activation function, whose output also lies between 0 and 1, can be used to compute them; the function model used is simpler than that in the related art, and the accuracy of the gain coefficient calculation is improved relative to the related art. In addition, when the echo signal is cancelled using frequency band gains or frequency point gains, the situation in which only a single tone passes through does not arise, so the musical noise artifacts common in the related art are not produced.

It should be further noted that, in an alternative embodiment, the audio signal after the echo signal is removed may be processed by using a comb filter, for example, harmonic echoes that may exist in the audio signal are removed within one fundamental frequency period.

In an alternative embodiment, the sound source signal includes: a near-end signal, or a near-end signal and a far-end signal; the far-end signal is used for indicating an echo signal in the sound source signal;

the detection unit is also configured to detect whether the sound source signal comprises a far-end signal according to the sound source signal, the reference signal and the third neural network model;

in the case where the sound source signal includes only the near-end signal, the system is further configured to output the sound source signal to an audio output channel of the terminal; or,

in the case that the sound source signal includes a near-end signal and a far-end signal, the detection unit is further configured to obtain echo detection information according to the sound source signal, the reference signal and a third neural network model.

It should be further noted that, in the above optional embodiment, the far-end signal in the sound source signal is the echo signal that may exist in the sound source signal. Therefore, when the detection unit detects that the sound source signal includes only the near-end signal, that is, the sound source signal contains no echo signal that needs to be cancelled, no echo cancellation processing is required; in this case the sound source signal can be output to the audio output channel of the terminal, that is, the estimation unit and the cancellation unit that follow the detection unit do not process the sound source signal. When the detection unit detects that the sound source signal includes the far-end signal, the far-end signal is cancelled through the operation of the estimation unit and the cancellation unit described in the foregoing optional embodiment.

It should be further noted that the detection unit detects whether the sound source signal includes the far-end signal based on the third neural network model. According to the aforementioned optional embodiment, after the current sound source signal and the reference signal of the terminal are input to the third neural network model, the corresponding echo detection information is obtained. The probability, indicated by the echo detection information, that an echo signal exists in the sound source signal may then be compared with a preset threshold; if the probability corresponding to the far-end signal is smaller than the preset threshold, the far-end signal is considered absent.
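The threshold comparison can be sketched as a small routing function. The name `route_frame`, the return labels, and the default threshold of 0.5 are assumptions for illustration; the source only states that a preset threshold is used.

```python
def route_frame(echo_prob, threshold=0.5):
    """Decide whether a frame needs echo cancellation.

    Returns "bypass" when the echo probability is below the threshold
    (only the near-end signal is present) and "cancel" otherwise.
    """
    if not 0.0 <= echo_prob <= 1.0:
        raise ValueError("echo_prob must be a probability in [0, 1]")
    return "bypass" if echo_prob < threshold else "cancel"
```

Frames routed to "bypass" skip the estimation and cancellation units, which saves computation when no far-end signal is playing.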

In an alternative embodiment, the sample echo signal is derived from a sample reference signal and a predetermined room impulse response.

It should be further noted that the sample echo signal used in the training process of the first neural network model and the second neural network model may be obtained from a sample reference signal obtained by sampling in an audio output channel of the terminal and a preset room impulse response, specifically, the sample reference signal may be convolved with the room impulse response to obtain the sample echo signal, or the sample echo signal may be obtained by multiplying the sample reference signal and the room impulse response in a frequency domain.
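The two equivalent ways of producing a sample echo signal, time-domain convolution and frequency-domain multiplication, can be sketched as follows. A naive discrete Fourier transform is used to keep the example dependency-free; the function names are hypothetical, and a practical implementation would use an FFT.

```python
import cmath


def convolve(signal, rir):
    """Full linear convolution of a reference signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def _dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    res = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
    return [v / n for v in res] if inverse else res


def convolve_via_dft(signal, rir):
    """Same result via zero-padding, multiplying the spectra, and inverting."""
    n = len(signal) + len(rir) - 1
    a = _dft(list(signal) + [0.0] * (n - len(signal)))
    b = _dft(list(rir) + [0.0] * (n - len(rir)))
    return [v.real for v in _dft([p * q for p, q in zip(a, b)], inverse=True)]
```

Zero-padding both inputs to the full output length is what makes the frequency-domain product match linear rather than circular convolution.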

In an alternative embodiment, the room impulse response may be generated by a room impulse response generating unit, fig. 5 is a schematic structural diagram of the room impulse response generating unit provided according to an embodiment of the present invention, the structure of the room impulse response generating unit is as shown in fig. 5, and the room impulse response generating unit shown in fig. 5 is composed of a plurality of filters, specifically, a linear filter and a nonlinear filter connected in series. It should be further noted that the above-mentioned room impulse response generating unit is only an alternative, and any unit capable of simulating the impulse response of the room in the art may constitute the room impulse response generating unit in the embodiment of the present invention.

As shown in Fig. 5, the linear filter is implemented as a Finite Impulse Response (FIR) filter, which can simulate the impulse response of a room. The maximum delay of the impulse response can be set according to the application scenario, or a general range can be set; the tap coefficients of the impulse response may be attenuated with the square of time according to the magnitude of the delay (tap coefficients are proportional to 1/(c²t²), where c denotes the speed of sound and t denotes the magnitude of the delay). The nonlinear filter is implemented as an Infinite Impulse Response (IIR) filter, which can simulate nonlinear factors introduced by a real environment.
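A minimal sketch of the two filter stages is given below. The value used for the speed of sound, the initial delay `first_delay_s`, the normalization, and the one-pole form of the IIR stage are all assumptions for illustration; the source specifies only the 1/(c²t²) decay of the FIR taps and that an IIR filter follows in series.

```python
SPEED_OF_SOUND = 343.0  # metres per second; assumed value for c


def fir_room_taps(num_taps, sample_rate, first_delay_s=0.001):
    """FIR taps decaying as 1 / (c^2 t^2), normalized so the first tap is 1.

    After normalization only the 1 / t^2 shape remains; c is kept to mirror
    the expression in the text.
    """
    raw = []
    for n in range(num_taps):
        t = first_delay_s + n / sample_rate  # delay of tap n in seconds
        raw.append(1.0 / (SPEED_OF_SOUND ** 2 * t ** 2))
    return [v / raw[0] for v in raw]


def one_pole_iir(signal, feedback=0.3):
    """y[n] = x[n] + feedback * y[n-1]; a stand-in for the IIR stage."""
    out, prev = [], 0.0
    for x in signal:
        prev = x + feedback * prev
        out.append(prev)
    return out
```

Starting the delay at a small positive value avoids dividing by zero at the first tap.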

In an alternative embodiment, the sample sound source signal comprises: a sample far-end signal and a sample near-end signal;

the sample far-end signal is obtained from a sample reference signal and a room impulse response, and the sample near-end signal is obtained from a clean audio signal and a noise signal.

It should be further noted that the sample sound source signals adopted in the training process of the first, second and third neural network models are all composed of a sample far-end signal and a sample near-end signal. The sample far-end signal may be obtained from a sample reference signal sampled in an audio output channel of the terminal and a preset room impulse response; as with the sample echo signal above, the sample reference signal may be convolved with the room impulse response to obtain the sample far-end signal, or the sample reference signal may be multiplied by the room impulse response in the frequency domain to obtain the sample far-end signal. The sample near-end signal may be generated by superimposing different types of noise signals on the clean audio signal.
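Constructing a training sample along these lines can be sketched as follows. Scaling the noise to a target signal-to-noise ratio in decibels and summing the far-end and near-end signals sample by sample are assumptions made for this illustration; the source only states that noise is superimposed and that the sample sound source signal is composed of the two parts.

```python
import math


def _power(x):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(clean, noise, snr_db):
    """Superimpose `noise` on `clean`, scaled to the requested SNR in dB.

    Assumes `noise` is not silent (non-zero power).
    """
    scale = math.sqrt(_power(clean) / (_power(noise) * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]


def make_sample_source(far_end, near_end):
    """Sample sound source signal = sample far-end signal + sample near-end signal."""
    return [f + n for f, n in zip(far_end, near_end)]
```

Varying the noise type and the SNR across samples is one way to expose the models to a range of near-end conditions during training.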

the input processing unit is configured to acquire a sound source signal and a reference signal, determine a sound source characteristic according to the sound source signal, and determine a reference characteristic according to the reference signal;

the eliminating unit is also configured to obtain output information according to the sound source characteristics, the echo estimation information and the first neural network model;

the estimation unit is also configured to obtain echo estimation information according to the sound source characteristics, the reference characteristics, the echo detection information and the second neural network model;

the detection unit is further configured to obtain echo detection information according to the sound source characteristics, the reference characteristics and the third neural network model.

It should be further noted that, in the present embodiment, the elimination unit, the estimation unit, and the detection unit all take features of the audio signals, rather than the signals themselves, as their actual input. Specifically, the input processing unit may extract features of the sound source signal and the reference signal to serve as the sound source characteristic and the reference characteristic for the subsequent echo cancellation processing. The input processing unit in the above alternative embodiment may be a virtual unit, i.e. integrated in the processor of the terminal. The following describes, by way of an alternative embodiment, how the input processing unit extracts features:

In an optional embodiment, the sound source characteristics at least include a sound source frequency domain characteristic and a sound source pitch characteristic; the reference features at least comprise reference frequency domain features and reference pitch features; the input processing unit is further configured to perform the following:

acquiring a sound source signal and a reference signal, and respectively carrying out framing and windowing on the sound source signal and the reference signal;

transforming the processed sound source signal to the frequency domain to extract sound source frequency domain characteristics, and performing pitch analysis on the processed sound source signal to determine sound source pitch characteristics;

transforming the processed reference signal to the frequency domain to extract reference frequency domain features, and performing pitch analysis on the processed reference signal to determine reference pitch features.

It should be further noted that, in the above alternative embodiment, the windowing of the frames of the sound source signal and the reference signal can effectively eliminate the spectrum discontinuity at the frame boundary; the above-mentioned transformation of the processed sound source signal or reference signal into the frequency domain may be realized by Short-time Fourier transform (STFT).
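A minimal sketch of framing, windowing, and the short-time Fourier transform is given below. The frame length, hop size, Hann window, and 440 Hz test tone are assumptions chosen for illustration; the text does not fix any of them.

```python
import numpy as np


def stft_frames(signal, frame_len=320, hop=160):
    """Split a signal into overlapping frames, window each, and take the FFT.

    Returns a complex array of shape (num_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)  # tapers frame edges toward zero
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(num_frames)]
    )
    return np.fft.rfft(frames * window, axis=1)


sample_rate = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)
spectrum = stft_frames(tone)
```

With one second of audio at 16 kHz, a 20 ms frame, and a 10 ms hop, this yields 99 frames of 161 frequency bins each.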

In an alternative embodiment, the sound source frequency domain characteristics include at least: a plurality of Bark-frequency cepstral coefficient (BFCC) frequency domain characteristics of the sound source signal, first-order difference information of the plurality of BFCC frequency domain characteristics of the sound source signal, and second-order difference information of the plurality of BFCC frequency domain characteristics of the sound source signal;

the sound source pitch characteristics include at least: Discrete Cosine Transform (DCT) information of a plurality of operation coefficients corresponding to the pitch of the sound source signal, the pitch period dynamic characteristic of the sound source signal, and the pitch spectrum dynamic characteristic of the sound source signal;

the reference frequency domain features include at least: a plurality of BFCC frequency domain characteristics of the reference signal, first-order difference information of the plurality of BFCC frequency domain characteristics of the reference signal, and second-order difference information of the plurality of BFCC frequency domain characteristics of the reference signal;

the reference pitch characteristic includes at least: DCT information of a plurality of operation coefficients corresponding to the pitch of the reference signal, pitch period dynamic characteristics of the reference signal, and pitch spectrum dynamic characteristics of the reference signal.

It should be further noted that the BFCC frequency domain characteristics describe 22 Bark frequency bands, so in the above-mentioned alternative embodiment, the number of BFCC frequency domain characteristics of the sound source signal may be 22; the first-order difference information of the plurality of BFCC frequency domain characteristics of the sound source signal may adopt the first-order difference of the first 6 of the 22 BFCC frequency domain characteristics of the sound source signal, and the second-order difference information may adopt the second-order difference of the same first 6 BFCC frequency domain characteristics. Meanwhile, the Discrete Cosine Transform (DCT) information of a plurality of operation coefficients corresponding to the pitch of the sound source signal may adopt the DCT information of the first 6 pitch-related operation coefficients.

Similarly, the number of BFCC frequency domain characteristics of the reference signal may be 22; the first-order difference information of the plurality of BFCC frequency domain features of the reference signal may adopt the first-order difference of the first 6 of the 22 BFCC frequency domain features of the reference signal, and the second-order difference information may adopt the second-order difference of the same first 6 BFCC frequency domain features. Meanwhile, the Discrete Cosine Transform (DCT) information of a plurality of operation coefficients corresponding to the pitch of the reference signal may adopt the DCT information of the first 6 pitch-related operation coefficients.
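The frequency domain part of the feature vector described above (22 BFCC values plus first-order and second-order differences of the first 6) can be sketched as follows. The BFCC values are assumed to be computed elsewhere, and the use of a simple backward difference between consecutive frames is an assumption; the text does not state how the differences are taken.

```python
import numpy as np

NUM_BANDS = 22  # BFCC characteristics per frame, per the text
NUM_DELTA = 6   # only the first 6 coefficients receive difference features


def bfcc_with_deltas(bfcc):
    """Append first- and second-order differences of the first 6 BFCCs.

    bfcc: array of shape (num_frames, 22).
    Returns an array of shape (num_frames, 22 + 6 + 6).
    """
    head = bfcc[:, :NUM_DELTA]
    # Backward difference along time; the first frame gets a zero difference.
    d1 = np.diff(head, n=1, axis=0, prepend=head[:1])
    d2 = np.diff(d1, n=1, axis=0, prepend=d1[:1])
    return np.concatenate([bfcc, d1, d2], axis=1)


# Hypothetical input: every coefficient rises by 1 per frame.
ramp = np.tile(np.arange(10.0)[:, None], (1, NUM_BANDS))
features = bfcc_with_deltas(ramp)
```

The pitch-related features (DCT information and the pitch dynamics) would be concatenated to this vector in the same way.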

It should be further noted that, in the above alternative embodiment, using features of the sound source signal and the reference signal as the input for echo cancellation avoids the large number of neurons, and the correspondingly large number of outputs, that the neural network models would otherwise require. Compared with the related art, in which signal samples or the signal spectrum are used directly as the input for echo cancellation, the computational load of the system can thus be further reduced.

the output processing unit is configured to acquire a first output audio signal, filter the first output audio signal according to the tone characteristic of a sound source, and convert the filtered first output audio signal into a time domain to obtain a second output audio signal; wherein the first output audio signal is indicative of a sound source signal from which the echo signal is cancelled;

the output processing unit is further configured to output the second output audio signal to an audio output channel of the terminal.

Fig. 6 is a flowchart of an echo cancellation system according to an embodiment of the present invention, where the workflow of the input processing unit, the detection unit, the estimation unit, the cancellation unit, and the output processing unit in the echo cancellation system according to the embodiment is as shown in fig. 6.

It should be further noted that the output processing unit in the above-mentioned alternative embodiment may be a virtual unit, i.e. integrated in the processor of the terminal. The output processing unit filters the first output audio signal according to the tone characteristic of the sound source, so that the tone characteristic of the near-end signal in the sound source signal can be kept, and the integrity of the audio can be better kept.

Example 2

Fig. 7 is a flowchart of an echo cancellation method according to an embodiment of the present invention, and as shown in fig. 7, the echo cancellation method in this embodiment includes:

S202, estimating an echo signal in a sound source signal according to the reference signal and the echo detection information to obtain echo estimation information; wherein the sound source signal is an audio signal received by an audio input channel of the terminal, the reference signal is an audio signal in an audio output channel of the terminal, and the echo detection information is used for indicating the probability that an echo signal exists in the sound source signal;

S204, obtaining output information according to the sound source signal, the echo estimation information and a preset first neural network model, and eliminating the echo signal in the sound source signal according to the output information; the first neural network model is obtained by training according to the sample sound source signal, the sample echo signal and the sample output information.

It should be further noted that other optional embodiments and technical effects of the echo cancellation method in this embodiment correspond to those of the echo cancellation system in embodiment 1, and therefore are not described herein again.

In an optional embodiment, in the step S202, estimating an echo signal in the sound source signal according to the reference signal and the echo detection information to obtain echo estimation information, includes:

obtaining echo estimation information according to the sound source signal, the reference signal, the echo detection information and the second neural network model;

In an optional embodiment, before obtaining the echo estimation information according to the sound source signal, the reference signal, the echo detection information, and the second neural network model in step S202, the method further includes:

obtaining echo detection information according to the sound source signal, the reference signal and a preset third neural network model;

In an optional embodiment, in step S204, obtaining output information according to the sound source signal, the echo estimation information, and a preset first neural network model, includes:

and carrying out echo cancellation processing on the sound source signal corresponding to each frequency point in each frequency band according to the frequency point gain coefficient so as to obtain the sound source signal with the echo signals being cancelled.
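Applying gain coefficients to the frequency points of each band can be sketched as below. The band edges and gain values are hypothetical, and assigning each frequency point the gain of its band without interpolation is an assumption made for simplicity.

```python
import numpy as np


def apply_band_gains(spectrum, band_edges, band_gains):
    """Multiply each frequency point by the gain of the band it falls in.

    band_edges: increasing bin indices; band k covers
    bins band_edges[k] up to (not including) band_edges[k + 1].
    """
    bin_gains = np.ones(len(spectrum))
    for k, gain in enumerate(band_gains):
        bin_gains[band_edges[k] : band_edges[k + 1]] = gain
    return spectrum * bin_gains


spectrum = np.ones(8, dtype=complex)  # stand-in spectrum of one frame
edges = [0, 2, 5, 8]                  # hypothetical division into 3 bands
gains = [1.0, 0.5, 0.0]               # hypothetical model output
cleaned = apply_band_gains(spectrum, edges, gains)
```

A gain near 1 leaves a band untouched, while a gain near 0 suppresses a band dominated by echo.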

before obtaining the echo detection information according to the sound source signal, the reference signal, and the preset third neural network model, the method further includes:

detecting whether the sound source signal comprises a far-end signal or not according to the sound source signal, the reference signal and the third neural network model;

outputting the sound source signal to an audio output channel of the terminal in a case where the sound source signal includes only the near-end signal; or,

and under the condition that the sound source signal comprises a near-end signal and a far-end signal, obtaining echo detection information according to the sound source signal, the reference signal and the third neural network model.

In an optional embodiment, the echo cancellation method in the embodiment of the present invention further includes:

acquiring a sample sound source signal; the sample sound source signal is obtained by superposing a sample far-end signal and a sample near-end signal, and the sample far-end signal is used for indicating a sample echo signal;

determining a gain coefficient of a sample sound source signal, and taking the gain coefficient as sample output information;

and establishing a first neural network model according to the relation between the sample sound source signal and the sample output information.

In an alternative embodiment, the acquiring the sample sound source signal includes:

acquiring a sample reference signal, and obtaining a sample far-end signal according to the sample reference signal and a preset room impulse response;

acquiring a sample pure audio signal and a preset sample noise signal, and overlapping the sample pure audio signal and the sample noise signal to obtain a sample near-end signal;

and superposing the sample far-end signal and the sample near-end signal to obtain a sample sound source signal.

In an alternative embodiment, the room impulse response is generated by a preset room impulse response generating unit, wherein the room impulse response generating unit is formed by serially connecting a linear filter and a nonlinear filter.

In an alternative embodiment, the determining the gain factor of the sample sound source signal includes:

acquiring first frequency band energy of a sample pure audio signal, and acquiring second frequency band energy of a sample sound source signal;

and determining the gain coefficient of the sample sound source signal according to the energy of the first frequency band and the energy of the second frequency band.

It should be further noted that the process of training the first neural network model in the above optional embodiment is described below in the training method of the first neural network model in embodiment 3, and therefore is not described further here.

acquiring a sample reference signal and a sample sound source signal; the sample sound source signal is obtained by superposing a sample far-end signal and a sample near-end signal, and the sample far-end signal is used for indicating a sample echo signal;

determining a first label and a second label; wherein the first label is used for indicating the probability of the audio existing in the sample reference signal, and the second label is used for indicating the probability of the audio existing in at least part of the sample sound source signal;

and establishing a third neural network model according to the relation between the sample reference signal and the first label and the relation between the sample sound source signal and the second label.

In an optional embodiment, the determining the first tag and the second tag includes:

performing framing processing on the sample reference signal, and determining reference audio energy corresponding to each frame of audio in the sample reference signal;

determining the probability of each frame of audio in the sample reference signal according to the relation between the reference audio energy and a preset threshold value, and identifying the probability of each frame of audio in the sample reference signal as a first label;

performing frame processing on the sample pure audio signal, and determining sound source audio energy corresponding to each frame of audio in the sample pure audio signal;

and determining the probability of each frame of audio in the sample pure audio signal according to the relation between the sound source audio energy and the preset threshold value, and setting the probability of each frame of audio in the sample pure audio signal as a second label.

It should be further noted that the process of training the third neural network model in the above optional embodiment is described below in the training method of the third neural network model in embodiment 4, and therefore is not described further here.

acquiring a sound source signal and a reference signal, determining a sound source characteristic according to the sound source signal, and determining a reference characteristic according to the reference signal;

obtaining echo detection information according to the sound source characteristics, the reference characteristics and the third neural network model;

obtaining echo estimation information according to the sound source characteristics, the reference characteristics, the echo detection information and the second neural network model;

and obtaining output information according to the sound source characteristics, the echo estimation information and the first neural network model.
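The order in which the three models are applied can be summarised in a short sketch. The three callables are hypothetical placeholders for the third, second, and first neural network models respectively, and the lambda stubs below are arbitrary arithmetic used only to make the example runnable; they do not reflect how the trained models behave.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def cancel_echo(
    source_features: Features,
    reference_features: Features,
    detect: Callable[[Features, Features], float],
    estimate: Callable[[Features, Features, float], Features],
    cancel: Callable[[Features, Features], Features],
) -> Features:
    """Run detection, then estimation, then cancellation, in that order."""
    # Third model: probability that an echo is present.
    echo_probability = detect(source_features, reference_features)
    # Second model: echo estimate, conditioned on the detection result.
    echo_estimate = estimate(source_features, reference_features, echo_probability)
    # First model: output information (for example, gain coefficients).
    return cancel(source_features, echo_estimate)


# Stub models, purely illustrative.
output = cancel_echo(
    [1.0, 2.0],
    [0.5, 0.5],
    detect=lambda s, r: 0.9,
    estimate=lambda s, r, p: [p * x for x in r],
    cancel=lambda s, e: [max(0.0, 1.0 - ei / si) for si, ei in zip(s, e)],
)
```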

In an optional embodiment, the sound source characteristics at least include a sound source frequency domain characteristic and a sound source pitch characteristic; the reference features at least comprise reference frequency domain features and reference tone features;

acquiring a sound source signal and a reference signal, determining a sound source characteristic according to the sound source signal, and determining the reference characteristic according to the reference signal, further comprising:

In an optional embodiment, the sound source frequency domain characteristics at least include: a plurality of Bark-frequency cepstral coefficient (BFCC) frequency domain characteristics of the sound source signal, first-order difference information of the plurality of BFCC frequency domain characteristics of the sound source signal, and second-order difference information of the plurality of BFCC frequency domain characteristics of the sound source signal;

In an optional embodiment, after obtaining the output information according to the sound source signal, the echo estimation information, and a preset first neural network model, the method further includes:

acquiring a first output audio signal, filtering the first output audio signal according to the tone characteristic of a sound source, and converting the filtered first output audio signal into a time domain to obtain a second output audio signal; wherein the first output audio signal is indicative of a sound source signal from which the echo signal is cancelled;

and outputting the second output audio signal to an audio output channel of the terminal.

In an alternative embodiment, the first neural network model is a Recurrent Neural Network (RNN) model, the second neural network model is an RNN model, and the third neural network model is an RNN model.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Example 3

This embodiment provides a training method of a neural network model, which is used to implement the training of the first neural network model described in embodiment 2, fig. 8 is a flowchart of the training method of the neural network model provided according to the embodiment of the present invention, and as shown in fig. 8, the training method of the neural network model in this embodiment includes:

S302, acquiring a sample sound source signal; the sample sound source signal is obtained by superposing a sample far-end signal and a sample near-end signal, and the sample far-end signal is used for indicating a sample echo signal;

S304, determining a gain coefficient of the sample sound source signal, and taking the gain coefficient as sample output information;

S306, establishing a first neural network model according to the relation between the sample sound source signal and the sample output information.

It should be further noted that the sample sound source signal, the sample far-end signal, the sample near-end signal, and the sample output information in this embodiment respectively correspond to the sound source signal, the far-end signal, the near-end signal, and the output information described in embodiment 1, that is, the sample sound source signal, the sample far-end signal, the sample near-end signal, and the sample output information are a plurality of samples of the sound source signal, the far-end signal, the near-end signal, and the output information, respectively.

In an alternative embodiment, in step S302, acquiring a sample sound source signal includes:

It should be further noted that, in the above optional embodiment, the obtaining of the sample far-end signal according to the sample reference signal and the preset room impulse response may specifically be to convolve the sample reference signal with the room impulse response to obtain the sample far-end signal, or to multiply the sample reference signal with the room impulse response in the frequency domain to obtain the sample far-end signal.

In an alternative embodiment, the room impulse response is generated by a preset room impulse response generating unit, wherein the room impulse response generating unit is composed of a linear filter and a nonlinear filter connected in series with each other.

It should be further noted that the room impulse response generating unit in the above-mentioned optional embodiment corresponds to the room impulse response generating unit in embodiment 1, and therefore, the description thereof is omitted here.

In an alternative embodiment, in step S304, the determining the gain factor of the sample sound source signal includes:

It should be further noted that, in the above alternative embodiment, the energy of the first frequency band of the sample clean audio signal is denoted E_s,k, and the energy of the second frequency band of the sample sound source signal is denoted E_m,k. The gain coefficient of the sample sound source signal should then satisfy the following formula:

The gain coefficient serves as the label of the sample sound source signal in the training process of the first neural network model, namely the sample output information.
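The formula itself is not reproduced in the text above. As a hedged illustration only, the sketch below assumes a form commonly used for band-gain training targets, the square root of the ratio of clean band energy to sound source band energy, clipped to the range 0 to 1. This assumed form, the band edges, and the `eps` guard are not taken from the source.

```python
import numpy as np


def band_energies(spectrum, band_edges):
    """Sum of squared magnitudes of the frequency points in each band."""
    power = np.abs(spectrum) ** 2
    return np.array(
        [power[a:b].sum() for a, b in zip(band_edges[:-1], band_edges[1:])]
    )


def band_gain_labels(clean_spectrum, source_spectrum, band_edges, eps=1e-12):
    """ASSUMED label: sqrt(E_s,k / E_m,k), clipped to [0, 1]."""
    e_s = band_energies(clean_spectrum, band_edges)   # first band energy
    e_m = band_energies(source_spectrum, band_edges)  # second band energy
    return np.clip(np.sqrt(e_s / (e_m + eps)), 0.0, 1.0)


clean = np.ones(8, dtype=complex)        # stand-in clean spectrum
source = 2.0 * np.ones(8, dtype=complex)  # stand-in sound source spectrum
labels = band_gain_labels(clean, source, band_edges=[0, 4, 8])
```

Under this assumption, a band in which the sound source signal has four times the energy of the clean audio receives a label of 0.5.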

It should be further noted that, in step S306, the sample sound source signal is input in the form of features; that is, before the first neural network model is trained according to the sample sound source signal and the sample output information, the features of the sample sound source signal need to be extracted. These features and the manner of extracting them correspond to the sound source characteristics and the extraction manner for the sound source signal described in embodiment 1, and therefore the details are not described herein again. Fig. 9 is a training schematic diagram of a training method of a neural network model according to an embodiment of the present invention, and the training process indicated by the training method of the neural network model is as shown in fig. 9.

Example 4

The present embodiment provides a training method of a neural network model, which is used to implement the training of the third neural network model described in embodiment 2, fig. 10 is a flowchart of the training method of the neural network model provided according to the embodiment of the present invention, and as shown in fig. 10, the training method of the neural network model in the present embodiment includes:

S402, acquiring a sample reference signal and a sample sound source signal; the sample sound source signal is obtained by superposing a sample far-end signal and a sample near-end signal, and the sample far-end signal is used for indicating a sample echo signal;

S404, determining a first label and a second label; wherein the first label is used for indicating the probability of the audio existing in the sample reference signal, and the second label is used for indicating the probability of the audio existing in at least part of the sample sound source signal;

S406, establishing a third neural network model according to the relation between the sample reference signal and the first label and the relation between the sample sound source signal and the second label.

It should be further noted that the sample reference signal, the sample sound source signal, the sample far-end signal, and the sample near-end signal in this embodiment correspond to the reference signal, the sound source signal, the far-end signal, and the near-end signal described in embodiment 1, respectively; that is, the sample reference signal, the sample sound source signal, the sample far-end signal, and the sample near-end signal are a plurality of samples of the reference signal, the sound source signal, the far-end signal, and the near-end signal, respectively.

In an alternative embodiment, in step S402, the acquiring a sample sound source signal includes:

In an optional embodiment, the determining the first tag and the second tag in step S404 includes:

It should be further noted that, in the above alternative embodiment, the audio energy of each frame of the audio signal in the sample reference signal or the sample clean audio signal is compared with the corresponding threshold, so as to determine the probability of the presence of audio in the sample reference signal or the sample clean audio signal, that is, the first label and the second label. Specifically, three threshold values may be set as preset threshold values. Taking the sample reference signal as an example, the energy value of each frame of the audio signal in the sample reference signal is compared with the threshold values: if the energy value is greater than threshold 2, the frame is labeled 1; if the energy value is greater than threshold 1 and less than threshold 2, the frame is labeled 0.5; and if the energy value is less than threshold 1, the frame is labeled 0. The values 0, 0.5, and 1 can be used as labels of the frames of the sample reference signal to indicate the probability that audio exists in each frame. The labels for the sample clean audio signal may be obtained in the same manner, and the details are not repeated herein.
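The threshold-based labeling can be sketched as follows. The frame length, threshold values, and test signal are hypothetical, and treating an energy exactly equal to a threshold as the middle label 0.5 is an assumption, since the text does not cover that case.

```python
import numpy as np


def frame_labels(signal, frame_len, threshold_1, threshold_2):
    """Label each frame 0, 0.5, or 1 according to its energy.

    energy < threshold_1 -> 0
    energy > threshold_2 -> 1
    otherwise            -> 0.5 (including equality, by assumption)
    """
    num_frames = len(signal) // frame_len
    frames = np.reshape(signal[: num_frames * frame_len], (num_frames, frame_len))
    energy = np.sum(frames ** 2, axis=1)
    labels = np.full(num_frames, 0.5)
    labels[energy < threshold_1] = 0.0
    labels[energy > threshold_2] = 1.0
    return labels


# Three 4-sample frames with energies 0, 1, and 16.
signal = np.concatenate([np.zeros(4), 0.5 * np.ones(4), 2.0 * np.ones(4)])
labels = frame_labels(signal, frame_len=4, threshold_1=0.5, threshold_2=4.0)
```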

On the basis of determining the probability of each frame of audio signal in the sample reference signal or the sample clean audio signal, the probability calculation for the sample reference signal or the sample clean audio signal can be completed by adopting a sigmoid activation function, thereby obtaining the corresponding first label and second label. The expression of the sigmoid activation function is as follows:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The above-mentioned process for performing probability calculation based on the sigmoid activation function is known to those skilled in the art, and therefore will not be described herein.

Fig. 11 is a training schematic diagram of a training method of a neural network model according to an embodiment of the present invention, and a training process indicated by the training method of the neural network model is shown in fig. 11.

Example 5

The echo cancellation device provided in this embodiment is used to implement the foregoing embodiments and preferred embodiments, and details are not repeated for what has been described. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.

Fig. 12 is a block diagram of an echo cancellation device according to an embodiment of the present invention, and as shown in fig. 12, the echo cancellation device in this embodiment includes:

an estimating module 502, configured to estimate an echo signal in a sound source signal according to a reference signal and echo detection information to obtain echo estimation information; wherein the sound source signal is an audio signal received by an audio input channel of the terminal, the reference signal is an audio signal in an audio output channel of the terminal, and the echo detection information is used for indicating the probability that an echo signal exists in the sound source signal;

a cancellation module 504, configured to obtain output information according to the sound source signal, the echo estimation information, and a preset first neural network model, and cancel an echo signal in the sound source signal according to the output information; the first neural network model is obtained by training according to the sample sound source signal, the sample echo signal and the sample output information.

It should be further noted that other optional embodiments and technical effects of the echo cancellation device in this embodiment correspond to those of the echo cancellation method in embodiment 2, and therefore are not described herein again.

It should be noted that the above modules may be implemented by software or hardware. In the latter case, they may be implemented in, but are not limited to, the following ways: the modules are all located in the same processor; alternatively, the modules are located in different processors in any combination.

Example 6

Embodiments of the present invention also provide a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above-mentioned method embodiments when executed.

Optionally, in the present embodiment, the computer-readable storage medium may be configured to store a computer program for executing the steps in the above-described embodiments.

Optionally, in this embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Example 7

Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

Optionally, in this embodiment, the processor may be configured to execute the steps in the above embodiments through a computer program.

Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An echo cancellation method, comprising:

2. The method of claim 1, wherein estimating the echo signal in the sound source signal according to the reference signal and the echo detection information to obtain echo estimation information comprises:

obtaining the echo estimation information according to the sound source signal, the reference signal, the echo detection information and the second neural network model;

wherein the second neural network model is obtained by training according to a sample sound source signal, a sample reference signal, sample echo detection information and a sample echo signal.

3. The method of claim 2, wherein, before obtaining the echo estimation information according to the sound source signal, the reference signal, the echo detection information and the second neural network model, the method further comprises:

obtaining the echo detection information according to the sound source signal, the reference signal and a preset third neural network model;

4. The method of claim 1, wherein the obtaining output information according to the sound source signal, the echo estimation information and a preset first neural network model comprises:

dividing the sound source signal into a plurality of frequency bands according to a preset frequency band dividing mode, and determining a frequency band gain coefficient corresponding to each frequency band in the sound source signal according to the echo estimation information and the first neural network model; wherein the frequency band gain coefficient is the output information;

and multiplying the sound source signal corresponding to each frequency point in each frequency band by the gain coefficient corresponding to the frequency point to perform echo cancellation processing so as to obtain the sound source signal with the echo signal cancelled.

5. The method of claim 1, wherein the obtaining output information according to the sound source signal, the echo estimation information and a preset first neural network model comprises:

dividing the sound source signal into a plurality of frequency bands according to a preset frequency band dividing mode, and determining a frequency band gain coefficient corresponding to each frequency band in the sound source signal according to the echo estimation information and the first neural network model;

determining a frequency point gain coefficient corresponding to each frequency point in each frequency band of the sound source signal according to the frequency band gain coefficient; wherein, the frequency point gain coefficient is the output information;

and carrying out echo cancellation processing on the sound source signal corresponding to each frequency point in each frequency band according to the frequency point gain coefficient so as to obtain the sound source signal with the echo signal eliminated.
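The band-to-bin step in claims 4 and 5 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names `band_to_bin_gains` and `apply_gains` are hypothetical, the band layout is passed in as bin indices, and linear interpolation between band centres is an assumed way of deriving frequency point gains from frequency band gains (the claims do not specify one).

```python
def band_to_bin_gains(band_edges, band_gains, interpolate=True):
    """Expand one gain per band into one gain per frequency bin.

    band_edges: ascending bin indices starting at 0, length = bands + 1.
    band_gains: one gain per band (e.g. produced by the first model).
    """
    if not interpolate:
        # Claim 4 style: every bin in a band shares the band's gain.
        gains = []
        for b, g in enumerate(band_gains):
            gains.extend([g] * (band_edges[b + 1] - band_edges[b]))
        return gains

    # Claim 5 style (assumed): interpolate linearly between band centres.
    centres = [(band_edges[b] + band_edges[b + 1] - 1) / 2
               for b in range(len(band_gains))]
    gains = []
    for k in range(band_edges[-1]):
        if k <= centres[0]:
            gains.append(band_gains[0])
        elif k >= centres[-1]:
            gains.append(band_gains[-1])
        else:
            j = 0
            while centres[j + 1] < k:
                j += 1
            t = (k - centres[j]) / (centres[j + 1] - centres[j])
            gains.append((1 - t) * band_gains[j] + t * band_gains[j + 1])
    return gains


def apply_gains(spectrum, bin_gains):
    """Scale each complex frequency bin by its gain to suppress echo."""
    return [g * x for g, x in zip(bin_gains, spectrum)]
```

Predicting one gain per band rather than per bin keeps the model output small; the interpolation only smooths the transitions between neighbouring bands.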

6. The method of claim 3, wherein the sound source signal comprises: a near-end signal, or a near-end signal and a far-end signal; wherein the far-end signal is used to indicate the echo signal in the sound source signal;

before obtaining the echo detection information according to the sound source signal, the reference signal and a preset third neural network model, the method further includes:

and under the condition that the sound source signal comprises the near-end signal and the far-end signal, obtaining the echo detection information according to the sound source signal, the reference signal and the third neural network model.

7. The method of claim 1, further comprising:

acquiring the sample sound source signal; the sample sound source signal is obtained by superposing a sample far-end signal and a sample near-end signal, wherein the sample far-end signal is used for indicating a sample echo signal;

determining a gain coefficient of the sample sound source signal, and taking the gain coefficient as sample output information;

and establishing the first neural network model according to the relation between the sample sound source signal and the sample output information.

8. The method of claim 7, wherein determining the gain coefficient of the sample sound source signal comprises:

acquiring first frequency band energy of the sample pure audio signal, and acquiring second frequency band energy of the sample sound source signal;

and determining a gain coefficient of the sample sound source signal according to the first frequency band energy and the second frequency band energy.
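One plausible way to turn the two band energies of claim 8 into a training target is sketched below. The square-root energy ratio clipped to 1 is an assumption borrowed from common gain-based suppression practice; the claim only states that the gain depends on both energies. The names `band_energy` and `ideal_band_gains` and the `eps` guard are hypothetical.

```python
import math


def band_energy(spectrum, lo, hi):
    """Sum of squared magnitudes of the bins lo..hi-1."""
    return sum(abs(x) ** 2 for x in spectrum[lo:hi])


def ideal_band_gains(clean_spec, mix_spec, band_edges, eps=1e-12):
    """Per-band target gain from clean and mixed (sound source) spectra.

    Assumed formula: g = min(1, sqrt(E_clean / E_mix)).
    """
    gains = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        e_clean = band_energy(clean_spec, lo, hi)   # first band energy
        e_mix = band_energy(mix_spec, lo, hi)       # second band energy
        g = math.sqrt(e_clean / (e_mix + eps))      # eps avoids division by zero
        gains.append(min(1.0, g))
    return gains
```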

9. The method of claim 3, further comprising:

acquiring the sample reference signal and the sample sound source signal; the sample sound source signal is obtained by superposing a sample far-end signal and a sample near-end signal, wherein the sample far-end signal is used for indicating a sample echo signal;

and establishing the third neural network model according to the relation between the sample reference signal and the first label and the relation between the sample sound source signal and the second label.

10. The method of claim 7 or 9, wherein acquiring the sample sound source signal comprises:

acquiring the sample reference signal, and obtaining the sample far-end signal according to the sample reference signal and a preset room impulse response;

obtaining a sample pure audio signal and a preset sample noise signal, and superimposing the sample pure audio signal and the sample noise signal to obtain the sample near-end signal;

and superimposing the sample far-end signal and the sample near-end signal to obtain the sample sound source signal.
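The sample synthesis in claim 10 can be sketched as below. It assumes that "obtaining the far-end signal according to the reference signal and the room impulse response" means convolution, that all signals have the same length, and that superposition is sample-wise addition; `convolve` and `make_sample` are hypothetical helper names.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def make_sample(reference, rir, clean, noise):
    """Build one training sample from its four ingredients.

    Returns (far_end, near_end, mic), where mic is the sample
    sound source signal fed to the models.
    """
    n = len(reference)
    # Far-end (echo) signal: reference played through the room.
    far_end = convolve(reference, rir)[:n]
    # Near-end signal: clean audio plus noise.
    near_end = [c + v for c, v in zip(clean, noise)]
    # Sound source signal: far-end and near-end superimposed.
    mic = [f + s for f, s in zip(far_end, near_end)]
    return far_end, near_end, mic
```

Because the echo is synthesized, the true echo and the clean signal are both known for every sample, which is what makes supervised labels available.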

11. The method according to claim 10, wherein the room impulse response is generated by a preset room impulse response generating unit, wherein the room impulse response generating unit is composed of a linear filter and a nonlinear filter connected in series with each other.
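A series connection of a nonlinear and a linear filter, as in claim 11, might look like the sketch below. The claim fixes neither the order nor the filter types; here a hard clipper (standing in for loudspeaker saturation) feeding an FIR filter is assumed purely for illustration, and `hard_clip`, `fir` and `echo_path` are hypothetical names.

```python
def hard_clip(x, limit):
    """Memoryless nonlinearity: clamp every sample to [-limit, limit]."""
    return [max(-limit, min(limit, v)) for v in x]


def fir(x, taps):
    """Causal FIR filter; output has the same length as the input."""
    out = []
    for n in range(len(x)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * x[n - k]
        out.append(acc)
    return out


def echo_path(reference, taps, limit=0.5):
    """Nonlinear stage followed by linear stage, connected in series."""
    return fir(hard_clip(reference, limit), taps)
```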

12. The method of claim 9, wherein determining the first label and the second label comprises:

performing frame division processing on the sample pure audio signal, and determining sound source audio energy corresponding to each frame of audio in the sample pure audio signal;

and determining the probability of each frame of audio in the sample pure audio signal according to the relation between the sound source audio energy and a preset threshold value, and setting the probability of each frame of audio in the sample pure audio signal as a second label.
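The energy-threshold labelling in claim 12 can be illustrated as follows. The sketch assumes non-overlapping frames and a hard 0/1 probability (1 when the frame energy exceeds the threshold); the claim only says the probability follows from the relation between energy and threshold. `frame_energies` and `activity_labels` are hypothetical names.

```python
def frame_energies(signal, frame_len):
    """Split a signal into non-overlapping frames and sum squares per frame."""
    n_frames = len(signal) // frame_len
    return [
        sum(v * v for v in signal[i * frame_len:(i + 1) * frame_len])
        for i in range(n_frames)
    ]


def activity_labels(signal, frame_len, threshold):
    """Per-frame label: 1.0 if the frame energy exceeds the threshold."""
    return [1.0 if e > threshold else 0.0
            for e in frame_energies(signal, frame_len)]
```

Applied to the sample pure audio signal this yields one label per frame; the same routine could be applied to another signal to obtain the other label, although the claim text shown here does not say so.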

13. The method of claim 3, further comprising:

acquiring the sound source signal and the reference signal, determining a sound source characteristic according to the sound source signal, and determining a reference characteristic according to the reference signal;

obtaining the echo detection information according to the sound source characteristics, the reference characteristics and the third neural network model;

obtaining the echo estimation information according to the sound source characteristic, the reference characteristic, the echo detection information and the second neural network model;

and obtaining the output information according to the sound source characteristics, the echo estimation information and the first neural network model.

14. The method according to claim 13, wherein the sound source characteristics comprise at least a sound source frequency domain characteristic and a sound source pitch characteristic; the reference features at least comprise reference frequency domain features and reference pitch features;

the acquiring the sound source signal and the reference signal, determining a sound source characteristic according to the sound source signal, and determining a reference characteristic according to the reference signal, further includes:

acquiring the sound source signal and the reference signal, and respectively carrying out frequency division windowing processing on the sound source signal and the reference signal;

transforming the processed sound source signal to a frequency domain to extract the sound source frequency domain feature, and performing pitch analysis on the processed sound source signal to determine the sound source pitch feature;

transforming the processed reference signal to the frequency domain to extract the reference frequency domain features, and performing a pitch analysis on the processed reference signal to determine the reference pitch features.

15. The method according to claim 14, wherein the sound source frequency domain features comprise at least: a plurality of Bark-frequency cepstral coefficient (BFCC) frequency domain characteristics of the sound source signal, first order difference information of a plurality of the BFCC frequency domain characteristics of the sound source signal, second order difference information of a plurality of the BFCC frequency domain characteristics of the sound source signal;

the reference frequency domain features include at least: a plurality of BFCC frequency domain characteristics of the reference signal, first order difference information of a plurality of the BFCC frequency domain characteristics of the reference signal, second order difference information of a plurality of the BFCC frequency domain characteristics of the reference signal;
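The feature vector described in claim 15 can be sketched as below. It assumes that "difference information" is taken across consecutive frames, that the first frame's difference is zero, and that differences are kept only for the first `n_delta` coefficients; `delta` and `stack_features` are hypothetical names, and computing the BFCCs themselves is outside the sketch.

```python
def delta(frames):
    """Frame-to-frame difference; the first frame's difference is zero."""
    out = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out


def stack_features(bfcc_frames, n_delta):
    """Concatenate BFCCs with first- and second-order differences.

    Only the first n_delta coefficients contribute difference terms.
    """
    d1 = delta(bfcc_frames)   # first-order difference
    d2 = delta(d1)            # second-order difference
    return [
        f + a[:n_delta] + b[:n_delta]
        for f, a, b in zip(bfcc_frames, d1, d2)
    ]
```

The same routine would be applied separately to the sound source signal and to the reference signal to obtain the two frequency domain feature sets.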

16. The method according to claim 14, wherein after obtaining the output information according to the sound source signal, the echo estimation information and a preset first neural network model, further comprising:

acquiring a first output audio signal, filtering the first output audio signal according to the sound source pitch feature, and converting the filtered first output audio signal into the time domain to obtain a second output audio signal; wherein the first output audio signal is indicative of the sound source signal from which the echo signal has been cancelled;

17. An echo cancellation device, comprising:

18. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 16 when executed.

19. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 16.