CN111883154B - Echo cancellation method and device, computer-readable storage medium, and electronic device - Google Patents
- Publication number
- CN111883154B CN111883154B CN202010693855.2A CN202010693855A CN111883154B CN 111883154 B CN111883154 B CN 111883154B CN 202010693855 A CN202010693855 A CN 202010693855A CN 111883154 B CN111883154 B CN 111883154B
- Authority
- CN
- China
- Prior art keywords
- signal
- sound source
- sample
- echo
- source signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
Abstract
The invention provides an echo cancellation method and device, a computer-readable storage medium, and an electronic device. The echo cancellation method comprises the following steps: estimating an echo signal in a sound source signal according to a reference signal and echo detection information to obtain echo estimation information, where the sound source signal is an audio signal received by an audio input channel of a terminal, the reference signal is an audio signal in an audio output channel of the terminal, and the echo detection information indicates the probability that an echo signal is present in the sound source signal; and obtaining output information according to the sound source signal, the echo estimation information, and a preset first neural network model, and cancelling the echo signal in the sound source signal according to the output information. The invention solves the problem in the related art that residual echo remains after echo cancellation and degrades voice signal processing performance, thereby improving echo cancellation and, in turn, voice signal processing performance.
Description
Technical Field
The present invention relates to the field of audio signal processing, and in particular to an echo cancellation method and device, a computer-readable storage medium, and an electronic device.
Background
Voice signal processing is currently a key technology in the field of human-computer interaction. Within voice signal processing, an echo cancellation algorithm removes the device's own played-back audio as picked up by its microphone; it is a key technology for voice signal processing and voice enhancement as a whole, and is critically important for back-end voice recognition.
Fig. 1 is a schematic diagram of an echo cancellation method according to the related art. As shown in Fig. 1, the related art mainly adopts the echo cancellation method of the open-source tool Web Real-Time Communication (WebRTC): an adaptive filter estimates the echo so as to cancel the linear echo, and a nonlinear processing stage suppresses the residual nonlinear echo. This method cancels linear echo well, but nonlinear echo and delay-estimation errors introduce residual echo. Although the nonlinear processing can suppress this residual echo to a certain extent, the degree of suppression is limited, so some residual echo remains; in particular, for echo introduced by complex environments and nonlinear devices, the suppression is limited, which degrades the final echo cancellation effect and reduces voice signal processing performance.
No effective solution has yet been proposed in the related art for the problem that residual echo remains after echo cancellation and thereby degrades voice signal processing performance.
Disclosure of Invention
The embodiments of the present invention provide an echo cancellation method and device, a computer-readable storage medium, and an electronic device, to at least solve the problem in the related art that residual echo remains after echo cancellation and thereby degrades voice signal processing performance.
According to an embodiment of the present invention, there is provided an echo cancellation method including:
estimating an echo signal in the sound source signal according to the reference signal and the echo detection information to obtain echo estimation information; the sound source signal is an audio signal received by an audio input channel of the terminal, the reference signal is an audio signal in an audio output channel of the terminal, and the echo detection information is used for indicating the probability of existence of the echo signal in the sound source signal;
obtaining output information according to the sound source signal, the echo estimation information and a preset first neural network model, and eliminating echo signals in the sound source signal according to the output information; the first neural network model is trained according to the sample sound source signal, the sample echo signal and the sample output information.
According to another embodiment of the present invention, there is also provided an echo cancellation device including:
the estimation module is used for estimating the echo signals in the sound source signals according to the reference signals and the echo detection information so as to obtain echo estimation information; the sound source signal is an audio signal received by an audio input channel of the terminal, the reference signal is an audio signal in an audio output channel of the terminal, and the echo detection information is used for indicating the probability of existence of the echo signal in the sound source signal;
the elimination module is used for obtaining output information according to the sound source signal, the echo estimation information and a preset first neural network model, and eliminating echo signals in the sound source signal according to the output information; the first neural network model is trained according to the sample sound source signal, the sample echo signal and the sample output information.
According to another embodiment of the invention, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to another embodiment of the invention, there is also provided an electronic device comprising a memory in which a computer program is stored, and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the invention, the echo signal in the sound source signal can be estimated according to the reference signal and the echo detection information to obtain echo estimation information; output information is then obtained according to the sound source signal, the echo estimation information, and a preset first neural network model, and the echo signal in the sound source signal is cancelled according to the output information. Here, the sound source signal is an audio signal received by an audio input channel of a terminal, the reference signal is an audio signal in an audio output channel of the terminal, the echo detection information indicates the probability that an echo signal is present in the sound source signal, and the first neural network model is trained on a sample sound source signal, a sample echo signal, and sample output information. The invention therefore solves the problem in the related art that residual echo remains after echo cancellation and degrades voice signal processing performance, improving echo cancellation and, in turn, voice signal processing performance.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
fig. 1 is a schematic diagram of an echo cancellation method provided according to the related art;
fig. 2 is a functional schematic diagram (one) of an echo cancellation system according to an embodiment of the present application;
fig. 3 is a functional schematic diagram (two) of an echo cancellation system according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an echo cancellation system according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a room impulse response generating unit provided according to an embodiment of the present application;
fig. 6 is a flowchart of the operation of an echo cancellation system provided in accordance with an embodiment of the present application;
fig. 7 is a flowchart of an echo cancellation method provided according to an embodiment of the present application;
FIG. 8 is a flowchart of a method of training a neural network model provided in accordance with an embodiment of the present application;
FIG. 9 is a training schematic diagram of a training method of a neural network model according to an embodiment of the present application;
FIG. 10 is a flowchart of a method of training a neural network model provided in accordance with an embodiment of the present application;
FIG. 11 is a training schematic diagram of a training method of a neural network model according to an embodiment of the present application;
fig. 12 is a block diagram of an echo cancellation device according to an embodiment of the present application.
Detailed Description
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Example 1
The present embodiment provides an echo cancellation system, fig. 2 is a functional schematic diagram (one) of the echo cancellation system provided according to an embodiment of the present application, as shown in fig. 2, where the echo cancellation system in the present embodiment includes:
the cancellation unit 102, including a preset first neural network model, is configured to obtain output information according to the sound source signal, the echo estimation information and the first neural network model; the cancellation unit 102 is further configured to cancel an echo signal in the sound source signal according to the output information;
The sound source signal is an audio signal received by an audio input channel of the terminal, and the echo estimation information is used for indicating an estimated value for estimating the echo signal in the sound source signal;
the first neural network model is trained according to the sample sound source signal, the sample echo signal and the sample output information.
It should be further noted that the echo cancellation system in this embodiment is applied to a terminal with a voice signal processing function. The terminal in the above embodiment may be a mobile phone, a tablet computer, a PC, a smart speaker, a vehicle-mounted system with a voice interaction function, etc.; the present invention is not limited in this respect. In the above embodiment, the sound source signal is a signal received by the audio input channel of the terminal and may contain echo signals to be cancelled; the audio input channel of the terminal is the channel through which the terminal receives audio, for example, a microphone in a mobile phone.
It should be further noted that, in the above embodiment, since the first neural network model is obtained by training the sample sound source signal, the sample echo signal and the sample output information, the first neural network model may establish a relationship between the sample sound source signal, the sample echo signal and the sample output information; the sample echo signal corresponds to the echo estimation information in the above embodiment. Thus, after the current sound source signal and echo estimation information of the terminal are input into the first neural network model, corresponding output information can be obtained.
With the echo cancellation system in this embodiment, output information can be obtained by the cancellation unit according to the sound source signal, the echo estimation information, and the preset first neural network model, and the echo signal in the sound source signal is cancelled using the output information. The sound source signal is an audio signal received by an audio input channel of the terminal; the echo estimation information indicates an estimated value of the echo signal in the sound source signal; and the first neural network model is trained on a sample sound source signal, a sample echo signal, and sample output information. The echo cancellation system in this embodiment can therefore solve the problem in the related art that residual echo remains after echo cancellation and degrades voice signal processing performance, improving echo cancellation and, in turn, voice signal processing performance.
In an alternative embodiment, the echo cancellation system in this embodiment further includes:
an estimation unit 104, including a preset second neural network model, configured to obtain echo estimation information according to the sound source signal, the reference signal, the echo detection information and the second neural network model;
Wherein the reference signal is an audio signal in an audio output channel of the terminal, for example, an output channel of a loudspeaker, and the echo detection information is used for indicating the probability of existence of an echo signal in the sound source signal;
the second neural network model is obtained by training according to the sample sound source signal, the sample reference signal, the sample echo detection information and the sample echo signal.
In an alternative embodiment, the echo cancellation system in this embodiment further includes:
the detection unit 106 includes a preset third neural network model, and is configured to obtain echo detection information according to the sound source signal, the reference signal and the third neural network model;
the third neural network model is obtained by training according to the sample sound source signal, the sample reference signal and the sample echo detection information.
In the above alternative embodiments, the detection unit and the estimation unit may cooperate with the cancellation unit to form the echo cancellation system of this embodiment. Fig. 3 is a functional schematic diagram (two) of an echo cancellation system according to an embodiment of the present invention; the connections among the detection unit, the estimation unit, and the cancellation unit are shown in Fig. 3. Fig. 4 is a schematic structural diagram of the echo cancellation system according to an embodiment of the present invention; the connection structure of the detection unit, the estimation unit, and the cancellation unit is shown in Fig. 4.
It should be further noted that the reference signal indicates an audio signal in an audio output channel of the terminal, where the audio output channel is the channel through which the terminal plays audio, for example, the output channel of a speaker in a mobile phone. Specifically, the reference signal is the audio signal that the terminal is about to play through an audio device in the audio output channel, e.g., the audio signal just before it is played by the speaker.
It should be further noted that, in the above alternative embodiment, since the second neural network model is trained on the sample sound source signal, the sample reference signal, the sample echo detection information, and the sample echo signal, it can establish the relationship among these quantities. Thus, after the terminal's current sound source signal, reference signal, and echo detection information are input into the second neural network model, the corresponding echo signal estimate can be obtained; this estimate is the echo estimation information in this embodiment. Similarly, since the third neural network model is trained on the sample sound source signal, the sample reference signal, and the sample echo detection information, it can establish the relationship among these quantities; thus, after the terminal's current sound source signal and reference signal are input into the third neural network model, the corresponding echo detection information can be obtained.
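The detection, estimation, and cancellation units described above form a chained dataflow. A minimal sketch of that flow is given below; `detect_model`, `estimate_model`, and `cancel_model` are hypothetical placeholders standing in for the third, second, and first neural network models, not the patent's actual trained networks.

```python
def echo_cancel_pipeline(source, reference, detect_model, estimate_model, cancel_model):
    """Chain detection -> estimation -> cancellation, per the system description."""
    # Detection unit: probability that an echo (far-end) signal is present.
    echo_prob = detect_model(source, reference)
    # Estimation unit: estimate of the echo component in the source signal.
    echo_estimate = estimate_model(source, reference, echo_prob)
    # Cancellation unit: output information (here, per-sample gains) used to
    # attenuate the echo in the source signal.
    gains = cancel_model(source, echo_estimate)
    return [s * g for s, g in zip(source, gains)]

# Toy stand-ins, purely to exercise the dataflow:
clean = echo_cancel_pipeline(
    source=[1.0, 2.0, 3.0],
    reference=[0.5, 0.5, 0.5],
    detect_model=lambda s, r: 1.0,
    estimate_model=lambda s, r, p: [p * x for x in r],
    cancel_model=lambda s, e: [0.5, 0.5, 0.5],
)
```

The same three-stage order (detect, then estimate, then cancel) is what Figs. 3 and 4 depict.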
In an alternative embodiment, the first neural network model is a recurrent neural network (Recurrent Neural Network, RNN) model, the second neural network model is an RNN model, and the third neural network model is an RNN model.
It should be further noted that in the above alternative embodiments, the first, second, and third neural network models all use RNN models composed of gated recurrent units (Gated Recurrent Unit, GRU).
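To make the GRU structure mentioned above concrete, here is a minimal scalar GRU step in plain Python. It is an illustrative sketch of the standard GRU gating equations, not the patent's model; the weight names and values are invented, and biases are omitted for brevity.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(x, h_prev, w):
    """One scalar GRU step: update gate z, reset gate r, candidate state."""
    z = sigmoid(w["wz"] * x + w["uz"] * h_prev)               # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h_prev)               # reset gate
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h_prev))  # candidate state
    # Convex combination of old state and candidate -> temporal memory.
    return (1.0 - z) * h_prev + z * h_cand

# Illustrative weights; run a short input sequence through the cell.
w = {"wz": 0.5, "uz": 0.5, "wr": 0.5, "ur": 0.5, "wh": 1.0, "uh": 1.0}
h = 0.0
for x in [1.0, -0.5, 0.25]:
    h = gru_cell(x, h, w)
```

The update gate is what gives the cell the "temporal memory" the following paragraph relies on: when z is near 0 the previous state is carried forward almost unchanged.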
In the above alternative embodiment, since the signal processing of each unit is implemented with a recurrent neural network, the nonlinear characteristics of the recurrent neural network itself can be used to cancel the nonlinear echo in the echo signal. Further, since the first neural network model adopts a recurrent neural network with temporal memory, it can realize more complex nonlinear operations than the adaptive filtering method of the related art, and can better complete echo cancellation by exploiting the temporal characteristics of speech. Similarly, since the second neural network model adopts a recurrent neural network, its temporal memory can adapt to the delay of the echo; at the same time, its nonlinear characteristics allow the nonlinear echo to be estimated correctly, yielding a more accurate echo estimate.
Based on the above, an echo cancellation system built from recurrent neural networks improves robustness to echo delay estimation errors, thereby improving echo cancellation performance.
It should be further noted that, in the above embodiment, the output information indicates information or parameters that can cancel the echo signal in the sound source signal, for example, gain information for the sound source signal. The process by which the cancellation unit cancels the echo signal in the sound source signal based on the output information is further described below through alternative embodiments:
in an alternative embodiment, the cancellation unit 102 is further configured to,
dividing a sound source signal into a plurality of frequency bands according to a preset frequency band dividing mode, and determining a frequency band gain coefficient corresponding to each frequency band in the sound source signal according to echo estimation information and a first neural network model; the frequency band gain coefficient is output information;
and carrying out echo cancellation processing on the sound source signals corresponding to each frequency band according to the frequency band gain coefficients so as to obtain sound source signals for canceling the echo signals.
It should be further noted that, in the above alternative embodiment, the frequency band division may use Bark frequency bands; that is, the sound source signal is divided into 22 Bark frequency bands, yielding the sound source signal components corresponding to the 22 Bark bands. The band gain coefficient corresponding to the sound source signal in each Bark band can then be determined through the first neural network model; in this alternative embodiment, the band gain coefficients corresponding to the 22 Bark bands may serve as the output information in the above embodiment.
It should be further noted that the frequency band differentiating method in the above alternative embodiment may be other frequency band differentiating methods, which is not limited in the present invention.
In the above alternative embodiment, the echo cancellation processing of the sound source signal in each frequency band according to the band gain coefficients may specifically be: transform each frame of audio in the sound source signal to the frequency domain by the short-time Fourier transform, multiply each frequency band by its corresponding band gain coefficient, and then transform back to the time domain by the inverse short-time Fourier transform, completing the echo cancellation processing of the sound source signal.
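The frequency-domain gain step described above can be sketched as follows. This is a minimal illustration, assuming the STFT of one frame has already been computed and the band boundaries (e.g., Bark band edges) are given as bin indices; the toy frame and edges are invented values.

```python
def apply_band_gains(spectrum, band_edges, band_gains):
    """Scale each frequency bin of one STFT frame by its band's gain.

    spectrum   : list of complex STFT bins for one frame
    band_edges : bin index where each band starts; last entry = len(spectrum)
    band_gains : one gain in [0, 1] per band, e.g. the first model's output
    """
    out = list(spectrum)
    for k in range(len(band_gains)):
        for m in range(band_edges[k], band_edges[k + 1]):
            out[m] = spectrum[m] * band_gains[k]
    return out

# Toy example: 6 bins split into 3 bands of 2 bins each.
frame = [complex(1, 0)] * 6
gained = apply_band_gains(frame, band_edges=[0, 2, 4, 6], band_gains=[1.0, 0.5, 0.0])
```

The inverse STFT of the gained frame (not shown here) then yields the time-domain signal with the echo attenuated.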
In an alternative embodiment, the cancellation unit 102 is further configured to,
dividing a sound source signal into a plurality of frequency bands according to a preset frequency band dividing mode, and determining a frequency band gain coefficient corresponding to each frequency band in the sound source signal according to echo estimation information and a first neural network model;
determining a frequency point gain coefficient corresponding to each frequency point in each frequency band of the sound source signal according to the frequency band gain coefficients; the frequency point gain coefficient is output information;
and carrying out echo cancellation processing on the sound source signal corresponding to each frequency point in each frequency band by multiplying the gain coefficient corresponding to the frequency point so as to obtain the sound source signal for canceling the echo signal.
It should be further noted that, in the above alternative embodiment, the frequency band division may likewise use Bark frequency bands; that is, the sound source signal is divided into 22 Bark frequency bands, yielding the sound source signal components corresponding to the 22 Bark bands, and the band gain coefficient for each Bark band is determined through the first neural network model. Once the band gain coefficient for each Bark band is determined, the frequency point gain coefficient corresponding to each frequency point of the sound source signal within each Bark band can be further determined; in this alternative embodiment, the frequency point gain coefficients of the frequency points in each of the 22 Bark bands may serve as the output information in the above embodiment.
The frequency point gain coefficient corresponding to each frequency point of the sound source signal in each Bark frequency band can be determined by the following formula:

g_k(m) = (1 − m/M) · g_k + (m/M) · g_{k+1}

In the above, g_k(m) represents the gain coefficient of the m-th frequency point of the k-th frequency band, g_k and g_{k+1} represent the gain coefficients of the k-th and (k+1)-th frequency bands respectively, m is the index of the frequency point within the k-th frequency band, and M represents the length of the k-th frequency band.
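A minimal sketch of linear interpolation between adjacent band gains, consistent with the symbol definitions above (g_k, g_{k+1}, m, M); treating the interpolation as linear is an assumption made here for illustration.

```python
def bin_gain(g_k, g_k1, m, M):
    """Per-bin gain for bin m of band k, interpolated linearly between the
    band-k gain g_k and the band-(k+1) gain g_k1 over a band of length M."""
    return (1.0 - m / M) * g_k + (m / M) * g_k1

# Bin gains across a band of length M = 4, between band gains 1.0 and 0.0:
gains = [bin_gain(1.0, 0.0, m, 4) for m in range(4)]
```

Interpolating per bin avoids abrupt gain steps at band boundaries, which is what motivates frequency point gains over raw band gains in the next paragraphs.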
It should be further noted that the frequency band differentiating method in the above alternative embodiment may be other frequency band differentiating methods, which is not limited in the present invention.
In the above alternative embodiment, the echo cancellation processing of the sound source signal at each frequency point, by multiplying by the corresponding frequency point gain coefficient, may specifically be: transform each frame of audio in the sound source signal to the frequency domain by the short-time Fourier transform, and multiply each frequency point by its corresponding frequency point gain coefficient, so as to rapidly adjust the level of each frequency band, attenuating the far-end signal (i.e., the echo signal) in the sound source signal while letting the near-end signal pass. The processed sound source signal is then transformed back to the time domain by the inverse short-time Fourier transform, completing the echo cancellation processing of the sound source signal.
In the above alternative embodiment, determining the frequency point gain coefficient for each frequency point in each frequency band allows each frame of the sound source signal to be processed in a targeted manner, replacing the band gain coefficients of the foregoing alternative embodiment; this further improves both the echo cancellation effect and the fidelity of the processed sound source signal.
In the two alternative embodiments above, whether the frequency band gain or the frequency point gain is used as the output information, the amount of computation involved in using the gain to cancel the echo signal in the sound source signal is significantly lower than that of the filtering processing in the related art. Meanwhile, since the gain coefficients are distributed between 0 and 1, a sigmoid activation function, whose output is likewise distributed between 0 and 1, can be used to calculate them; the resulting function model is more concise than that of the related art, and the accuracy of the calculated gain coefficients is improved. In addition, when the frequency band gain or the frequency point gain is used to cancel the echo signal in the sound source signal, no isolated single tone can be passed through on its own, so the musical-noise artifacts common in the related art are not produced.
It should be further noted that in an alternative embodiment, the sound source signal after echo cancellation may additionally be processed with a comb filter, for example to cancel harmonic echo components that may remain within a fundamental-frequency period.
In an alternative embodiment, the sound source signal includes: a near-end signal, or a near-end signal and a far-end signal; wherein the far-end signal is used for indicating an echo signal in the sound source signal;
The detection unit is further configured to detect whether the sound source signal includes a far-end signal according to the sound source signal, the reference signal and the third neural network model;
in the case where the sound source signal includes only the near-end signal, the system is further configured to output the sound source signal to an audio output channel of the terminal; or,
in the case that the sound source signal includes a near-end signal and a far-end signal, the detection unit is further configured to obtain echo detection information according to the sound source signal, the reference signal and the third neural network model.
It should be further noted that in the above alternative embodiment, the far-end signal in the sound source signal is the echo signal that may exist in it. Therefore, when the detection unit detects that the sound source signal includes only the near-end signal, i.e., that the sound source signal contains no echo signal to be cancelled, no echo cancellation processing is required; in this case the sound source signal may be output directly to the audio output channel of the terminal, and the estimation unit and cancellation unit downstream of the detection unit do not process it. When the detection unit detects that the sound source signal includes the far-end signal, the far-end signal is cancelled according to the operation of the estimation unit and cancellation unit in the foregoing alternative embodiments.
It should be further noted that the detection unit's determination of whether the sound source signal includes a far-end signal is implemented on the basis of the third neural network model. According to the foregoing alternative embodiment, after the current sound source signal and the reference signal of the terminal are input to the third neural network model, corresponding echo detection information is obtained. The probability, indicated by the echo detection information, that an echo signal exists in the sound source signal can then be compared with a preset threshold; if the probability corresponding to the far-end signal is smaller than the preset threshold, the far-end signal is deemed absent.
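The threshold comparison described above might look like this (the 0.5 default is an assumed value; the text only says "preset threshold"):

```python
def far_end_present(echo_probability, threshold=0.5):
    """Decide whether the sound source signal contains a far-end (echo)
    component, by comparing the third model's output probability against
    a preset threshold.  The threshold value here is an assumption."""
    return echo_probability >= threshold
```

When this returns False, the sound source signal is routed straight to the audio output channel; when True, it proceeds to the estimation and cancellation units.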
In an alternative embodiment, the sample echo signal is derived from a sample reference signal and a predetermined room impulse response.
It should be further noted that the sample echo signal used in training the first and second neural network models may be obtained from a sample reference signal, sampled in the audio output channel of the terminal, and a preset room impulse response. Specifically, the sample reference signal may be convolved with the room impulse response to obtain the sample echo signal, or the two may be multiplied in the frequency domain to obtain it.
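Both constructions of the sample echo signal — time-domain convolution and frequency-domain multiplication — can be sketched and checked for equivalence (function and argument names are illustrative):

```python
import numpy as np

def synthesize_sample_echo(reference, rir):
    """Generate a sample echo signal from a sample reference signal and a
    room impulse response, either by time-domain convolution or by
    multiplication in the frequency domain; with sufficient zero-padding
    the two are equivalent."""
    time_domain = np.convolve(reference, rir)       # linear convolution
    n = len(time_domain)                            # pad to avoid circular wrap
    freq_domain = np.fft.irfft(
        np.fft.rfft(reference, n) * np.fft.rfft(rir, n), n)
    return time_domain, freq_domain
```

The frequency-domain route is typically cheaper for long impulse responses.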
In an alternative embodiment, the room impulse response may be generated by a room impulse response generating unit. Fig. 5 is a schematic structural diagram of the room impulse response generating unit provided according to an embodiment of the present invention. As shown in fig. 5, the unit is composed of a plurality of filters, specifically a linear filter and a nonlinear filter connected in series. It should be further noted that this room impulse response generating unit is only one alternative; any unit in the art capable of simulating the impulse response of a room may serve as the room impulse response generating unit in the embodiment of the present invention.
As shown in fig. 5, the linear filter is implemented as a finite impulse response (FIR) filter, which simulates the impulse response of a room. The maximum delay of the impulse response may be set according to the application scenario, or set to a general-purpose range. The tap coefficients of the impulse response decay with the square of time: each tap coefficient is proportional to 1/(c²t²), where c denotes the speed of sound and t denotes the delay of the tap. The nonlinear filter is implemented as an infinite impulse response (IIR) filter, which simulates the nonlinear factors introduced by the real environment.
In an alternative embodiment, the sample sound source signal comprises: a sample far-end signal and a sample near-end signal;
the sample far-end signal is obtained by a sample reference signal and a room impulse response, and the sample near-end signal is obtained by a pure audio signal and a noise signal.
It should be further noted that the sample sound source signals used in training the first, second and third neural network models are composed of two parts: a sample far-end signal and a sample near-end signal. The sample far-end signal is obtained from the sample reference signal, sampled in the audio output channel of the terminal, and the preset room impulse response; as with the sample echo signal, the sample reference signal may be convolved with the room impulse response, or the two may be multiplied in the frequency domain. The sample near-end signal is generated by superimposing different types of noise signals on a clean audio signal.
In an alternative embodiment, the echo cancellation system in this embodiment further includes:
The input processing unit is configured to acquire a sound source signal and a reference signal, determine sound source characteristics according to the sound source signal and determine reference characteristics according to the reference signal;
the cancellation unit is further configured to obtain output information according to the sound source characteristics, the echo estimation information and the first neural network model;
the estimation unit is further configured to obtain echo estimation information according to the sound source characteristics, the reference characteristics, the echo detection information and the second neural network model;
the detection unit is further configured to obtain echo detection information according to the sound source characteristics, the reference characteristics and the third neural network model.
It should be further noted that, in this embodiment, the inputs actually supplied to the cancellation unit, the estimation unit and the detection unit are features of the corresponding audio signals. Specifically, the input processing unit extracts the features of the sound source signal and the reference signal, which then serve as the sound source features and reference features for subsequent echo cancellation processing. The input processing unit in the above alternative embodiment may be a virtual unit, i.e., integrated in the processor of the terminal. The feature extraction process of the input processing unit is described below through alternative embodiments:
In an alternative embodiment, the sound source features include at least sound source frequency domain features and sound source pitch features; the reference features include at least reference frequency domain features and reference pitch features; the input processing unit is further configured to,
acquiring a sound source signal and a reference signal, and performing framing and windowing processing on the sound source signal and the reference signal respectively;
transforming the processed sound source signal to a frequency domain to extract sound source frequency domain characteristics, and performing tone analysis on the processed sound source signal to determine sound source tone characteristics;
the processed reference signal is transformed to the frequency domain to extract reference frequency domain features and pitch analysis is performed on the processed reference signal to determine reference pitch features.
It should be further noted that, in the above alternative embodiment, the framing and windowing of the sound source signal and the reference signal can effectively eliminate spectral discontinuities at frame boundaries; the transformation of the processed sound source signal or reference signal to the frequency domain may be realized by the short-time Fourier transform (STFT).
In an alternative embodiment, the sound source frequency domain features include at least: a plurality of Bark-frequency cepstral coefficient (BFCC) frequency domain features of the sound source signal, first-order differential information of the plurality of BFCC frequency domain features of the sound source signal, and second-order differential information of the plurality of BFCC frequency domain features of the sound source signal;
The sound source tone characteristics include at least: discrete Cosine Transform (DCT) information of a plurality of operation coefficients corresponding to the tone of the sound source signal, the tone period dynamic characteristic of the sound source signal and the tone frequency spectrum dynamic characteristic of the sound source signal;
the reference frequency domain features include at least: the method comprises the steps of a plurality of BFCC frequency domain characteristics of a reference signal, first-order differential information of the plurality of BFCC frequency domain characteristics of the reference signal, and second-order differential information of the plurality of BFCC frequency domain characteristics of the reference signal;
the reference tone characteristics include at least: DCT information for a plurality of operation coefficients corresponding to a pitch of a reference signal, a pitch period dynamic characteristic of the reference signal, and a pitch spectrum dynamic characteristic of the reference signal.
It should be further noted that the BFCC frequency domain features indicate the characteristics of 22 Bark frequency bands, so in the above alternative embodiment there may be 22 BFCC frequency domain features of the sound source signal. The first-order differential information may use the first-order differences of the first 6 of the 22 BFCC frequency domain features of the sound source signal, and the second-order differential information may use the second-order differences of the same first 6 features. Meanwhile, the Discrete Cosine Transform (DCT) information of the operation coefficients corresponding to the tone of the sound source signal may use the DCT information of the first 6 pitch-related operation coefficients.
Similarly, the BFCC frequency domain features of the reference signal may be 22; the first order differential information of the plurality of BFCC frequency domain features of the reference signal may employ a first order differential of the first 6 BFCC frequency domain features of the 22 BFCC frequency domain features of the reference signal, and the second order differential information of the plurality of BFCC frequency domain features of the reference signal may employ a second order differential of the first 6 BFCC frequency domain features of the 22 BFCC frequency domain features of the reference signal. Meanwhile, discrete Cosine Transform (DCT) information of a plurality of operation coefficients corresponding to the tones of the reference signal may employ DCT information of the first 6 Pitch related operation coefficients.
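Assembling the per-frame input vector from the features enumerated above could look like the following sketch; the 42-feature layout mirrors the RNNoise feature set, which this enumeration closely resembles, and is an assumption rather than something the text states:

```python
import numpy as np

def assemble_features(bfcc_t, bfcc_t1, bfcc_t2,
                      pitch_dct6, pitch_period, pitch_corr):
    """Per-frame feature vector: 22 BFCCs, first-order differences of the
    first 6 BFCCs, second-order differences of the first 6 BFCCs, 6 DCT
    coefficients of pitch-related operation coefficients, plus pitch-period
    and pitch-spectrum dynamics -- 42 features in total (assumed layout).

    bfcc_t, bfcc_t1, bfcc_t2 : 22 BFCCs of the current and two previous frames
    """
    first_diff = bfcc_t[:6] - bfcc_t1[:6]                    # delta features
    second_diff = bfcc_t[:6] - 2 * bfcc_t1[:6] + bfcc_t2[:6] # delta-delta
    return np.concatenate([bfcc_t, first_diff, second_diff,
                           pitch_dct6, [pitch_period], [pitch_corr]])
```

The same assembly would be applied to the reference signal to form the reference features.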
It should be further noted that in the above alternative embodiment, using features of the sound source signal and the reference signal as the input for echo cancellation avoids the large number of outputs that a large number of neurons would otherwise produce during neural network processing; compared with the related art, in which signal samples or the signal spectrum are used directly as the echo cancellation input, the computational load of the system is further reduced.
In an alternative embodiment, the echo cancellation system in this embodiment further includes:
the output processing unit is configured to acquire a first output audio signal, filter the first output audio signal according to the sound source tone characteristic, and convert the filtered first output audio signal into a time domain so as to acquire a second output audio signal; wherein the first output audio signal is for indicating a sound source signal that cancels the echo signal;
The output processing unit is further configured to output the second output audio signal to an audio output channel of the terminal.
Fig. 6 is a flowchart of an echo cancellation system according to an embodiment of the present invention, in which the operation flows of an input processing unit, a detection unit, an estimation unit, a cancellation unit, and an output processing unit are shown in fig. 6.
It should be further noted that, in the above alternative embodiments, the output processing unit may be a virtual unit, that is, integrated in the processor of the terminal. The output processing unit filters the first output audio signal according to the tone characteristic of the sound source, so that the tone characteristic of a near-end signal in the sound source signal can be kept, and the integrity of the audio is better kept.
Example 2
The present embodiment provides an echo cancellation method, and fig. 7 is a flowchart of the echo cancellation method provided according to an embodiment of the present invention, as shown in fig. 7, where the echo cancellation method in the present embodiment includes:
s202, estimating echo signals in the sound source signals according to the reference signals and the echo detection information to obtain echo estimation information; the sound source signal is an audio signal received by an audio input channel of the terminal, the reference signal is an audio signal in an audio output channel of the terminal, and the echo detection information is used for indicating the probability of existence of an echo signal in the sound source signal;
S204, obtaining output information according to the sound source signal, the echo estimation information and a preset first neural network model, and eliminating echo signals in the sound source signal according to the output information; the first neural network model is trained according to the sample sound source signal, the sample echo signal and the sample output information.
It should be further noted that, the other optional embodiments and technical effects of the echo cancellation method in this embodiment correspond to those of the echo cancellation system in embodiment 1, so that the description thereof is omitted herein.
In an optional embodiment, in step S202, the estimating the echo signal in the sound source signal according to the reference signal and the echo detection information to obtain the echo estimation information includes:
obtaining echo estimation information according to the sound source signal, the reference signal, the echo detection information and the second neural network model;
the second neural network model is obtained by training according to the sample sound source signal, the sample reference signal, the sample echo detection information and the sample echo signal.
In an optional embodiment, in step S202, before obtaining the echo estimation information according to the sound source signal, the reference signal, the echo detection information and the second neural network model, the method further includes:
Obtaining echo detection information according to the sound source signal, the reference signal and a preset third neural network model;
the third neural network model is obtained by training according to the sample sound source signal, the sample reference signal and the sample echo detection information.
In an optional embodiment, in step S204, obtaining output information according to the sound source signal, the echo estimation information and the preset first neural network model includes:
dividing a sound source signal into a plurality of frequency bands according to a preset frequency band dividing mode, and determining a frequency band gain coefficient corresponding to each frequency band in the sound source signal according to echo estimation information and a first neural network model; the frequency band gain coefficient is output information;
and carrying out echo cancellation processing on the sound source signal corresponding to each frequency point in each frequency band by multiplying the gain coefficient corresponding to the frequency point so as to obtain the sound source signal for canceling the echo signal.
In an optional embodiment, in step S204, obtaining output information according to the sound source signal, the echo estimation information and the preset first neural network model includes:
dividing a sound source signal into a plurality of frequency bands according to a preset frequency band dividing mode, and determining a frequency band gain coefficient corresponding to each frequency band in the sound source signal according to echo estimation information and a first neural network model;
Determining a frequency point gain coefficient corresponding to each frequency point in each frequency band of the sound source signal according to the frequency band gain coefficients; the frequency point gain coefficient is output information;
and carrying out echo cancellation processing on the sound source signal corresponding to each frequency point in each frequency band according to the frequency point gain coefficient so as to obtain the sound source signal for canceling the echo signal.
In an alternative embodiment, the sound source signal includes: a near-end signal, or a near-end signal and a far-end signal; wherein the far-end signal is used for indicating an echo signal in the sound source signal;
before the echo detection information is obtained according to the sound source signal, the reference signal and the preset third neural network model, the method further comprises:
detecting whether the sound source signal comprises a far-end signal or not according to the sound source signal, the reference signal and the third neural network model;
in the case where the sound source signal includes only the near-end signal, outputting the sound source signal to an audio output channel of the terminal; or,
under the condition that the sound source signal comprises a near-end signal and a far-end signal, echo detection information is obtained according to the sound source signal, the reference signal and the third neural network model.
In an alternative embodiment, the echo cancellation method in the embodiment of the present invention further includes:
Acquiring a sample sound source signal; the sample sound source signal is obtained by superposing a sample far-end signal and a sample near-end signal, and the sample far-end signal is used for indicating a sample echo signal;
determining a gain coefficient of the sample sound source signal, and taking the gain coefficient as sample output information;
and establishing a first neural network model according to the relation between the sample sound source signal and the sample output information.
In an alternative embodiment, the acquiring the sample sound source signal includes:
acquiring a sample reference signal, and obtaining a sample far-end signal according to the sample reference signal and a preset room impulse response;
acquiring a sample pure audio signal and a preset sample noise signal, and overlapping the sample pure audio signal and the sample noise signal to obtain a sample near-end signal;
the sample far-end signal and the sample near-end signal are superimposed to obtain a sample sound source signal.
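The superposition steps above can be sketched as follows (zero-padding of the shorter signal is an assumption for illustration):

```python
import numpy as np

def synthesize_sample_source(far_end, near_end):
    """Superimpose the sample far-end signal (reference convolved with a
    room impulse response) and the sample near-end signal (clean audio
    plus noise) to obtain the sample sound source signal for training."""
    n = max(len(far_end), len(near_end))
    out = np.zeros(n)
    out[:len(far_end)] += far_end
    out[:len(near_end)] += near_end
    return out
```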
In an alternative embodiment, the room impulse response is generated by a preset room impulse response generating unit, wherein the room impulse response generating unit is formed by connecting a linear filter and a nonlinear filter in series.
In an alternative embodiment, the determining the gain factor of the sample sound source signal includes:
Acquiring first frequency band energy of a sample pure audio signal and acquiring second frequency band energy of a sample sound source signal;
and determining the gain coefficient of the sample sound source signal according to the first frequency band energy and the second frequency band energy.
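The text does not spell out how the two band energies combine into a gain; a common choice (as in RNNoise-style training) is the square root of the clean-to-mixed energy ratio, clipped to [0, 1] to match a sigmoid output — shown here as an assumed sketch:

```python
import numpy as np

def sample_gain(clean_band_energy, source_band_energy, eps=1e-12):
    """Per-band training gain: square root of the ratio of the clean
    (near-end) signal's band energy to the mixed sample sound source
    signal's band energy.  The square root converts an energy ratio to
    an amplitude gain; clipping keeps the label in [0, 1]."""
    g = np.sqrt(clean_band_energy / (source_band_energy + eps))
    return np.clip(g, 0.0, 1.0)
```

Bands dominated by echo or noise thus get labels near 0, clean bands near 1.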
In the above alternative embodiments, the process of training the first neural network model follows the training method of the first neural network model described in embodiment 3, and is therefore not repeated here.
In an alternative embodiment, the echo cancellation method in the embodiment of the present invention further includes:
acquiring a sample reference signal and a sample sound source signal; the sample sound source signal is obtained by superposing a sample far-end signal and a sample near-end signal, and the sample far-end signal is used for indicating a sample echo signal;
determining a first label and a second label; the first label is used for indicating the probability of the existence of the audio in the sample reference signal, and the second label is used for indicating the probability of the existence of at least part of the audio in the sample sound source signal;
and establishing a third neural network model according to the relation between the sample reference signal and the first label and the relation between the sample sound source signal and the second label.
In an alternative embodiment, the acquiring the sample sound source signal includes:
acquiring a sample reference signal, and obtaining a sample far-end signal according to the sample reference signal and a preset room impulse response;
acquiring a sample pure audio signal and a preset sample noise signal, and overlapping the sample pure audio signal and the sample noise signal to obtain a sample near-end signal;
the sample far-end signal and the sample near-end signal are superimposed to obtain a sample sound source signal.
In an alternative embodiment, the room impulse response is generated by a preset room impulse response generating unit, wherein the room impulse response generating unit is formed by connecting a linear filter and a nonlinear filter in series.
In an alternative embodiment, the determining the first tag and the second tag includes:
framing the sample reference signal, and determining the reference audio energy corresponding to each frame of audio in the sample reference signal;
according to the relation between the reference audio energy and a preset threshold value, determining the probability of each frame of audio in the sample reference signal, and marking the probability of each frame of audio in the sample reference signal as a first label;
carrying out framing treatment on the sample pure audio signal, and determining sound source audio energy corresponding to each frame of audio in the sample pure audio signal;
And determining the existence probability of each frame of audio in the sample pure audio signal according to the relation between the sound source audio energy and the preset threshold value, and setting the existence probability of each frame of audio in the sample pure audio signal as a second label.
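The framing-and-threshold labelling described above could be sketched as follows; the hard 0/1 probabilities and the specific threshold are assumptions:

```python
import numpy as np

def frame_presence_labels(signal, frame_len, energy_threshold):
    """Frame the signal, compute per-frame energy, and label each frame
    with the probability that audio is present.  Hard 0/1 labels from an
    energy threshold are shown; the threshold value is an assumption."""
    num_frames = len(signal) // frame_len
    frames = signal[:num_frames * frame_len].reshape(num_frames, frame_len)
    energies = np.sum(frames ** 2, axis=1)
    return (energies > energy_threshold).astype(float)
```

Applied to the sample reference signal this yields the first label, and applied to the sample clean audio signal it yields the second label.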
In the above alternative embodiments, the process of training the third neural network model follows the training method of the third neural network model described in embodiment 4, and is therefore not repeated here.
In an alternative embodiment, the echo cancellation method in the embodiment of the present invention further includes:
acquiring a sound source signal and a reference signal, determining sound source characteristics according to the sound source signal, and determining reference characteristics according to the reference signal;
obtaining echo detection information according to the sound source characteristics, the reference characteristics and the third neural network model;
obtaining echo estimation information according to the sound source characteristics, the reference characteristics, the echo detection information and the second neural network model;
and obtaining output information according to the sound source characteristics, the echo estimation information and the first neural network model.
In an alternative embodiment, the sound source features at least include a sound source frequency domain feature and a sound source tone feature; the reference features include at least reference frequency domain features and reference pitch features;
Acquiring a sound source signal and a reference signal, determining sound source characteristics according to the sound source signal, and determining reference characteristics according to the reference signal, and further comprising:
acquiring a sound source signal and a reference signal, and performing framing and windowing processing on the sound source signal and the reference signal respectively;
transforming the processed sound source signal to a frequency domain to extract sound source frequency domain characteristics, and performing tone analysis on the processed sound source signal to determine sound source tone characteristics;
the processed reference signal is transformed to the frequency domain to extract reference frequency domain features and pitch analysis is performed on the processed reference signal to determine reference pitch features.
In an optional embodiment, the above sound source frequency domain features include at least: a plurality of Bark-frequency cepstral coefficient (BFCC) frequency domain features of the sound source signal, first-order differential information of the plurality of BFCC frequency domain features of the sound source signal, and second-order differential information of the plurality of BFCC frequency domain features of the sound source signal;
the sound source tone characteristics include at least: discrete Cosine Transform (DCT) information of a plurality of operation coefficients corresponding to the tone of the sound source signal, the tone period dynamic characteristic of the sound source signal and the tone frequency spectrum dynamic characteristic of the sound source signal;
the reference frequency domain features include at least: the method comprises the steps of a plurality of BFCC frequency domain characteristics of a reference signal, first-order differential information of the plurality of BFCC frequency domain characteristics of the reference signal, and second-order differential information of the plurality of BFCC frequency domain characteristics of the reference signal;
The reference tone characteristics include at least: DCT information for a plurality of operation coefficients corresponding to a pitch of a reference signal, a pitch period dynamic characteristic of the reference signal, and a pitch spectrum dynamic characteristic of the reference signal.
In an optional embodiment, after obtaining the output information according to the sound source signal, the echo estimation information and the preset first neural network model, the method further includes:
acquiring a first output audio signal, filtering the first output audio signal according to the tone characteristic of a sound source, and converting the filtered first output audio signal into a time domain to obtain a second output audio signal; wherein the first output audio signal is for indicating a sound source signal that cancels the echo signal;
and outputting the second output audio signal to an audio output channel of the terminal.
In an alternative embodiment, the first neural network model is a recurrent neural network RNN model, the second neural network model is an RNN model, and the third neural network model is an RNN model.
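A minimal recurrent building block such RNN models could use is a GRU cell; the sketch below is illustrative only (sizes, initialization, and the absence of input/output layers are assumptions — the text specifies only that RNN models are used):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell, a recurrent building block an RNN-based gain
    estimator could use.  A full model would stack such cells and add a
    sigmoid output layer mapping the hidden state to per-band gains
    in (0, 1), trained on the sample data described above."""
    def __init__(self, in_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(in_dim + hid_dim)
        self.W = rng.uniform(-scale, scale, (3, hid_dim, in_dim))
        self.U = rng.uniform(-scale, scale, (3, hid_dim, hid_dim))
        self.h = np.zeros(hid_dim)

    def step(self, x):
        z = sigmoid(self.W[0] @ x + self.U[0] @ self.h)   # update gate
        r = sigmoid(self.W[1] @ x + self.U[1] @ self.h)   # reset gate
        h_tilde = np.tanh(self.W[2] @ x + self.U[2] @ (r * self.h))
        self.h = (1 - z) * self.h + z * h_tilde           # blend old/new state
        return self.h
```

The recurrence lets the model exploit temporal context across frames, which is why an RNN rather than a feed-forward network is chosen here.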
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general-purpose hardware platform, or by means of hardware; in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present invention, or the part of it contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods of the embodiments of the present invention.
Example 3
The present embodiment provides a training method of a neural network model, which is used to implement the training of the first neural network model described in embodiment 2, and fig. 8 is a flowchart of the training method of the neural network model provided in the embodiment of the present invention, as shown in fig. 8, where the training method of the neural network model in the embodiment includes:
s302, acquiring a sample sound source signal; the sample sound source signal is obtained by superposing a sample far-end signal and a sample near-end signal, and the sample far-end signal is used for indicating a sample echo signal;
s304, determining a gain coefficient of the sample sound source signal, and taking the gain coefficient as sample output information;
s306, a first neural network model is established according to the relation between the sample sound source signal and the sample output information.
In this embodiment, the sample sound source signal, the sample far-end signal, the sample near-end signal, and the sample output information correspond to the sound source signal, the far-end signal, the near-end signal, and the output information described in embodiment 1, respectively, that is, the sample sound source signal, the sample far-end signal, the sample near-end signal, and the sample output information are a plurality of samples of the sound source signal, the far-end signal, the near-end signal, and the output information, respectively.
In an optional embodiment, in step S302, acquiring the sample sound source signal includes:
acquiring a sample reference signal, and obtaining a sample far-end signal according to the sample reference signal and a preset room impulse response;
acquiring a sample pure audio signal and a preset sample noise signal, and overlapping the sample pure audio signal and the sample noise signal to obtain a sample near-end signal;
the sample far-end signal and the sample near-end signal are superimposed to obtain a sample sound source signal.
It should be further noted that, in the above-mentioned alternative embodiment, the sample far-end signal may be obtained either by convolving the sample reference signal with the preset room impulse response in the time domain, or by multiplying the sample reference signal and the room impulse response in the frequency domain.
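As a sketch of this sample-construction pipeline (the signal lengths, the random toy room impulse response, and the noise level are invented for illustration; they are not taken from the patent), the two equivalent ways of forming the sample far-end signal, followed by the superposition steps, could look like:

```python
import numpy as np

rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)  # sample reference signal x(n)
# Toy room impulse response: exponentially decaying random reflections.
rir = rng.standard_normal(128) * np.exp(-np.arange(128) / 16.0)

# Time-domain route: far-end signal d(n) = x(n) convolved with h(n).
far_time = np.convolve(ref, rir)

# Frequency-domain route: multiply the spectra. Zero-padding both signals to
# the full linear-convolution length avoids circular-convolution wrap-around.
n = len(ref) + len(rir) - 1
far_freq = np.fft.irfft(np.fft.rfft(ref, n) * np.fft.rfft(rir, n), n)

# Near-end signal: sample pure (clean) audio signal plus sample noise signal.
clean = rng.standard_normal(n)
noise = 0.1 * rng.standard_normal(n)
near = clean + noise

# Sample sound source signal: superposition of far-end and near-end signals.
source = far_time + near
```

The equivalence of the two routes is just the convolution theorem; either may be chosen for efficiency.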
In an alternative embodiment, the room impulse response is generated by a preset room impulse response generating unit, wherein the room impulse response generating unit is formed by connecting a linear filter and a nonlinear filter in series.
It should be further noted that the room impulse response generating unit in the above alternative embodiment corresponds to the room impulse response generating unit in embodiment 1, so that a detailed description thereof is omitted herein.
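A minimal sketch of such a series linear-plus-nonlinear generating unit follows. The exponential-decay reflection model, the tanh saturation (mimicking loudspeaker distortion), and all constants are illustrative assumptions, not the patent's actual generator:

```python
import numpy as np

def linear_rir(length=256, decay=32.0, seed=0):
    # Linear part: exponentially decaying random reflections (toy model).
    rng = np.random.default_rng(seed)
    return rng.standard_normal(length) * np.exp(-np.arange(length) / decay)

def nonlinear(x, alpha=0.5):
    # Nonlinear part: memoryless tanh saturation; output magnitude is
    # bounded by 1/alpha regardless of input level.
    return np.tanh(alpha * x) / alpha

def simulate_far_end(ref, rir):
    # Series connection: nonlinearity applied to the reference,
    # then convolution with the linear room impulse response.
    return np.convolve(nonlinear(ref), rir)

ref = np.sin(2 * np.pi * 440 * np.arange(1600) / 16000)  # 0.1 s, 440 Hz tone
far = simulate_far_end(ref, linear_rir())
```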
In an optional embodiment, in step S304, determining the gain coefficient of the sample sound source signal includes:
acquiring first frequency band energy of a sample pure audio signal and acquiring second frequency band energy of a sample sound source signal;
and determining the gain coefficient of the sample sound source signal according to the first frequency band energy and the second frequency band energy.
It should be further noted that in the above alternative embodiment, the first frequency band energy of the sample clean audio signal is denoted E_{s,k}, and the second frequency band energy of the sample sound source signal is denoted E_{m,k}; the gain coefficient of the sample sound source signal satisfies the following formula:
the gain coefficient is a label of the sample sound source signal in the training process of the first neural network model, namely sample output information.
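The formula itself did not survive extraction. An RNNoise-style per-band gain, the square root of the clean-to-mixture band-energy ratio, is a plausible reconstruction (an assumption, not confirmed by the patent text); the label computation could then be sketched as:

```python
import numpy as np

def band_gain(e_clean, e_mix, eps=1e-12):
    # Assumed per-band target gain: g_k = sqrt(E_{s,k} / E_{m,k}),
    # clipped to [0, 1]; eps guards against division by zero.
    g = np.sqrt(e_clean / (e_mix + eps))
    return np.clip(g, 0.0, 1.0)

# Toy band energies for a 4-band example.
e_s = np.array([1.0, 0.25, 0.0, 4.0])  # E_{s,k}: clean-signal band energy
e_m = np.array([4.0, 1.0, 1.0, 4.0])   # E_{m,k}: mixture band energy
g = band_gain(e_s, e_m)
```

A gain of 1 keeps a band (near-end dominated), a gain near 0 suppresses it (echo dominated), which matches the echo-cancellation objective described above.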
It should be further noted that in step S306, the sample sound source signal is input in the form of features; that is, before the first neural network model is trained on the sample sound source signal and the sample output information, features of the sample sound source signal need to be extracted. The features and the feature extraction manner correspond to the sound source features and the extraction manner of the sound source signal described in embodiment 1, and are therefore not repeated here. Fig. 9 is a training schematic diagram of the training method of the neural network model according to an embodiment of the present invention; the training process of the method is shown in fig. 9.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by software plus the necessary general-purpose hardware platform, or by hardware alone; in many cases, however, the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, or the part thereof that contributes to the prior art, may be embodied in the form of a software product stored on a storage medium (e.g., ROM/RAM, magnetic disk, or optical disk) that includes instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the method according to the embodiments of the present invention.
Example 4
The present embodiment provides a training method for a neural network model, which is used to implement the training of the third neural network model described in embodiment 2. Fig. 10 is a flowchart of the training method of the neural network model provided in this embodiment of the present invention. As shown in fig. 10, the training method in this embodiment includes:
S402: acquiring a sample reference signal and a sample sound source signal; wherein the sample sound source signal is obtained by superposing a sample far-end signal and a sample near-end signal, and the sample far-end signal is used to indicate a sample echo signal;
S404: determining a first label and a second label; wherein the first label is used to indicate the probability of audio existence in the sample reference signal, and the second label is used to indicate the probability of existence of at least part of the audio in the sample sound source signal;
S406: establishing the third neural network model according to the relation between the sample reference signal and the first label and the relation between the sample sound source signal and the second label.
In this embodiment, the sample reference signal, the sample sound source signal, the sample far-end signal, and the sample near-end signal correspond to the reference signal, the sound source signal, the far-end signal, and the near-end signal described in embodiment 1, respectively; that is, they are a plurality of samples of the reference signal, the sound source signal, the far-end signal, and the near-end signal, respectively.
In an alternative embodiment, in step S402, acquiring the sample sound source signal includes:
acquiring a sample reference signal, and obtaining a sample far-end signal according to the sample reference signal and a preset room impulse response;
acquiring a sample pure audio signal and a preset sample noise signal, and overlapping the sample pure audio signal and the sample noise signal to obtain a sample near-end signal;
The sample far-end signal and the sample near-end signal are superimposed to obtain a sample sound source signal.
It should be further noted that, in the above-mentioned alternative embodiment, the sample far-end signal may be obtained either by convolving the sample reference signal with the preset room impulse response in the time domain, or by multiplying the sample reference signal and the room impulse response in the frequency domain.
In an alternative embodiment, the room impulse response is generated by a preset room impulse response generating unit, wherein the room impulse response generating unit is formed by connecting a linear filter and a nonlinear filter in series.
It should be further noted that the room impulse response generating unit in the above alternative embodiment corresponds to the room impulse response generating unit in embodiment 1, so that a detailed description thereof is omitted herein.
In an optional embodiment, in step S404, determining the first tag and the second tag includes:
framing the sample reference signal, and determining the reference audio energy corresponding to each frame of audio in the sample reference signal;
determining the existence probability of each frame of audio in the sample reference signal according to the relation between the reference audio energy and a preset threshold value, and marking the existence probability of each frame of audio in the sample reference signal as the first label;
performing framing processing on the sample pure audio signal, and determining the sound source audio energy corresponding to each frame of audio in the sample pure audio signal;
and determining the existence probability of each frame of audio in the sample pure audio signal according to the relation between the sound source audio energy and the preset threshold value, and setting the existence probability of each frame of audio in the sample pure audio signal as the second label.
It should be further noted that in the above alternative embodiment, the audio energy of each frame of the sample reference signal or of the sample pure audio signal is compared with the corresponding threshold values, so as to determine the probability of audio existence in that signal, i.e., the first label or the second label. Specifically, two threshold values may be set as the preset threshold values. Taking the sample reference signal as an example, the energy of each frame of the audio signal is compared with the two thresholds: if the energy is greater than threshold 2, the frame is marked as 1; if the energy is greater than threshold 1 but less than threshold 2, the frame is marked as 0.5; and if the energy is less than threshold 1, the frame is marked as 0. The values 0, 0.5, and 1 thus serve as the label of the corresponding frame of the sample reference signal, indicating the existence probability of audio in that frame. The second label is obtained from the sample pure audio signal in the same manner, which is not repeated here.
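A minimal sketch of this frame-energy labeling scheme (the frame length and the two threshold values are arbitrary example values, not taken from the patent):

```python
import numpy as np

def frame_energies(signal, frame_len=160):
    # Split the signal into non-overlapping frames and compute
    # the energy (sum of squares) of each frame.
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).sum(axis=1)

def presence_labels(energies, thr1, thr2):
    # Two thresholds yield three labels: 0 (audio absent),
    # 0.5 (uncertain), 1 (audio present).
    return np.where(energies > thr2, 1.0,
                    np.where(energies > thr1, 0.5, 0.0))

# Silence, a quiet segment, then a loud segment, one frame each.
sig = np.concatenate([np.zeros(160), 0.5 * np.ones(160), 2.0 * np.ones(160)])
e = frame_energies(sig)                       # per-frame energies
labels = presence_labels(e, thr1=0.1, thr2=100.0)
```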
On the basis of the determined existence probability of each frame of audio signal in the sample reference signal or the sample pure audio signal, a sigmoid activation function can be used to complete the probability calculation for the sample reference signal or the sample pure audio signal, thereby obtaining the corresponding first label and second label. The sigmoid activation function is expressed as: σ(x) = 1 / (1 + e^(−x)).
the above-mentioned process of performing probability calculation based on the sigmoid activation function is known to those skilled in the art, and will not be described herein.
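For completeness, the sigmoid function used in the probability calculation, in executable form:

```python
import numpy as np

def sigmoid(x):
    # Logistic activation: maps any real-valued score to a
    # probability in the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

p = sigmoid(np.array([-10.0, 0.0, 10.0]))
```

Large negative scores map near 0, large positive scores near 1, and 0 maps to exactly 0.5.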
Fig. 11 is a training schematic diagram of a training method of a neural network model according to an embodiment of the present invention, where a training process indicated by the training method of the neural network model is shown in fig. 11.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by software plus the necessary general-purpose hardware platform, or by hardware alone; in many cases, however, the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, or the part thereof that contributes to the prior art, may be embodied in the form of a software product stored on a storage medium (e.g., ROM/RAM, magnetic disk, or optical disk) that includes instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the method according to the embodiments of the present invention.
Example 5
The echo cancellation device provided in this embodiment is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated here. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 12 is a block diagram of an echo cancellation device according to an embodiment of the present invention, and as shown in fig. 12, the echo cancellation device in this embodiment includes:
an estimation module 502, configured to estimate an echo signal in the sound source signal according to the reference signal and the echo detection information, so as to obtain echo estimation information; the sound source signal is an audio signal received by an audio input channel of the terminal, the reference signal is an audio signal in an audio output channel of the terminal, and the echo detection information is used for indicating the probability of existence of an echo signal in the sound source signal;
the cancellation module 504 is configured to obtain output information according to the sound source signal, the echo estimation information, and a preset first neural network model, and cancel the echo signal in the sound source signal according to the output information; the first neural network model is trained according to the sample sound source signal, the sample echo signal and the sample output information.
It should be further noted that the other optional implementations and technical effects of the echo cancellation device in this embodiment correspond to those of the echo cancellation method in embodiment 2, and are therefore not repeated here.
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; alternatively, the above modules may be located in different processors in any combination.
Example 6
Embodiments of the present invention also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for executing the above-described embodiments.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing a computer program.
Example 7
An embodiment of the invention also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the steps in the above-described embodiment by a computer program.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments and optional implementations, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by a general-purpose computing device. They may be concentrated on a single computing device or distributed across a network of computing devices, and they may be implemented in program code executable by computing devices, so that they may be stored in a storage device and executed by computing devices; in some cases, the steps shown or described may be performed in an order different from that described here. Alternatively, they may be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the principle of the present invention shall fall within the protection scope of the present invention.
Claims (18)
1. An echo cancellation method, comprising:
estimating an echo signal in the sound source signal according to the reference signal and the echo detection information to obtain echo estimation information; the sound source signal is an audio signal received by an audio input channel of the terminal, the reference signal is an audio signal in an audio output channel of the terminal, and the echo detection information is used for indicating the probability of existence of the echo signal in the sound source signal;
obtaining output information according to the sound source signal, the echo estimation information and a preset first neural network model, and eliminating echo signals in the sound source signal according to the output information; the first neural network model is obtained by training according to a sample sound source signal, a sample echo signal and sample output information;
Wherein the estimating the echo signal in the sound source signal according to the reference signal and the echo detection information to obtain the echo estimation information comprises: obtaining echo estimation information according to the sound source signal, the reference signal, the echo detection information and a second neural network model; the second neural network model is obtained by training according to the sample sound source signal, the sample reference signal, the sample echo detection information and the sample echo signal;
wherein the second neural network model is a recurrent neural network model.
2. The method of claim 1, wherein before obtaining the echo estimation information from the sound source signal, the reference signal, the echo detection information, and the second neural network model, further comprising:
obtaining the echo detection information according to the sound source signal, the reference signal and a preset third neural network model;
the third neural network model is obtained by training according to the sample sound source signal, the sample reference signal and the sample echo detection information.
3. The method of claim 1, wherein the obtaining output information according to the sound source signal, the echo estimation information and a preset first neural network model includes:
Dividing the sound source signal into a plurality of frequency bands according to a preset frequency band dividing mode, and determining a frequency band gain coefficient corresponding to each frequency band in the sound source signal according to the echo estimation information and the first neural network model; the frequency band gain coefficient is the output information;
and multiplying the sound source signal corresponding to each frequency point in each frequency band by the gain coefficient corresponding to the frequency point to perform echo cancellation processing so as to obtain the sound source signal for canceling the echo signal.
4. The method of claim 1, wherein the obtaining output information according to the sound source signal, the echo estimation information and a preset first neural network model includes:
dividing the sound source signal into a plurality of frequency bands according to a preset frequency band dividing mode, and determining a frequency band gain coefficient corresponding to each frequency band in the sound source signal according to the echo estimation information and the first neural network model;
determining a frequency point gain coefficient corresponding to each frequency point in each frequency band of the sound source signal according to the frequency band gain coefficient; the frequency point gain coefficient is the output information;
And carrying out echo cancellation processing on the sound source signals corresponding to each frequency point in each frequency band according to the frequency point gain coefficients so as to obtain sound source signals for canceling the echo signals.
5. The method of claim 2, wherein the sound source signal comprises: a near-end signal, or a near-end signal and a far-end signal; wherein the far-end signal is used to indicate the echo signal in the sound source signal;
before the echo detection information is obtained according to the sound source signal, the reference signal and a preset third neural network model, the method further comprises:
detecting whether the sound source signal comprises a far-end signal or not according to the sound source signal, the reference signal and the third neural network model;
outputting the sound source signal to an audio output channel of the terminal in a case where the sound source signal includes only the near-end signal; or,
and under the condition that the sound source signal comprises the near-end signal and the far-end signal, obtaining the echo detection information according to the sound source signal, the reference signal and the third neural network model.
6. The method according to claim 1, wherein the method further comprises:
Acquiring the sample sound source signal; the sample sound source signal is obtained by superposing a sample far-end signal and a sample near-end signal, and the sample far-end signal is used for indicating a sample echo signal; determining a gain coefficient of the sample sound source signal, and taking the gain coefficient as sample output information;
and establishing the first neural network model according to the relation between the sample sound source signal and the sample output information.
7. The method of claim 6, wherein the determining the gain factor of the sample acoustic source signal comprises:
acquiring first frequency band energy of a sample pure audio signal and acquiring second frequency band energy of a sample sound source signal;
and determining a gain coefficient of the sample sound source signal according to the first frequency band energy and the second frequency band energy.
8. The method according to claim 2, wherein the method further comprises:
acquiring the sample reference signal and the sample sound source signal; the sample sound source signal is obtained by superposing a sample far-end signal and a sample near-end signal, and the sample far-end signal is used for indicating a sample echo signal;
Determining a first label and a second label; wherein the first tag is used for indicating the probability of the existence of audio in the sample reference signal, and the second tag is used for indicating the probability of the existence of at least part of audio in the sample sound source signal;
and establishing the third neural network model according to the relation between the sample reference signal and the first label and the relation between the sample sound source signal and the second label.
9. The method according to claim 6 or 8, wherein said acquiring said sample sound source signal comprises:
acquiring the sample reference signal, and obtaining the sample far-end signal according to the sample reference signal and a preset room impulse response;
acquiring a sample pure audio signal and a preset sample noise signal, and overlapping the sample pure audio signal and the sample noise signal to obtain a sample near-end signal;
the sample far-end signal and the sample near-end signal are superimposed to obtain the sample sound source signal.
10. The method of claim 9, wherein the room impulse response is generated by a preset room impulse response generation unit, wherein the room impulse response generation unit is formed by connecting a linear filter and a nonlinear filter in series.
11. The method of claim 8, wherein the determining the first tag and the second tag comprises:
framing the sample reference signal, and determining the reference audio energy corresponding to each frame of audio in the sample reference signal;
determining the existence probability of each frame of audio in the sample reference signal according to the relation between the reference audio energy and a preset threshold value, and marking the existence probability of each frame of audio in the sample reference signal as the first label;
carrying out framing treatment on the sample pure audio signal, and determining sound source audio energy corresponding to each frame of audio in the sample pure audio signal;
and determining the existence probability of each frame of audio in the sample pure audio signal according to the relation between the sound source audio energy and a preset threshold value, and setting the existence probability of each frame of audio in the sample pure audio signal as a second label.
12. The method according to claim 2, wherein the method further comprises:
acquiring the sound source signal and the reference signal, determining sound source characteristics according to the sound source signal, and determining reference characteristics according to the reference signal;
Obtaining the echo detection information according to the sound source characteristics, the reference characteristics and the third neural network model;
obtaining the echo estimation information according to the sound source characteristics, the reference characteristics, the echo detection information and the second neural network model;
and obtaining the output information according to the sound source characteristics, the echo estimation information and the first neural network model.
13. The method of claim 12, wherein the sound source signature comprises at least a sound source frequency domain signature and a sound source tone signature; the reference features include at least reference frequency domain features and reference pitch features;
the method for obtaining the sound source signal and the reference signal, determining the sound source characteristic according to the sound source signal, determining the reference characteristic according to the reference signal, and further comprises:
acquiring the sound source signal and the reference signal, and respectively carrying out frequency division windowing processing on the sound source signal and the reference signal;
transforming the processed sound source signal to a frequency domain to extract the sound source frequency domain characteristics, and performing pitch analysis on the processed sound source signal to determine the sound source pitch characteristics;
Transforming the processed reference signal to the frequency domain to extract the reference frequency domain features, and performing a pitch analysis on the processed reference signal to determine the reference pitch features.
14. The method of claim 13, wherein the sound source frequency domain features comprise at least: a plurality of Bark-frequency cepstral coefficient (BFCC) frequency domain features of the sound source signal, first order differential information of the plurality of BFCC frequency domain features of the sound source signal, and second order differential information of the plurality of BFCC frequency domain features of the sound source signal;
the sound source tone characteristics include at least: discrete cosine transform (DCT) information of a plurality of operation coefficients corresponding to the tone of the sound source signal, a tone period dynamic characteristic of the sound source signal, and a tone spectrum dynamic characteristic of the sound source signal;
the reference frequency domain features include at least: a plurality of BFCC frequency domain features of the reference signal, first order differential information of a plurality of the BFCC frequency domain features of the reference signal, second order differential information of a plurality of the BFCC frequency domain features of the reference signal;
the reference tone characteristic includes at least: DCT information of a plurality of operation coefficients corresponding to the tone of the reference signal, the tone period dynamic characteristic of the reference signal, and the tone spectrum dynamic characteristic of the reference signal.
15. The method of claim 13, further comprising, after obtaining output information from the sound source signal, the echo estimation information, and a predetermined first neural network model:
acquiring a first output audio signal, filtering the first output audio signal according to the sound source tone characteristic, and converting the filtered first output audio signal into a time domain to obtain a second output audio signal; wherein the first output audio signal is for indicating a sound source signal from which the echo signal is cancelled;
and outputting the second output audio signal to an audio output channel of the terminal.
16. An echo cancellation device, comprising:
the estimation module is used for estimating the echo signals in the sound source signals according to the reference signals and the echo detection information so as to obtain echo estimation information; the sound source signal is an audio signal received by an audio input channel of the terminal, the reference signal is an audio signal in an audio output channel of the terminal, and the echo detection information is used for indicating the probability of existence of the echo signal in the sound source signal;
The elimination module is used for obtaining output information according to the sound source signal, the echo estimation information and a preset first neural network model, and eliminating echo signals in the sound source signal according to the output information; the first neural network model is obtained by training according to a sample sound source signal, a sample echo signal and sample output information;
the echo cancellation device is further configured to obtain the echo estimation information according to the sound source signal, the reference signal, the echo detection information and a second neural network model; the second neural network model is obtained by training according to the sample sound source signal, the sample reference signal, the sample echo detection information and the sample echo signal; wherein the second neural network model is a recurrent neural network model.
17. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program, wherein the computer program is arranged to execute the method of any of the claims 1 to 15 when run.
18. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the method of any of the claims 1 to 15.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010693855.2A CN111883154B (en) | 2020-07-17 | 2020-07-17 | Echo cancellation method and device, computer-readable storage medium, and electronic device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111883154A CN111883154A (en) | 2020-11-03 |
CN111883154B true CN111883154B (en) | 2023-11-28 |
Family
ID=73154796
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010693855.2A Active CN111883154B (en) | 2020-07-17 | 2020-07-17 | Echo cancellation method and device, computer-readable storage medium, and electronic device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111883154B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112687288B (en) * | 2021-03-12 | 2021-12-03 | 北京世纪好未来教育科技有限公司 | Echo cancellation method, echo cancellation device, electronic equipment and readable storage medium |
CN113707166B (en) * | 2021-04-07 | 2024-06-07 | 腾讯科技(深圳)有限公司 | Voice signal processing method, device, computer equipment and storage medium |
CN113421579B (en) * | 2021-06-30 | 2024-06-07 | 北京小米移动软件有限公司 | Sound processing method, device, electronic equipment and storage medium |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1940139A2 (en) * | 2006-12-28 | 2008-07-02 | France Telecom | Control of echo suppression filters |
WO2014099281A1 (en) * | 2012-12-20 | 2014-06-26 | Dolby Laboratories Licensing Corporation | Method for controlling acoustic echo cancellation and audio processing apparatus |
KR101558397B1 (en) * | 2014-05-23 | 2015-11-23 | 서강대학교산학협력단 | Reverberation Filter Estimation Method and Dereverberation Filter Estimation Method, and A Single-Channel Speech Dereverberation Method Using the Dereverberation Filter |
CN109686381A (en) * | 2017-10-19 | 2019-04-26 | 恩智浦有限公司 | Signal processor and correlation technique for signal enhancing |
CN109841206A (en) * | 2018-08-31 | 2019-06-04 | 大象声科(深圳)科技有限公司 | A kind of echo cancel method based on deep learning |
CN110246515A (en) * | 2019-07-19 | 2019-09-17 | 腾讯科技(深圳)有限公司 | Removing method, device, storage medium and the electronic device of echo |
CN110444214A (en) * | 2017-11-24 | 2019-11-12 | 深圳市腾讯计算机系统有限公司 | Speech processing model training method, device, electronic equipment and storage medium |
WO2020018667A1 (en) * | 2018-07-18 | 2020-01-23 | Google Llc | Echo detection |
CN110956976A (en) * | 2019-12-17 | 2020-04-03 | 苏州科达科技股份有限公司 | Echo cancellation method, device, equipment and readable storage medium |
CN111161752A (en) * | 2019-12-31 | 2020-05-15 | 歌尔股份有限公司 | Echo cancellation method and device |
CN111261179A (en) * | 2018-11-30 | 2020-06-09 | 阿里巴巴集团控股有限公司 | Echo cancellation method and device and intelligent equipment |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8447596B2 (en) * | 2010-07-12 | 2013-05-21 | Audience, Inc. | Monaural noise suppression based on computational auditory scene analysis |
GB201501791D0 (en) * | 2015-02-03 | 2015-03-18 | Microsoft Technology Licensing Llc | Non-linear echo path detection |
US9672821B2 (en) * | 2015-06-05 | 2017-06-06 | Apple Inc. | Robust speech recognition in the presence of echo and noise using multiple signals for discrimination |
US10186279B2 (en) * | 2016-06-21 | 2019-01-22 | Revx Technologies | Device for detecting, monitoring, and cancelling ghost echoes in an audio signal |
US10074380B2 (en) * | 2016-08-03 | 2018-09-11 | Apple Inc. | System and method for performing speech enhancement using a deep neural network-based signal |
US10192567B1 (en) * | 2017-10-18 | 2019-01-29 | Motorola Mobility Llc | Echo cancellation and suppression in electronic device |
US20190222691A1 (en) * | 2018-01-18 | 2019-07-18 | Knowles Electronics, Llc | Data driven echo cancellation and suppression |
Non-Patent Citations (5)
Title |
---|
Acoustic recurrent neural networks echo cancellers; P.H.G. Coelho; IEEE; full text *
Research and implementation of a real-time echo cancellation algorithm for conference calls; Chen Lin; China Masters' Theses Full-text Database, Information Science and Technology Series; full text *
Echo and noise suppression algorithm based on BLSTM neural networks; Wang Dongxia et al.; Journal of Signal Processing; Vol. 6 (No. 36); full text *
Two-stage joint acoustic echo and reverberation suppression based on deep learning; Luan Shuming; Cheng Longbiao; Sun Xingwei; Li Junfeng; Yan Yonghong; Signal Processing (No. 06); full text *
A fast echo cancellation algorithm for smart speakers; Zhang Wei; Wang Dongxia; Yu Ling; Journal of Computer Applications (No. 04); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111883154B (en) | Echo cancellation method and device, computer-readable storage medium, and electronic device | |
CN111161752B (en) | Echo cancellation method and device | |
CN111885275B (en) | Echo cancellation method and device for voice signal, storage medium and electronic device | |
CN112863535B (en) | Residual echo and noise elimination method and device | |
CN112700786B (en) | Speech enhancement method, device, electronic equipment and storage medium | |
CN111768796A (en) | Acoustic echo cancellation and dereverberation method and device | |
CN110503967B (en) | Voice enhancement method, device, medium and equipment | |
CN112037809A (en) | Residual echo suppression method based on multi-feature flow structure deep neural network | |
CN108922514B (en) | Robust feature extraction method based on low-frequency log spectrum | |
CN113707167A (en) | Training method and training device for residual echo suppression model | |
CN113744748A (en) | Network model training method, echo cancellation method and device | |
CN113782044B (en) | Voice enhancement method and device | |
Xu et al. | U-former: Improving monaural speech enhancement with multi-head self and cross attention | |
CN114530160A (en) | Model training method, echo cancellation method, system, device and storage medium | |
Astudillo et al. | Uncertainty propagation | |
CN117219102A (en) | Low-complexity voice enhancement method based on auditory perception | |
CN111462770A (en) | L STM-based late reverberation suppression method and system | |
CN116312561A (en) | Method, system and device for voice print recognition, authentication, noise reduction and voice enhancement of personnel in power dispatching system | |
CN112133324A (en) | Call state detection method, device, computer system and medium | |
CN115620737A (en) | Voice signal processing device, method, electronic equipment and sound amplification system | |
CN114220451A (en) | Audio denoising method, electronic device, and storage medium | |
CN114827363A (en) | Method, device and readable storage medium for eliminating echo in call process | |
CN113763978A (en) | Voice signal processing method, device, electronic equipment and storage medium | |
Unoki et al. | MTF-based power envelope restoration in noisy reverberant environments | |
Kim et al. | Spectral distortion model for training phase-sensitive deep-neural networks for far-field speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||