WO2023040322A1

WO2023040322A1 - Echo cancellation method, and terminal device and storage medium

Info

Publication number: WO2023040322A1
Application number: PCT/CN2022/093959
Authority: WO
Inventors: 王清泉
Original assignee: 中兴通讯股份有限公司
Priority date: 2021-09-14
Filing date: 2022-05-19
Publication date: 2023-03-23
Also published as: CN115810361A

Abstract

An echo cancellation method, and a terminal device and a storage medium. The echo cancellation method comprises: a first terminal device acquiring first voice data output from a second terminal device (S100); performing separation processing on the first voice data by means of a time-domain audio separation model, so as to obtain a plurality of pieces of branch voice data (S200); respectively performing relevancy determination on each of the plurality of pieces of branch voice data and a plurality of reference signals at different moments, so as to determine a residual echo signal (S300); and filtering the residual echo signal in the first voice data to obtain target voice data (S400).

Description

Echo cancellation method, terminal equipment and storage medium

Cross References to Related Applications

This application is based on a Chinese patent application with application number 202111073754.6 and a filing date of September 14, 2021, and claims the priority of this Chinese patent application. The entire content of this Chinese patent application is hereby incorporated by reference into this application.

Technical field

The embodiments of the present application relate to but are not limited to the communication field, and in particular, relate to an echo cancellation method, a terminal device, and a storage medium.

Background

Echo cancellation is a key technology for real-time voice transmission. Without it, the local end continually hears the voice it has transmitted to the peer end during voice interaction, which greatly degrades the listening experience and can even make effective communication impossible. However, a traditional echo cancellation algorithm is limited by the length of its filter: if the echo tail exceeds the length the filter supports, the echo cannot be effectively eliminated. In addition, when the delay changes or jitters, the filter parameters take a certain amount of time to converge, and echo leakage occurs during that time.

Summary of the invention

The following is an overview of the topics described in detail in this article. This summary is not intended to limit the scope of the claims.

The main purpose of the embodiments of the present application is to provide an echo cancellation method, a terminal device, and a storage medium.

In a first aspect, an embodiment of the present application provides an echo cancellation method applied to a first terminal device. The method includes: acquiring first voice data output from a second terminal device, the first voice data being the voice data returned by the second terminal device after receiving second voice data from the first terminal device; performing separation processing on the first voice data by means of a time-domain audio separation model to obtain a plurality of pieces of branch voice data; performing correlation determination between each piece of branch voice data and a plurality of reference signals at different moments to determine a residual echo signal; and filtering the residual echo signal out of the first voice data to obtain target voice data.

In a second aspect, an embodiment of the present application provides an echo cancellation device, including: an acquisition module configured to acquire first voice data output from a second terminal device, the first voice data being the voice data returned by the second terminal device after receiving second voice data from the first terminal device; a separation module configured to separate the first voice data by means of a time-domain audio separation model to obtain a plurality of pieces of branch voice data; a judging module configured to perform correlation determination between the plurality of pieces of branch voice data and the reference signal of the first voice data to determine a residual echo signal; and a filtering module configured to filter the residual echo signal out of the first voice data to obtain target voice data.

In a third aspect, an embodiment of the present application provides a terminal device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the echo cancellation method described in the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions, and the computer-executable instructions are used to execute the echo cancellation method described in the first aspect.

Additional features and advantages of the application will be set forth in the description which follows, and, in part, will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Brief description of the drawings

FIG. 1 is a schematic diagram of a system architecture platform for performing an echo cancellation method provided by an embodiment of the present application;

FIG. 2 is a flowchart of an echo cancellation method provided by an embodiment of the present application;

FIG. 3 is a flowchart of determining a residual echo signal in an echo cancellation method provided by an embodiment of the present application;

FIG. 4 is another flowchart of an echo cancellation method provided by an embodiment of the present application;

FIG. 5 is a flowchart of an echo cancellation algorithm for adaptive filtering in an echo cancellation method provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of an echo cancellation device provided by an embodiment of the present application;

FIG. 7 is a schematic diagram of a terminal device provided by an embodiment of the present application.

Detailed description

In order to make the purpose, technical solution and advantages of the present application clearer, the present application will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, not to limit the present application.

It should be noted that although functional modules are divided in the schematic diagram of the device and a logical sequence is shown in the flowchart, in some cases the steps shown or described may be executed with a module division different from that of the device, or in an order different from that of the flowchart. The terms "first", "second" and the like in the specification, claims and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.

The embodiments of the present application provide an echo cancellation method, a terminal device, and a storage medium. The echo cancellation method includes but is not limited to the following steps: the first terminal device acquires first voice data output from the second terminal device; the first voice data is separated by a time-domain audio separation model to obtain a plurality of pieces of branch voice data; correlation determination is performed between each piece of branch voice data and a plurality of reference signals at different moments to determine a residual echo signal; and the residual echo signal in the first voice data is filtered out to obtain target voice data. Because the time-domain audio separation model responds quickly and in real time, the branch voice data it separates from the first voice data can be compared against the reference signals to determine the residual echo signal, which is then filtered out. This solves the problem of echo leakage and yields more accurate, higher-quality target voice data.

The technical solutions of the embodiments of the present application will be introduced below with reference to the accompanying drawings.

FIG. 1 is a functional block diagram of a system architecture platform 100 applicable to an embodiment of the present application. In one embodiment, the system architecture platform 100 includes a first terminal device 110 and a second terminal device 120. The first terminal device 110 includes a first echo cancellation device 111, and the second terminal device 120 includes a speaker 121, a pickup 122, and a second echo cancellation device 123.

The echo cancellation method provided by the embodiment of the present application is applied to the system architecture platform 100 shown in FIG. 1. The terminal devices shown in FIG. 1 (the first terminal device 110 and the second terminal device 120) can be personal computers (PCs), mobile phones, set-top boxes, smart speakers, smart TVs, and other devices. The speaker 121 and the pickup 122 may be built directly into the terminal device, as in a mobile phone. Alternatively, the terminal device may be connected to an external speaker 121 and pickup 122, as with a personal computer connected to a speaker and a pickup, or a set-top box connected to a TV that serves as the audio and video playback device. It can be understood that the first terminal device 110 and the second terminal device 120 may be the same kind of device or different kinds of devices, which is not specifically limited in this embodiment.

The first terminal device 110 is configured to send an audio signal (the second voice data and/or the reference signal) to the second terminal device 120, and the first echo cancellation device 111 of the first terminal device 110 is configured to perform residual echo cancellation processing on the first voice signal transmitted by the second terminal device 120.

The second terminal device 120 is configured to output the second voice data received from the first terminal device 110 to the speaker 121, and also to output the reference signal to the speaker 121. The reference signal is usually a high-frequency signal whose frequency lies above the range audible to the human ear. The human ear can generally hear sound from 20 Hz to 20,000 Hz, so the frequency of the reference signal can be chosen above 20,000 Hz.
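As an illustration of the inaudible reference signal described above, the following minimal sketch generates a tone above 20,000 Hz. The function name `make_reference_tone` and the specific values (21 kHz tone, 48 kHz sample rate, 20 ms duration, amplitude 0.1) are assumptions chosen for the example and are not taken from the application. The only hard constraint it encodes is that the tone frequency must stay below half the sample rate.

```python
import numpy as np


def make_reference_tone(freq_hz=21_000.0, sample_rate=48_000,
                        duration_s=0.02, amplitude=0.1):
    """Return a sine tone above the audible band (all defaults are illustrative)."""
    if freq_hz >= sample_rate / 2:
        # A tone at or above the Nyquist frequency cannot be represented.
        raise ValueError("tone frequency must be below half the sample rate")
    num_samples = int(round(duration_s * sample_rate))
    t = np.arange(num_samples) / sample_rate
    return amplitude * np.sin(2 * np.pi * freq_hz * t)


tone = make_reference_tone()
# Locate the strongest spectral component to confirm where the energy sits.
spectrum = np.abs(np.fft.rfft(tone))
peak_hz = np.fft.rfftfreq(tone.size, d=1 / 48_000)[np.argmax(spectrum)]
```

Whether a given speaker and pickup can actually reproduce and capture such a tone depends on the hardware, which this sketch does not model.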

The second echo cancellation device 123 of the second terminal device 120 is configured to collect the audio input signal of the pickup 122 (including the sound output by the speaker 121 according to the second voice signal and the voice of the user), and to eliminate the echo mixed into that audio input signal, so as to obtain the first voice data.

The speaker 121 is set to play the signal acquired by the second terminal device 120 from the first terminal device 110 , including the second voice signal and/or the reference signal. The sound of the played second voice signal can be listened to by the user, but the sound of the played audio reference signal cannot be heard by the user, so that the user experience will not be affected. The sound of the second voice signal or the audio reference signal output by the speaker 121 will propagate to the pickup 122 to generate an echo.

The sound pickup 122 is configured to receive a mixed voice signal, the mixed voice signal at least includes a voice signal uttered by the user and a second voice signal output by the speaker 121 of the second terminal device 120 . The echo of the second voice signal output by the speaker 121 or the echo of the reference signal may be mixed in the sound received by the pickup 122.

The second voice signal and the reference signal output by the loudspeaker 121 will generate an echo in the sound pickup 122, and the reasons for this include diffraction and reflection of the sound. The echo signal can be regarded as the sound signal after the audio signal passes through the echo channel. The impact of the echo channel on sound includes: delay in time and attenuation in energy. In general, the echo channel has a similar effect on the audio content signal as it does on the audio reference signal. Therefore, the reference signal can be analyzed to obtain characteristic parameters of the echo channel, including time delay and attenuation coefficient, and then the echo of the audio content signal can be eliminated by using these two characteristic parameters of the echo channel.
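The two characteristic parameters named above, time delay and attenuation coefficient, can be estimated from a known reference signal by cross-correlation. The sketch below is one possible way to do this and is not taken from the application: the function name `estimate_echo_channel` is hypothetical, the echo channel is assumed to be a single delayed and scaled copy of the reference, and the example uses a wideband noise burst rather than a pure tone so that the correlation peak is unambiguous.

```python
import numpy as np


def estimate_echo_channel(reference, captured):
    """Estimate the delay (in samples) and attenuation of `reference` in `captured`.

    Assumes captured[n] is approximately attenuation * reference[n - delay].
    """
    reference = np.asarray(reference, dtype=float)
    captured = np.asarray(captured, dtype=float)
    # Full cross-correlation; entry k corresponds to lag k - (len(reference) - 1).
    xcorr = np.correlate(captured, reference, mode="full")
    lags = np.arange(-(reference.size - 1), captured.size)
    causal = lags >= 0  # an echo can only arrive after it was played
    best = np.argmax(np.abs(xcorr[causal]))
    delay = int(lags[causal][best])
    # Least-squares gain of the reference at the chosen lag.
    overlap = min(reference.size, captured.size - delay)
    ref_seg = reference[:overlap]
    cap_seg = captured[delay:delay + overlap]
    attenuation = float(np.dot(cap_seg, ref_seg) / np.dot(ref_seg, ref_seg))
    return delay, attenuation


# Synthetic check: an echo delayed by 40 samples and attenuated to 0.3.
rng = np.random.default_rng(0)
reference = rng.standard_normal(256)
captured = np.zeros(512)
captured[40:40 + reference.size] += 0.3 * reference
delay, attenuation = estimate_echo_channel(reference, captured)
```

A real echo channel has many reflections with different delays, so a single (delay, attenuation) pair is only a first-order description.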

Those skilled in the art can understand that the system architecture platform 100 can be applied to 3G cellular communication, such as Code Division Multiple Access (CDMA), EVDO, or Global System for Mobile Communications (GSM)/General Packet Radio Service (GPRS); to 4G cellular communication, such as Long Term Evolution (LTE); to 5G cellular communication; or to subsequently evolved mobile communication networks, which is not specifically limited in this embodiment.

Those skilled in the art can understand that the system architecture platform 100 shown in FIG. 1 does not constitute a limitation on the embodiments of the present application, and may include more or fewer components than illustrated, combine certain components, or arrange the components differently.

Based on the above system architecture platform, various embodiments of the echo cancellation method of the present application are proposed below.

FIG. 2 is a flowchart of an echo cancellation method provided by an embodiment of the present application. The echo cancellation method is applied to the first terminal device and includes but is not limited to step S100, step S200, step S300, and step S400.

Step S100: acquire first voice data output from the second terminal device, the first voice data being the voice data that the second terminal device returns after receiving the second voice data from the first terminal device.

While the first terminal device and the second terminal device are engaged in voice interaction, the first terminal device sends the second voice data to the second terminal device, and the second terminal device plays the second voice data through its speaker after receiving it. At the same time, the pickup of the second terminal device receives both the second voice data played by the speaker and other sounds. Usually, the second echo cancellation device of the second terminal device filters the echo signal out of the pickup input, generates the first voice data, and sends it to the first terminal device. The first terminal device then acquires the first voice data output from the second terminal device.

Step S200, performing separation processing on the first speech data through a time-domain audio separation model to obtain a plurality of split speech data.

The echo cancellation processing of the second terminal device is limited by the length of its filter, so the echo cannot be effectively eliminated if the echo tail exceeds the length the filter supports. In addition, when the delay changes or jitters, the filter parameters take a certain amount of time to converge; echo leakage occurs during this time and produces a residual echo signal. The first terminal device can separate the first voice data by means of the time-domain audio separation model to obtain a plurality of pieces of branch voice data. Because the model is capable of time-domain audio separation, the residual echo signal is contained among the pieces of branch voice data it produces. The time-domain audio separation model is obtained by training with mixed voice data.

It should be noted that the time-domain audio separation model can be a Time-domain Audio Separation Network (TasNet) model, a fully convolutional Time-domain Audio Separation Network (Conv-TasNet) model, or another time-domain audio separation model, which is not specifically limited in this embodiment.

Step S300: perform correlation determination between each piece of branch voice data and a plurality of reference signals at different moments to determine the residual echo signal.

To identify the residual echo signal among the plurality of pieces of branch voice data, the first terminal device may perform correlation determination between each piece of branch voice data and a plurality of reference signals at different moments, and may determine the residual echo signal based on the correlation result.

Step S400: Filter the residual echo signal in the first voice data to obtain target voice data.

The first terminal device filters the residual echo signal out of the first voice data to obtain the target voice data. The target voice data is clean voice data, which solves the problem of echo leakage.

It should be noted that the filtering method for filtering the residual echo signal in the first voice data may be spectral subtraction or other filtering methods, which is not limited in this embodiment.

It can be understood that spectral subtraction subtracts the spectrum of the noise signal from the spectrum of the noisy signal. Assuming that the noise in the speech is additive and is stationary or slowly varying, the noise spectrum can be subtracted from the noisy speech spectrum to obtain clean speech. The formula is as follows:

$$D(w) = P_s(w) - P_n(w)$$

Here, P_s(w) is the spectrum of the input noisy speech (the first voice data), P_n(w) is the spectrum of the estimated noise (the residual echo signal), and D(w) is the difference spectrum obtained by subtracting the two. Since the subtraction may produce negative values, a condition can be added that sets all negative values to 0; the result is the spectrum of the final denoised output speech (the target voice data).
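The subtraction and clamping rule above can be sketched for a single frame as follows. This is a minimal illustration under stated assumptions rather than the application's implementation: the function name `spectral_subtract` is hypothetical, the subtraction is applied to magnitude spectra, and the phase of the noisy input is reused to return to the time domain. The text does not specify either of the last two choices.

```python
import numpy as np


def spectral_subtract(noisy_frame, echo_frame):
    """Remove an estimated residual-echo frame from a noisy frame.

    Computes D(w) = P_s(w) - P_n(w) on magnitude spectra (assumption),
    clamps negative values to 0, and reuses the noisy phase (assumption).
    """
    noisy_frame = np.asarray(noisy_frame, dtype=float)
    echo_frame = np.asarray(echo_frame, dtype=float)
    noisy_spec = np.fft.rfft(noisy_frame)
    echo_spec = np.fft.rfft(echo_frame, n=noisy_frame.size)
    # Difference spectrum with the "set negative values to 0" condition.
    diff = np.maximum(np.abs(noisy_spec) - np.abs(echo_spec), 0.0)
    cleaned_spec = diff * np.exp(1j * np.angle(noisy_spec))
    return np.fft.irfft(cleaned_spec, n=noisy_frame.size)


# Synthetic frame: "speech" at bin 10 plus a residual echo at bin 40.
n = 256
t = np.arange(n)
speech = np.sin(2 * np.pi * 10 * t / n)
echo = 0.5 * np.sin(2 * np.pi * 40 * t / n)
cleaned = spectral_subtract(speech + echo, echo)
```

In this synthetic case the speech and echo occupy different frequency bins, so the subtraction recovers the speech almost exactly; with overlapping spectra the result is only an approximation.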

In an embodiment, as shown in FIG. 3, the first terminal device acquires the first voice data output from the second terminal device. The first voice data has already undergone echo cancellation at the second terminal device, but that echo cancellation is limited by the length of its filter: if the echo tail exceeds the length the filter supports, the echo cannot be effectively eliminated. In addition, when the delay changes or jitters, the filter parameters take a certain amount of time to converge, during which echo leaks and a residual echo signal is produced. The first voice data therefore carries a residual echo signal. The first terminal device can then use the TasNet model to separate the first voice data into a plurality of pieces of branch voice data. Because the TasNet model is capable of time-domain audio separation, the residual echo signal is contained among those pieces. To identify the residual echo signal among them, the first terminal device can perform correlation determination between the pieces of branch voice data and the reference signal of the first voice data, determine the residual echo signal from the correlation result, and then filter the residual echo signal out of the first voice data to obtain the target voice data, which is clean voice data. In this way, the traditional echo cancellation algorithm continues to cover most normal usage scenarios, while the echo leakage it produces during delay changes and jitter is compensated for. At the same time, the first terminal device has a local echo cancellation mechanism that does not depend on the remote echo cancellation effect, which safeguards the voice quality at the local end and solves the echo leakage problem.

Referring to FIG. 4, in one embodiment, step S300 includes but is not limited to steps S410 to S430.

Step S410, performing correlation calculations on each of the plurality of branch voice data and a plurality of reference signals at different times to obtain a reference signal with the maximum correlation;

Step S420, when the maximum correlation degree is greater than the preset threshold, determine the reference signal corresponding to the maximum correlation degree as the target reference signal;

Step S430, determining the divided voice data corresponding to the target reference signal as the residual echo signal.

Assume that the reference signal is s(t) and the branch voice data is yn(t), where n denotes the n-th channel signal. Since the residual echo in yn(t) is delayed relative to s(t), historical information of s(t) can be retained, with si(t) denoting the reference signal i frames before the current moment. The si(t) corresponding to yn(t) is then estimated through speech-frame correlation: all si(t) are traversed, the correlation of each with yn(t) is calculated, and the si(t) with the maximum correlation is obtained. If that correlation is greater than the preset threshold cohne, the si(t) with the maximum correlation is considered a valid reference frame, and the yn(t) with the highest correlation is the residual echo signal. The si(t) and yn(t) with the highest correlation are obtained by an exhaustive traversal over all i × n combinations of si(t) and yn(t).
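Steps S410 to S430 and the exhaustive traversal described above can be sketched as follows. The names `normalized_correlation` and `find_residual_echo` are hypothetical, and a plain normalized inner product stands in for whatever correlation measure an implementation actually uses; the threshold value in the usage is likewise illustrative.

```python
import numpy as np


def normalized_correlation(a, b):
    """Absolute normalized inner product of two frames, in [0, 1]."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(abs(np.dot(a, b)) / denom)


def find_residual_echo(branches, reference_history, threshold):
    """Traverse every (y_n, s_i) pair and return (n, i, correlation).

    `branches` holds the separated frames y_n(t); `reference_history` holds
    the delayed reference frames s_i(t). Returns None when the maximum
    correlation does not exceed `threshold` (no valid reference frame).
    """
    best = None
    for n, y_n in enumerate(branches):                 # S410: i x n traversal
        for i, s_i in enumerate(reference_history):
            c = normalized_correlation(y_n, s_i)
            if best is None or c > best[2]:
                best = (n, i, c)
    if best is None or best[2] <= threshold:           # S420: threshold test
        return None
    return best                                        # S430: branch n is the echo


rng = np.random.default_rng(1)
history = [rng.standard_normal(160) for _ in range(5)]
speech = rng.standard_normal(160)
echo = 0.4 * history[3] + 0.01 * rng.standard_normal(160)
result = find_residual_echo([speech, echo], history, threshold=0.5)
```

Here `result[0]` is the index of the branch judged to be the residual echo and `result[1]` is the index of the matching reference frame.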

To meet real-time requirements, the correlation can be calculated with the cross-correlation method used in Web Real-Time Communication (WebRTC), which strikes a good balance between algorithmic complexity and performance. si(t) and yn(t) are converted to the frequency domain by Fast Fourier Transform (FFT) and divided equally into 64 sub-bands, of which the 32 most important frequency bands in the spectrum (bands 12 to 43) are used. The algorithm then estimates the mean of the spectrum, threshold_spectrum, and uses it as the threshold: a band whose value is greater than the threshold is set to 1, and otherwise to 0. This efficiently yields binarized spectrum values for the far-end and near-end signals. A bitwise XOR is performed on the two values and the number of 1s is counted, which gives the correlation between si(t) and yn(t); since XOR produces a 1 wherever the two binary spectra differ, fewer 1s indicate a closer match.
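The binarized-spectrum comparison described above can be sketched as follows. This is a simplified illustration, not WebRTC's code: the names `binary_spectrum` and `binary_similarity` are hypothetical, and the threshold is taken as the mean over the 32 selected bands of a single frame, which is an assumption made for the sketch.

```python
import numpy as np

BAND_FIRST, BAND_LAST = 12, 43  # the 32 bands named in the text


def binary_spectrum(frame, num_bands=64):
    """Pack the 32 selected sub-bands of a frame into one 32-bit integer."""
    mag = np.abs(np.fft.rfft(frame))
    bins_per_band = (mag.size - 1) // num_bands
    if bins_per_band < 1:
        raise ValueError("frame too short for the requested number of bands")
    # Average the FFT bins inside each of the equally sized sub-bands.
    bands = mag[:num_bands * bins_per_band].reshape(num_bands, bins_per_band).mean(axis=1)
    selected = bands[BAND_FIRST:BAND_LAST + 1]
    # Assumption: threshold is the mean over the selected bands of this frame.
    bits = selected > selected.mean()
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return value


def binary_similarity(a_bits, b_bits, width=32):
    """Number of matching bits: width minus the popcount of the XOR."""
    differing = bin(a_bits ^ b_bits).count("1")
    return width - differing


rng = np.random.default_rng(2)
frame = rng.standard_normal(128)
bits = binary_spectrum(frame)
same = binary_similarity(bits, bits)
```

Because each frame is reduced to one integer, comparing two frames costs a single XOR and a bit count, which is what makes the method attractive for real-time use.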

In one embodiment, this fast correlation calculation can be applied in two stages. First, the correlation between each si(t) and the unseparated voice data y(t) is traversed, and the si(t) with the highest correlation is selected. Then, using that si(t) and y(t), the branch voice data yn(t) is traversed to find the yn(t) with the highest correlation with si(t), and that yn(t) is determined to be the residual echo signal.
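The two-stage variant replaces the i × n traversal with i + n correlation evaluations. The sketch below illustrates that idea under assumptions: the names `corr` and `two_stage_search` are hypothetical, a normalized inner product stands in for the fast correlation measure, and the second stage uses only the selected si(t), since the text does not detail how y(t) enters that stage.

```python
import numpy as np


def corr(a, b):
    """Absolute normalized inner product of two frames."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(abs(np.dot(a, b)) / denom)


def two_stage_search(mixed, branches, reference_history):
    """Return (i_best, n_best, correlation) using i + n evaluations.

    Stage 1 matches the unseparated frame y(t) against every s_i(t);
    stage 2 matches every branch y_n(t) against the chosen s_i(t).
    """
    i_best = max(range(len(reference_history)),
                 key=lambda i: corr(mixed, reference_history[i]))
    s_best = reference_history[i_best]
    n_best = max(range(len(branches)),
                 key=lambda n: corr(branches[n], s_best))
    return i_best, n_best, corr(branches[n_best], s_best)


rng = np.random.default_rng(4)
history = [rng.standard_normal(1024) for _ in range(6)]
speech = rng.standard_normal(1024)
echo = 0.8 * history[2]
i_best, n_best, c = two_stage_search(speech + echo, [speech, echo], history)
```

The first stage is less reliable when the echo is weak relative to the speech in y(t), which is the trade-off for the reduced cost.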

Referring to FIG. 5, in one embodiment, the first voice data is the voice data obtained after the second terminal device applies an adaptive-filter echo cancellation algorithm to the echo data generated by the second voice data. The adaptive-filter echo cancellation algorithm includes but is not limited to step S510, step S520, step S530, step S540, and step S550.

Step S510, obtaining estimated delay information by the second terminal device according to the second voice data and the reference signal;

Step S520, obtaining the estimated echo signal according to the estimated delay information, the preset adaptive filter coefficient and the reference signal;

Step S530, performing correlation calculation on the second speech data and the estimated echo signal to obtain correlation coefficients at different time points;

Step S540, determining the echo input signal according to the correlation coefficients;

Step S550, performing filtering processing on the echo input signal to obtain first voice data.

In this echo cancellation algorithm, the second terminal device obtains the estimated delay information from the second voice data and the reference signal; that is, the delay can be calculated from the correlation between the second voice data and the reference signal. A set of initial adaptive filter coefficients is then iterated, based on the reference signal, until the minimum mean square error or the upper limit on the number of iterations is reached, and this step outputs the estimated echo signal. The estimated echo signal is correlated with the input signal to obtain correlation coefficients at different time points, and a strategy is applied, such as raising the correlation coefficient to the n-th power (n >= 2) and setting a threshold on the result. Portions whose correlation coefficient is above the threshold can be judged to be near-end input, and portions whose correlation coefficient is below the threshold can be judged to be far-end echo input. A noise source is generated by a random number generation algorithm and then subjected to low-frequency filtering and amplitude limiting to produce comfort noise, which is used in the first voice data sent to the first terminal device.
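The adaptive-filter stage, which iterates filter coefficients toward the minimum mean square error and outputs an estimated echo, can be illustrated with a normalized least-mean-squares (NLMS) filter. NLMS is a common choice swapped in here for illustration; the application does not name a specific update rule. The function name `nlms_echo_cancel` and the parameter values are assumptions, and the sketch omits the delay estimation, correlation thresholding, and comfort noise steps.

```python
import numpy as np


def nlms_echo_cancel(reference, mic, num_taps=32, step=0.5, eps=1e-8):
    """Adapt an FIR echo-path model and subtract the estimated echo.

    Returns (error, echo_estimate, weights). `error` is the microphone
    signal with the estimated echo removed. Requires len(reference) >= len(mic).
    """
    reference = np.asarray(reference, dtype=float)
    mic = np.asarray(mic, dtype=float)
    weights = np.zeros(num_taps)          # initial adaptive filter coefficients
    echo_estimate = np.zeros(mic.size)
    error = np.zeros(mic.size)
    # Zero-pad so that the first samples have a full tap window.
    padded = np.concatenate([np.zeros(num_taps - 1), reference])
    for n in range(mic.size):
        x = padded[n:n + num_taps][::-1]  # x[k] = reference[n - k]
        echo_estimate[n] = weights @ x
        error[n] = mic[n] - echo_estimate[n]
        # Normalized LMS update: step toward lower squared error.
        weights += step * error[n] * x / (x @ x + eps)
    return error, echo_estimate, weights


# Synthetic echo path: 2-sample delay, then taps 0.5 and 0.25.
rng = np.random.default_rng(3)
ref = rng.standard_normal(4000)
true_path = np.array([0.0, 0.0, 0.5, 0.25])
mic = np.convolve(ref, true_path)[:ref.size]
err, est, w = nlms_echo_cancel(ref, mic)
```

The fixed `num_taps` is exactly the filter-length limit discussed in the background: an echo tail longer than the tap window cannot be modeled, and the weights need time to converge after the echo path changes.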

Based on the above-mentioned echo cancellation method, various embodiments of the echo cancellation device, the terminal device and the computer-readable storage medium of the present application are respectively proposed below.

An embodiment of the present application also provides an echo cancellation device. As shown in FIG. 6, the echo cancellation device includes an acquisition module 610, a separation module 620, a judging module 630, and a filtering module 640. The acquisition module 610 is configured to acquire first voice data output from the second terminal device, the first voice data being the voice data returned by the second terminal device after receiving the second voice data from the first terminal device. The separation module 620 is configured to separate the first voice data by means of the time-domain audio separation model to obtain a plurality of pieces of branch voice data. The judging module 630 is configured to perform correlation determination between each of the pieces of branch voice data and the reference signal of the first voice data to determine the residual echo signal. The filtering module 640 is configured to filter the residual echo signal out of the first voice data to obtain the target voice data.

The separation module 620 is also configured to separate the first speech data through the TasNet model trained according to the mixed speech data to obtain a plurality of split speech data.

The judging module 630 is also configured to perform correlation calculations between each of the pieces of branch voice data and the reference signal of the first voice data to obtain the correlation degree corresponding to each piece of branch voice data, and, when the maximum correlation degree is greater than the preset threshold, to determine the branch voice data corresponding to the maximum correlation degree as the residual echo signal.

The judging module 630 is also configured to use the cross-correlation function of WebRTC to calculate the correlation between the plurality of branch voice data and the reference signal of the first voice data respectively.

The filtering module 640 is further configured to use spectral subtraction to filter the residual echo signal in the first speech data to obtain the target speech data.

It should be noted that the echo cancellation device uses the same technical means as the echo cancellation method in the above embodiment, solves the same technical problem, and achieves the same technical effects, so these are not repeated here.

An embodiment of the present application also provides a terminal device. As shown in FIG. 7 , the terminal device 700 includes a memory 720 , a processor 710 and a computer program stored in the memory 720 and operable on the processor 710 .

The processor 710 and the memory 720 may be connected via a bus or in other ways.

As a non-transitory computer-readable storage medium, the memory 720 can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory 720 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices. In some embodiments, memory 720 may include memory located remotely from the processor, which may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The non-transitory software programs and instructions required to implement the echo cancellation method of the above embodiment are stored in the memory 720. When executed by the processor 710, they perform the echo cancellation method of the above embodiment, for example method steps S100 to S400 in FIG. 2, method steps S410 to S430 in FIG. 4, and method steps S510 to S550 in FIG. 5 described above.

In addition, an embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions. When executed by a processor or a controller, for example by a processor in the communication device of the above embodiment, the instructions cause the processor to execute the echo cancellation method corresponding to the terminal device side in the above embodiment, for example method steps S100 to S400 in FIG. 2, method steps S410 to S430 in FIG. 4, and method steps S510 to S550 in FIG. 5 described above.

The embodiments of the present application include: the first terminal device acquires first voice data output from the second terminal device, the first voice data being the voice data returned by the second terminal device after receiving the second voice data from the first terminal device; the first voice data is separated by a time-domain audio separation model to obtain a plurality of pieces of branch voice data; correlation determination is performed between each piece of branch voice data and a plurality of reference signals at different moments to determine the residual echo signal; and the residual echo signal in the first voice data is filtered out to obtain the target voice data. By separating the first voice data with the time-domain audio separation model, comparing the resulting branch voice data against the reference signals to determine the residual echo signal, and then filtering out that residual echo signal, the problem of echo leakage is solved and higher-quality voice output is provided.

Those skilled in the art can understand that all or some of the steps and systems in the methods disclosed above can be implemented as software, firmware, hardware, or an appropriate combination thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cartridges, tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

The above is a specific description of several embodiments of the present application, but the present application is not limited to the above embodiments. Those skilled in the art can make various equivalent modifications or replacements without departing from the spirit of the present application, and all such equivalent modifications or replacements fall within the scope defined by the claims of the present application.

Claims

An echo cancellation method applied to a first terminal device, the method comprising:

acquiring first voice data output from a second terminal device, the first voice data being the voice data replied by the second terminal device after receiving second voice data from the first terminal device;

separating the first voice data through a time-domain audio separation model to obtain a plurality of pieces of branch voice data;

performing a correlation determination between each of the plurality of pieces of branch voice data and a plurality of reference signals at different times to determine a residual echo signal; and

filtering the residual echo signal in the first voice data to obtain target voice data.
The echo cancellation method according to claim 1, wherein performing the correlation determination between each of the plurality of pieces of branch voice data and the plurality of reference signals at different times to determine the residual echo signal comprises:

performing a correlation calculation on each of the plurality of pieces of branch voice data and the plurality of reference signals at different times to obtain the reference signal with the maximum correlation degree;

when the maximum correlation degree is greater than a preset threshold, determining that the reference signal corresponding to the maximum correlation degree is a target reference signal; and

determining the branch voice data corresponding to the target reference signal as the residual echo signal.
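The correlation determination recited above can be illustrated by the following minimal sketch. It is not the applicant's implementation: the function names `normalized_correlation` and `find_residual_echo`, the use of a zero-lag normalized correlation as the "correlation degree", and the threshold value 0.7 are all assumptions made for illustration. Each reference signal "at a different time" is modeled as a differently delayed window of the same far-end signal.

```python
import math
from typing import List, Optional, Sequence, Tuple


def normalized_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the absolute normalized correlation of two frames, in [0, 1]."""
    n = min(len(a), len(b))
    dot = sum(a[i] * b[i] for i in range(n))
    energy_a = sum(a[i] * a[i] for i in range(n))
    energy_b = sum(b[i] * b[i] for i in range(n))
    if energy_a == 0.0 or energy_b == 0.0:
        return 0.0
    return abs(dot) / math.sqrt(energy_a * energy_b)


def find_residual_echo(
    branches: List[Sequence[float]],
    references: List[Sequence[float]],
    threshold: float = 0.7,  # assumed value; the source only says "preset threshold"
) -> Optional[Tuple[int, int]]:
    """Return (branch index, reference index) of the residual echo, or None."""
    best_pair: Optional[Tuple[int, int]] = None
    best_score = 0.0
    # Compare every branch with every delayed reference and keep the maximum.
    for branch_index, branch in enumerate(branches):
        for ref_index, reference in enumerate(references):
            score = normalized_correlation(branch, reference)
            if score > best_score:
                best_score = score
                best_pair = (branch_index, ref_index)
    # Only a maximum above the threshold identifies a target reference signal.
    if best_pair is not None and best_score > threshold:
        return best_pair
    return None


# Hypothetical data: one far-end signal, observed at three different delays.
far_end = [math.sin(0.3 * n) for n in range(64)]
references = [far_end[d:d + 32] for d in (0, 4, 8)]
near_speech = [math.sin(1.1 * n + 0.5) for n in range(32)]
echo_branch = [0.4 * x for x in far_end[4:36]]  # attenuated, delayed far-end copy

result = find_residual_echo([near_speech, echo_branch], references)
print(result)
```

In this toy setup the second branch is a scaled copy of the reference delayed by four samples, so it is the branch selected as residual echo; a branch list containing only near-end speech yields `None`, in which case nothing would be filtered.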
The echo cancellation method according to claim 2, wherein performing the correlation calculation on each of the plurality of pieces of branch voice data and the plurality of reference signals at different times comprises:

using the cross-correlation function of Web Real-Time Communication (WebRTC) to determine the correlation between each of the plurality of pieces of branch voice data and the plurality of reference signals at different times.
The echo cancellation method according to claim 1, wherein the first voice data is voice data obtained after the second terminal device processes, through an adaptive-filtering echo cancellation algorithm, the echo data generated by the second voice data.
The echo cancellation method according to claim 4, wherein the adaptive-filtering echo cancellation algorithm comprises:

obtaining estimated delay information by the second terminal device according to the second voice data and the reference signal;

obtaining an estimated echo signal according to the estimated delay information, preset adaptive filter coefficients, and the reference signal;

performing correlation calculation on the second voice data and the estimated echo signal to obtain correlation coefficients at different time points;

determining an echo input signal based on the correlation coefficients; and

filtering the echo input signal to obtain the first voice data.
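As a rough illustration of the delay estimation and adaptive filtering steps recited above, the following sketch estimates the delay by cross-correlation and then removes the echo with a normalized least-mean-squares (NLMS) filter. The claim does not name an adaptation rule, so NLMS is an assumption, as are the names `estimate_delay` and `nlms_cancel`, the tap count, and the step size. The correlation-coefficient step used to determine the echo input signal is omitted from this sketch.

```python
import random
from typing import List, Sequence


def estimate_delay(mic: Sequence[float], ref: Sequence[float], max_delay: int) -> int:
    """Return the lag (in samples) that maximizes the cross-correlation of mic and ref."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_delay + 1):
        end = min(len(mic), len(ref) + lag)
        score = sum(mic[n] * ref[n - lag] for n in range(lag, end))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def nlms_cancel(
    mic: Sequence[float],
    ref: Sequence[float],
    delay: int,
    taps: int = 8,      # assumed filter length
    step: float = 0.5,  # assumed NLMS step size
    eps: float = 1e-8,
) -> List[float]:
    """Subtract an adaptively estimated echo from mic and return the residual."""
    weights = [0.0] * taps
    residual: List[float] = []
    for n in range(len(mic)):
        # Reference samples aligned with the estimated delay (zero outside the signal).
        window = []
        for k in range(taps):
            idx = n - delay - k
            window.append(ref[idx] if 0 <= idx < len(ref) else 0.0)
        echo_estimate = sum(w * x for w, x in zip(weights, window))
        error = mic[n] - echo_estimate
        power = sum(x * x for x in window) + eps
        # NLMS update: move the weights along the error, normalized by input power.
        weights = [w + step * error * x / power for w, x in zip(weights, window)]
        residual.append(error)
    return residual


# Hypothetical data: the captured signal is the reference attenuated and delayed by 5 samples.
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(400)]
mic = [0.0] * 5 + [0.6 * x for x in ref[:-5]]

delay = estimate_delay(mic, ref, max_delay=20)
residual = nlms_cancel(mic, ref, delay)
```

A fixed-length filter of this kind can only model echo paths that fit within its taps, and it needs time to converge. These are the limitations the background section attributes to traditional echo cancellation and that the separation-based residual echo removal is intended to address.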
The echo cancellation method according to claim 1, wherein the time-domain audio separation model is a TasNet model trained according to mixed speech data.
The echo cancellation method according to claim 1, wherein filtering the residual echo signal in the first voice data to obtain the target voice data comprises:

filtering the residual echo signal in the first voice data by spectral subtraction to obtain the target voice data.
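Spectral subtraction as recited above can be sketched for a single frame as follows. This is a simplified illustration under stated assumptions: magnitude subtraction with the phase of the mixture retained, a naive DFT written with the Python standard library, and hypothetical names `dft`, `idft`, and `spectral_subtract`. A practical system would typically process overlapping windowed frames with an FFT.

```python
import cmath
import math
from typing import List, Sequence


def dft(frame: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform (O(n^2)), adequate for short frames."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: Sequence[complex]) -> List[float]:
    """Inverse of dft; returns the real part of the reconstructed frame."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def spectral_subtract(
    mixture: Sequence[float],
    echo: Sequence[float],
    floor: float = 0.0,  # assumed spectral floor, as a fraction of the mixture magnitude
) -> List[float]:
    """Subtract the echo magnitude spectrum from the mixture, keeping the mixture phase."""
    if len(mixture) != len(echo):
        raise ValueError("mixture and echo frames must have the same length")
    cleaned = []
    for m, e in zip(dft(mixture), dft(echo)):
        magnitude = max(abs(m) - abs(e), floor * abs(m))
        cleaned.append(cmath.rect(magnitude, cmath.phase(m)))
    return idft(cleaned)


# Hypothetical frame: near-end speech on DFT bin 3, residual echo on DFT bin 10.
N = 64
speech = [math.sin(2 * math.pi * 3 * t / N) for t in range(N)]
echo = [0.5 * math.sin(2 * math.pi * 10 * t / N) for t in range(N)]
mixture = [s + e for s, e in zip(speech, echo)]

target = spectral_subtract(mixture, echo)
max_error = max(abs(a - b) for a, b in zip(target, speech))
```

Because the speech and echo in this example occupy different frequency bins, the subtraction recovers the speech almost exactly. When the two overlap in frequency, magnitude subtraction is only approximate, which is why a nonzero `floor` is often used to limit audible artifacts.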
An echo canceling device, comprising:

an acquisition module, configured to acquire first voice data output from a second terminal device, the first voice data being the voice data replied by the second terminal device after receiving second voice data from a first terminal device;

a separation module, configured to separate the first voice data through a time-domain audio separation model to obtain a plurality of pieces of branch voice data;

a judging module, configured to perform a correlation determination between each of the plurality of pieces of branch voice data and a plurality of reference signals at different times to determine a residual echo signal; and

a filtering module, configured to filter the residual echo signal in the first voice data to obtain target voice data.
A terminal device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein, when the processor executes the computer program, the echo cancellation method according to any one of claims 1 to 7 is implemented.
A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to cause a computer to execute the echo cancellation method according to any one of claims 1 to 7.