CN117651096A

CN117651096A - Echo cancellation method, device, electronic equipment and storage medium

Info

Publication number: CN117651096A
Application number: CN202410120919.8A
Authority: CN
Inventors: 苏祥; 高毅; 陈静聪
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2024-01-29
Filing date: 2024-01-29
Publication date: 2024-03-05
Anticipated expiration: 2044-01-29
Also published as: CN117651096B

Abstract

The application discloses an echo cancellation method, an echo cancellation device, electronic equipment and a storage medium. The embodiment of the invention can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent transportation, auxiliary driving and the like. The method comprises the following steps: frequency shift processing and downsampling processing are respectively carried out on the near-end voice signal and the far-end voice signal corresponding to the near-end voice signal, so as to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal; determining a cross-correlation function between the near-end voice signal and the far-end voice signal according to the target far-end complex signal and the target near-end complex signal; determining an echo delay of the echo signal according to the cross-correlation function; echo cancellation is performed on the near-end speech signal based on the echo delay of the echo signal. By the method, the echo cancellation of the near-end voice signal is realized.

Description

Echo cancellation method, device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of electronic information technologies, and in particular, to an echo cancellation method, an echo cancellation device, an electronic device, and a storage medium.

Background

In a voice call, echo refers to the phenomenon in which a far-end voice signal sent to a terminal by the opposite end of the call is played by the loudspeaker of the terminal, collected by the microphone of the terminal, and then transmitted back to the opposite end. The signal collected by the microphone while the far-end voice signal is being played is called an echo signal; besides the echo signal, the near-end voice signal collected by the microphone also contains locally generated sound. Therefore, a means of cancelling the echo signal in the near-end voice signal is needed.

Disclosure of Invention

In view of this, the embodiments of the present application provide an echo cancellation method, an echo cancellation device, an electronic device, and a storage medium.

In a first aspect, an embodiment of the present application provides an echo cancellation method, where the method includes: frequency shifting is carried out on the near-end voice signal and the far-end voice signal corresponding to the near-end voice signal respectively to obtain a far-end complex signal corresponding to the far-end voice signal and a near-end complex signal corresponding to the near-end voice signal, and echo signals generated by playing the corresponding far-end voice signals are superimposed in the near-end voice signal; respectively carrying out downsampling treatment on the far-end complex signal and the near-end complex signal to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal; determining a cross-correlation function between the near-end voice signal and the far-end voice signal according to the target far-end complex signal and the target near-end complex signal; determining an echo delay of the echo signal according to the cross-correlation function; echo cancellation is performed on the near-end speech signal based on the echo delay of the echo signal.

In a second aspect, an embodiment of the present application provides an echo cancellation device, including: the frequency shifting module is used for respectively carrying out frequency shifting treatment on the near-end voice signal and the far-end voice signal corresponding to the near-end voice signal to obtain a far-end complex signal corresponding to the far-end voice signal and a near-end complex signal corresponding to the near-end voice signal, and echo signals generated by playing the corresponding far-end voice signals are superimposed in the near-end voice signal; the sampling module is used for respectively carrying out downsampling processing on the far-end complex signal and the near-end complex signal to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal; the function determining module is used for determining a cross-correlation function between the near-end voice signal and the far-end voice signal according to the target far-end complex signal and the target near-end complex signal; the delay determining module is used for determining the echo delay of the echo signal according to the cross-correlation function; and the elimination module is used for carrying out echo elimination on the near-end voice signal based on the echo delay of the echo signal.

Optionally, the function determining module is further configured to transform the target far-end complex signal to obtain a far-end spectrum corresponding to the target far-end complex signal; transforming the target near-end complex signal to obtain a near-end frequency spectrum corresponding to the target near-end complex signal; a cross-correlation function between the near-end speech signal and the far-end speech signal is determined from the far-end spectrum and the near-end spectrum.

Optionally, the near-end speech signal is a near-end speech signal segment in a near-end speech signal sequence acquired in real time. Optionally, the function determining module is further configured to: transform the target far-end complex signal, in response to the time elapsed since the end time of the near-end speech signal reaching a first duration, to obtain a far-end spectrum corresponding to the target far-end complex signal; transform the target near-end complex signal, in response to the time elapsed since the end time reaching a second duration, to obtain a near-end spectrum corresponding to the target near-end complex signal, the second duration being longer than the first duration; and determine a cross-correlation function between the near-end speech signal and the far-end speech signal according to the far-end spectrum and the near-end spectrum, in response to the time elapsed since the end time reaching a third duration, the third duration being longer than the second duration. Correspondingly, the delay determining module is further configured to determine the echo delay of the echo signal according to the cross-correlation function, in response to the time elapsed since the end time reaching a fourth duration, the fourth duration being longer than the third duration.

Optionally, the sampling module is further configured to perform downsampling processing on the far-end complex signal and the near-end complex signal for multiple times, so as to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal.

Optionally, the frequency band of interest is located in a frequency band of the target far-end complex signal and a frequency band of the target near-end complex signal; the sampling module is further used for respectively downsampling the far-end complex signal and the near-end complex signal by a first multiple to obtain an intermediate far-end complex signal corresponding to the far-end voice signal and an intermediate near-end complex signal corresponding to the near-end voice signal; respectively performing downsampling of a second multiple on the middle far-end complex signal and the middle near-end complex signal to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal; wherein the first multiple is higher than the second multiple; the product of the first multiple and the second multiple is equal to a target multiple, which is equal to a ratio of a highest frequency of the near-end complex signal to an upper limit frequency of the frequency band of interest.
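As a minimal sketch of the relationship between the multiples described above, the following Python snippet computes the target multiple as the ratio of the highest frequency to the upper limit frequency of the band of interest, and lists the two-stage splits whose product equals it with the first multiple higher than the second. The function names are hypothetical, and the 8 kHz and 500 Hz figures are the illustrative values used in the detailed description rather than required values.

```python
def target_multiple(highest_freq_hz, band_upper_hz):
    """Ratio of the highest frequency to the upper limit of the band of interest."""
    ratio = highest_freq_hz / band_upper_hz
    if ratio != int(ratio):
        raise ValueError("this sketch assumes an integer downsampling multiple")
    return int(ratio)


def two_stage_splits(target):
    """All (first, second) pairs with first > second > 1 and first * second == target."""
    return [(target // second, second)
            for second in range(2, target)
            if target % second == 0 and target // second > second]


# Example values (assumed): highest frequency 8 kHz, band of interest up to 500 Hz.
multiple = target_multiple(8000, 500)
splits = two_stage_splits(multiple)
print(multiple, splits)
```

Under these assumptions the only admissible split of the target multiple is a larger first stage followed by a smaller second stage, which matches the ordering constraint stated above.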

Optionally, the sampling module is further configured to determine a transition band of the anti-aliasing filter corresponding to the first multiple based on an upper limit frequency of the frequency band of interest, the first multiple, and a highest frequency of the near-end complex signal; the lower limit frequency of the transition band is equal to the upper limit frequency of the concerned frequency band, the upper limit frequency of the transition band is not less than half of the reference frequency, and the reference frequency is equal to the quotient of the highest frequency of the near-end complex signal and the first multiple; filtering the far-end complex signal and the near-end complex signal through an anti-aliasing filter based on the transition zone to obtain a reference intermediate far-end complex signal corresponding to the far-end complex signal and a reference intermediate near-end complex signal corresponding to the near-end complex signal; wherein, the signals in the transition zone in the far-end complex signals and the near-end complex signals are attenuated after filtering; and respectively extracting the reference intermediate far-end complex signal and the reference intermediate near-end complex signal according to the sampling interval corresponding to the first multiple to obtain an intermediate far-end complex signal corresponding to the far-end voice signal and an intermediate near-end complex signal corresponding to the near-end voice signal.

Optionally, the function determining module is further configured to determine a cross power spectrum and a weight according to the far-end spectrum and the near-end spectrum; weight the cross power spectrum based on the weight to obtain a weighted cross power spectrum; and perform inverse Fourier transform processing on the weighted cross power spectrum to obtain a cross-correlation function between the near-end voice signal and the far-end voice signal.
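The weighting and inverse-transform steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the text does not say how the weight is computed, so the sketch assumes the phase-transform (PHAT) weight, i.e. the reciprocal of the cross power spectrum magnitude. It also uses a direct O(N^2) DFT to stay dependency-free, and all function names are hypothetical.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Direct DFT or inverse DFT; O(N^2), adequate for a short sketch."""
    n_total = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n_total):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * n / n_total)
                  for n, v in enumerate(x))
        out.append(acc / n_total if inverse else acc)
    return out


def weighted_cross_correlation(far, near, eps=1e-12):
    """Cross-correlation obtained from a weighted cross power spectrum.

    Assumption: the weight is 1 / |cross power spectrum| (PHAT).
    """
    far_spec = dft(far)
    near_spec = dft(near)
    cross = [n_k * f_k.conjugate() for n_k, f_k in zip(near_spec, far_spec)]
    weighted = [c / (abs(c) + eps) for c in cross]
    return dft(weighted, inverse=True)


def estimate_delay(far, near):
    """Lag, in samples, at which the cross-correlation magnitude peaks."""
    corr = weighted_cross_correlation(far, near)
    return max(range(len(corr)), key=lambda lag: abs(corr[lag]))


# Toy complex signals: the near-end signal is the far-end signal
# attenuated and delayed by 5 samples (zero padding avoids wrap-around).
rng = random.Random(0)
far = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(32)] + [0j] * 32
near = [0j] * 5 + [0.6 * v for v in far[:59]]
delay = estimate_delay(far, near)
```

With the PHAT weight, only the phase of the cross power spectrum is kept, so the inverse transform concentrates into a sharp peak at the lag corresponding to the delay.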

Optionally, the cancellation module is further configured to delay the far-end speech signal according to an echo delay of the echo signal, so as to obtain a delayed far-end speech signal; and performing echo cancellation processing on the near-end voice signal according to the delayed far-end voice signal.
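A minimal sketch of this delay-then-cancel step is given below. The text does not name the cancellation algorithm, so the sketch assumes a normalized least-mean-squares (NLMS) adaptive filter as a stand-in. The function names, tap count, step size and toy echo path are all hypothetical.

```python
import random


def delay_signal(signal, delay):
    """Delay a sequence by `delay` samples, keeping its original length."""
    keep = max(len(signal) - delay, 0)
    return ([0.0] * delay + list(signal[:keep]))[:len(signal)]


def nlms_cancel(near, far_aligned, num_taps=4, step=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo from the near-end signal.

    Assumption: NLMS is used purely for illustration.
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps
    residual = []
    for d, x in zip(near, far_aligned):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        err = d - estimate
        norm = sum(h * h for h in history) + eps
        weights = [w + step * err * h / norm for w, h in zip(weights, history)]
        residual.append(err)
    return residual


# Toy scenario: the echo is the far-end signal delayed by 40 samples and
# shaped by a short 3-tap tail; there is no local speech or noise.
rng = random.Random(1)
far = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
echo_delay = 40
tail = [0.5, 0.25, 0.125]

far_aligned = delay_signal(far, echo_delay)
near = [sum(t * far_aligned[n - k] for k, t in enumerate(tail) if n - k >= 0)
        for n in range(len(far))]

residual = nlms_cancel(near, far_aligned)
```

Because the far-end signal is aligned with the echo before adaptation, the adaptive filter only has to model the short non-zero tail of the echo path rather than the full delay.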

Optionally, the frequency shift module is further configured to determine a target frequency shift amount according to a voice scene of the near-end voice signal; and respectively performing frequency shift processing of target frequency shift quantity on the far-end voice signal and the near-end voice signal to obtain a far-end complex signal corresponding to the far-end voice signal and a near-end complex signal corresponding to the near-end voice signal.

In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory; the memory has stored thereon computer readable instructions which, when executed by the processor, implement the method described above.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having computer-readable instructions stored thereon, which when executed by a processor, implement the above-described method.

In a fifth aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions which, when executed by a processor, implement the above-described method.

In the echo cancellation method, device, electronic equipment and storage medium provided by the embodiments of the present application, frequency shift processing is first performed on the near-end voice signal and on the far-end voice signal corresponding to the near-end voice signal, to obtain a far-end complex signal corresponding to the far-end voice signal and a near-end complex signal corresponding to the near-end voice signal. Downsampling processing is then performed on the far-end complex signal and the near-end complex signal respectively, to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal. A cross-correlation function between the near-end voice signal and the far-end voice signal is determined from the target far-end complex signal and the target near-end complex signal, and echo cancellation is performed on the near-end voice signal according to the echo delay determined from the cross-correlation function, so that echo cancellation of the near-end voice signal is realized. Meanwhile, the near-end voice signal and the far-end voice signal, which are real signals, are converted into complex signals (namely, the near-end complex signal and the far-end complex signal) through frequency shifting. The near-end complex signal and the far-end complex signal are obtained using all frequency bands of the near-end voice signal and the far-end voice signal respectively, which guarantees the integrity of the information they carry. Because the real signals are converted into the complex domain, the bandwidths of the far-end complex signal and the near-end complex signal are relatively smaller, which reduces the bandwidth of the signals to be processed and the amount of data processing.
The near-end complex signal and the far-end complex signal are then downsampled, and the cross-correlation function between the near-end voice signal and the far-end voice signal is determined based on the downsampled signals, which further reduces the amount of data processing. Moreover, owing to the characteristics of the human voice, the information of the low frequency band is invalid or even harmful; converting the real signals into complex signals and downsampling them reduces the influence of the low frequency band, which ensures the accuracy of the determined cross-correlation function and improves the echo cancellation effect, while the reduced amount of data processing improves the echo cancellation efficiency.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 shows a schematic diagram of an application scenario applicable to an embodiment of the present application;

Fig. 2 is a flow chart illustrating an echo cancellation method according to one embodiment of the present application;

Fig. 3 shows a schematic diagram of an impulse response of an echo signal in an embodiment of the present application;

Fig. 4 shows a schematic diagram of a filtering principle of an anti-aliasing filter in an embodiment of the present application;

Fig. 5 is a schematic diagram of a filtering and downsampling principle in an embodiment of the present application;

Fig. 6 is a schematic diagram of an echo delay determination process according to an embodiment of the present application;

Fig. 7 is a schematic diagram illustrating an echo cancellation process in an embodiment of the present application;

Fig. 8 is a flowchart of steps S103 and S104 in an embodiment corresponding to Fig. 2;

Fig. 9 is a schematic diagram illustrating yet another echo cancellation process in an embodiment of the present application;

Fig. 10 shows a block diagram of an echo cancellation device according to an embodiment of the present application;

Fig. 11 shows a block diagram of an electronic device for performing an echo cancellation method according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

In the following description, the terms "first", "second", and the like are merely used to distinguish similar objects and do not represent a particular ordering of the objects, it being understood that the "first", "second", and the like may be interchanged with one another, if permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application. It should be noted that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships are possible; for example, A and/or B may represent: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it.

The application discloses an echo cancellation method, an echo cancellation device, electronic equipment and a storage medium, and relates to intelligent traffic systems and internet of vehicles.

The intelligent traffic system (Intelligent Traffic System, ITS), also called the intelligent transportation system (Intelligent Transportation System), is a comprehensive transportation system that applies advanced science and technology (information technology, computer technology, data communication technology, sensor technology, electronic control technology, automatic control theory, operations research, artificial intelligence, etc.) effectively and comprehensively to transportation, service control and vehicle manufacturing, and strengthens the connection among vehicles, roads and users, thereby guaranteeing safety, improving efficiency, improving the environment and saving energy.

The intelligent vehicle-road cooperative system (Intelligent Vehicle Infrastructure Cooperative Systems, IVICS), referred to simply as the vehicle-road cooperative system, is one development direction of the intelligent transportation system (ITS). The vehicle-road cooperative system adopts advanced wireless communication, new-generation internet and other technologies to carry out dynamic, real-time vehicle-to-vehicle and vehicle-to-road information interaction in all directions. On the basis of dynamic traffic information acquisition and fusion across all times and locations, it develops active vehicle safety control and cooperative road management, fully realizes effective cooperation among people, vehicles and roads, ensures traffic safety, improves traffic efficiency, and thereby forms a safe, efficient and environment-friendly road traffic system.

As shown in Fig. 1, an application scenario applicable to the embodiment of the present application includes a first terminal 20 and a second terminal 10, where the first terminal 20 and the second terminal 10 are communicatively connected through a wired network or a wireless network. The first terminal 20 and the second terminal 10 may be smart phones, tablet computers, notebook computers, desktop computers, smart home appliances, vehicle-mounted terminals, aircraft, wearable devices, virtual reality devices, and other devices capable of performing voice interaction.

The first terminal 20 may collect a far-end voice signal and send it to the second terminal 10, which plays the far-end voice signal. While playing the far-end voice signal, the second terminal 10 may collect a near-end voice signal. The second terminal 10 then performs, according to the far-end voice signal and the near-end voice signal, echo cancellation on the echo signal superimposed in the near-end voice signal (the echo signal generated by playing the corresponding far-end voice signal) to obtain a clean near-end voice signal, and may then send the clean near-end voice signal to the first terminal 20.

It may be appreciated that the roles can also be reversed: the second terminal 10 may collect a far-end voice signal and send it to the first terminal 20, which plays the far-end voice signal. While playing the far-end voice signal, the first terminal 20 may collect a near-end voice signal. The first terminal 20 then performs, according to the far-end voice signal and the near-end voice signal, echo cancellation on the echo signal superimposed in the near-end voice signal (the echo signal generated by playing the corresponding far-end voice signal) to obtain a clean near-end voice signal, and may then send the clean near-end voice signal to the second terminal 10.

For convenience of description, the following embodiments describe the echo cancellation method taking execution by an electronic device as an example.

Referring to fig. 2, fig. 2 shows a flowchart of an echo cancellation method according to an embodiment of the present application, where the method may be used in an electronic device, and the electronic device may be the second terminal 10 or the first terminal 20 in fig. 1, and the method may include:

S101, frequency shift processing is carried out on the near-end voice signal and the far-end voice signal corresponding to the near-end voice signal respectively, so that a far-end complex signal corresponding to the far-end voice signal and a near-end complex signal corresponding to the near-end voice signal are obtained.

Wherein, echo signals generated by playing the corresponding far-end voice signals are superimposed in the near-end voice signals.

The far-end voice signal may refer to a voice signal that needs to be played by the electronic device (may be a voice signal sent by other devices or may be a voice signal recorded by the electronic device), and the near-end voice signal may refer to a voice signal collected by the electronic device during the process of playing the far-end voice signal by the electronic device.

For example, in a voice call, the two parties serve as far end and near end for each other: the voice signal collected by the far end is the far-end voice signal, and the voice signal collected by the near end is the near-end voice signal. As another example, when the electronic device plays stored speech audio, the audio signal of the speech audio may be used as the far-end voice signal, and the voice signal of the user collected while the electronic device plays the speech audio may be used as the near-end voice signal.

When the far-end voice signal is played by the electronic equipment, the far-end voice signal propagates in the environment where the electronic equipment is located to generate an echo signal, and the echo signal is also acquired by the electronic equipment when the voice signal is acquired, so that the echo signal generated by playing the corresponding far-end voice signal is also superimposed in the near-end voice signal acquired by the electronic equipment.

Ignoring the nonlinear effects of the microphone and the operating system in the electronic device on the far-end speech signal, the echo signal is modeled as a linear convolution of the far-end speech signal with a finite-length impulse response. That is, the echo signal resulting from playing the far-end speech signal may be expressed as $\mathrm{echo}(n) = x(n) * h(n)$, where $x(n)$ represents the far-end speech signal and $h(n)$ represents the impulse response.
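The linear-convolution echo model above can be illustrated with a short Python sketch. The impulse response and far-end samples below are made-up toy values, and the function name is hypothetical. The leading zeros of the impulse response stand for the system delay, and the non-zero tail stands for the speaker, microphone and room.

```python
def linear_convolution(x, h):
    """Full linear convolution of the sequences x and h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


# Hypothetical impulse response: 3 leading zeros (system delay)
# followed by a short decaying tail (hardware and reverberation).
h = [0.0, 0.0, 0.0, 0.5, 0.25, 0.125]
x = [1.0, 0.0, -1.0, 2.0]  # toy far-end samples

echo = linear_convolution(x, h)

# The echo delay is the length of the all-zero head of h.
echo_delay = next(i for i, v in enumerate(h) if v != 0.0)
```

In this toy case the first `echo_delay` samples of the echo are zero, which is the redundancy that the delay estimate is meant to remove before cancellation.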

The impulse response as a function of time is shown in Fig. 3. The all-zero first part of the impulse response comes from the delay of the operating system of the electronic device, while the non-zero second part is caused by the hardware characteristics of the speaker and microphone in the electronic device and by the reverberation characteristics of the environment in which the device is located. The all-zero first part of the impulse response $h(n)$ is evidently redundant, and the echo delay of the echo signal estimated in the present application is the duration of the all-zero first part of the impulse response $h(n)$.

In some embodiments, the far-end speech signal may be obtained by sampling the original far-end speech signal at a preset sampling rate, and the near-end speech signal may be obtained by sampling the original near-end speech signal at the preset sampling rate, where the preset sampling rate may be set based on demand, for example 16 kHz. For example, in a voice call, the two parties of the call serve as the far end and the near end respectively. The voice signal collected by the far end is used as the original far-end voice signal, and the far end samples it at the preset sampling rate to obtain the far-end voice signal; the voice signal collected by the near end is used as the original near-end voice signal, and the near end samples it at the preset sampling rate to obtain the near-end voice signal.

After the near-end voice signal and the far-end voice signal are obtained, frequency shift processing can be performed on the near-end voice signal to obtain a near-end complex signal corresponding to the near-end voice signal, and meanwhile, frequency shift processing is performed on the far-end voice signal to obtain a far-end complex signal corresponding to the far-end voice signal, so that high-frequency signals in the near-end voice signal and the far-end voice signal are transferred to a low-frequency region.

Because of the characteristics of the human voice and the influence of low-frequency background noise, the information of the low frequency band is invalid or even harmful. To make full use of the information of all frequency bands, frequency shift processing is performed on the near-end voice signal and the far-end voice signal so that their high-frequency content is moved to the middle and low frequency region. This reduces the influence of the low frequency band and allows the frequency-shifted far-end complex signal and near-end complex signal to include the information of the high frequency band.

In this embodiment, the near-end voice signal and the far-end voice signal may be modulated by an oscillation signal, so that frequency shift processing of the near-end voice signal and the far-end voice signal is realized and the frequency-shifted near-end complex signal and far-end complex signal are obtained. The oscillation signal may be $e^{j\omega_0 n}$, where $\omega_0$ refers to the frequency shift amount used in the frequency shift processing.

For simplicity of description, the frequency shift operation is explained here using the discrete-time Fourier transform. If the signal before frequency shift processing is $x(n)$ and its discrete-time Fourier transform is $X(e^{j\omega})$, then the frequency shift property can be expressed as $x(n)\,e^{j\omega_0 n} \leftrightarrow X(e^{j(\omega-\omega_0)})$, where the left side is the time-domain representation and the right side is the frequency-domain representation. That is, frequency shift processing corresponds, in the time domain, to modulating the time-domain signal by the oscillation signal $e^{j\omega_0 n}$.
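The effect of modulating a real signal by a complex oscillation signal can be checked numerically with the sketch below. It assumes a 16 kHz sampling rate, a 2 kHz test tone and a shift of -2 kHz, all chosen only for illustration. The function names are hypothetical, and the sign convention (a negative shift moves content toward lower frequencies) is an assumption of this sketch.

```python
import cmath
import math


def frequency_shift(signal, shift_hz, sample_rate_hz):
    """Modulate a sequence by exp(j * 2*pi * shift_hz * n / sample_rate_hz)."""
    w0 = 2.0 * math.pi * shift_hz / sample_rate_hz
    return [s * cmath.exp(1j * w0 * n) for n, s in enumerate(signal)]


def dft_bin_magnitude(signal, k):
    """Magnitude of the k-th DFT bin, evaluated directly."""
    n_total = len(signal)
    acc = sum(s * cmath.exp(-2j * math.pi * k * n / n_total)
              for n, s in enumerate(signal))
    return abs(acc)


fs = 16000       # assumed sampling rate in Hz
n_samples = 160  # 10 ms, so each DFT bin is 100 Hz wide
tone_hz = 2000   # a real tone has components at +2000 Hz and -2000 Hz
real_tone = [math.cos(2.0 * math.pi * tone_hz * n / fs) for n in range(n_samples)]

# Shifting by -2000 Hz moves the +2000 Hz component down to 0 Hz
# (and the -2000 Hz component to -4000 Hz); the result is complex.
shifted = frequency_shift(real_tone, -2000, fs)

before = dft_bin_magnitude(real_tone, 20)  # bin 20 corresponds to 2000 Hz
after_dc = dft_bin_magnitude(shifted, 0)   # bin 0 corresponds to 0 Hz
after_2k = dft_bin_magnitude(shifted, 20)
```

After the shift, the energy originally found at 2000 Hz appears at 0 Hz, and the spectrum is no longer symmetric about 0 Hz, which is why the shifted signal is complex-valued.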

Alternatively, S101 may include: determining a target frequency shift amount according to a voice scene of the near-end voice signal; and respectively performing frequency shift processing of target frequency shift quantity on the far-end voice signal and the near-end voice signal to obtain a far-end complex signal corresponding to the far-end voice signal and a near-end complex signal corresponding to the near-end voice signal.

The voice scene may refer to the scene in which the near-end voice signal occurs; for example, the voice scene may be a voice-only call scene, a video call scene, a conference scene, or an audio recording scene, where the audio recording scene may refer to a scene in which a voice signal needs to be recorded while the far-end voice signal is played. The target frequency shift amounts of different voice scenes may differ, and the user may set the corresponding target frequency shift amount based on requirements, without limitation.

Specifically, the target oscillation signal $e^{j\omega_0 n}$ can be determined according to the target frequency shift amount $\omega_0$, and the near-end voice signal and the far-end voice signal are then each modulated by the target oscillation signal to realize frequency shift processing of the near-end voice signal and the far-end voice signal, thereby obtaining the frequency-shifted near-end complex signal and far-end complex signal.

S102, respectively performing downsampling processing on the far-end complex signal and the near-end complex signal to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal.

The far-end complex signal can be downsampled to reduce the data volume and obtain the target far-end complex signal, and the near-end complex signal can be downsampled to reduce the data volume and obtain the target near-end complex signal. The ratio of the highest frequency of the near-end complex signal to the upper limit frequency of the frequency band of interest can be used as the target multiple of the downsampling. The frequency band of interest may be set based on requirements; for example, if the highest frequency of the near-end complex signal is 8 kHz and the frequency band of interest is -500 Hz to 500 Hz, the target multiple is 16.

Experiments show that, if the highest frequency of the finally sampled signal is 1 kHz, the maximum echo delay of the echo signal is 512 ms. Therefore, in the solution of the present application, to cover a delay of 512 ms, the time window of the near-end speech signal and the far-end speech signal used during downsampling must be at least 512 ms long. Because the cross-correlation function is unstable at the edges of a speech frame, the downsampling time window can be widened to reduce the influence of this instability on the delay estimation; for example, the downsampling time window can be set to 1024 ms (twice 512 ms). Of course, in other embodiments, the downsampling time window may also be another time window greater than 512 ms, such as 1.2, 1.5, 1.8, 2.2 or 2.5 times 512 ms. It should be noted that the larger the downsampling time window, the longer the signal it covers and the more data must be processed subsequently, so in practice a compromise is needed between reducing the influence of speech-frame edge instability on the delay estimation and reducing the amount of data processed.

In some embodiments, S102 may include: determining a transition band of a target anti-aliasing filter corresponding to the target multiple based on an upper limit frequency of the frequency band of interest, the target multiple and a highest frequency of the near-end complex signal; the lower limit frequency of the transition zone of the target anti-aliasing filter is equal to the upper limit frequency of the concerned frequency band, the upper limit frequency of the transition zone of the target anti-aliasing filter is not less than half of the first reference frequency, and the first reference frequency is equal to the quotient of the highest frequency of the near-end complex signal and the target multiple; filtering the far-end complex signal and the near-end complex signal through the target anti-aliasing filter based on the transition zone of the target anti-aliasing filter to obtain a first reference intermediate far-end complex signal corresponding to the far-end complex signal and a first reference intermediate near-end complex signal corresponding to the near-end complex signal, wherein signals in the transition zone of the target anti-aliasing filter in the far-end complex signal and the near-end complex signal are attenuated after filtering; and respectively extracting the first reference intermediate far-end complex signal and the first reference intermediate near-end complex signal according to sampling intervals corresponding to the target multiples to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal.

The target anti-aliasing filter may be a low-pass filter. The far-end complex signal and the near-end complex signal are each filtered by the target anti-aliasing filter to remove the high-frequency part, which avoids aliasing during downsampling, and are then extracted at the sampling interval to obtain the target far-end complex signal and the target near-end complex signal. A sampling interval is equal to the reciprocal of a sampling frequency; the sampling interval corresponding to the target multiple is equal to the reciprocal of the target sampling frequency, and the target sampling frequency is the quotient of the highest frequency of the near-end complex signal and the target multiple. The highest frequency of the near-end complex signal is the sampling frequency at which the near-end speech signal was sampled.

The filtering principle of the target anti-aliasing filter is shown in fig. 4. The abscissa is the frequency of the signal to be filtered; the ordinate is the insertion loss, expressed in dB, which indicates how strongly the target anti-aliasing filter attenuates the signal. The pass band is the frequency range that the target anti-aliasing filter allows to pass, the stop band is the frequency range that it does not allow to pass, and the transition band is the frequency range between the pass band and the stop band.

After the transition band of the target anti-aliasing filter is determined from the upper limit frequency of the frequency band of interest and the first reference frequency, the far-end complex signal and the near-end complex signal are each filtered by the target anti-aliasing filter, which removes signals in its stop band and attenuates signals in its transition band. The filtered far-end complex signal is taken as the first reference intermediate far-end complex signal, and the filtered near-end complex signal is taken as the first reference intermediate near-end complex signal. One signal point is then taken out of every M signal points (M being the target multiple) of the first reference intermediate far-end complex signal to construct a signal, which realizes the extraction and yields the target far-end complex signal corresponding to the far-end speech signal. Similarly, one signal point is taken out of every M signal points of the first reference intermediate near-end complex signal, which yields the target near-end complex signal corresponding to the near-end speech signal.

In this embodiment, the sampling principle of the downsampling process is shown in fig. 5. The signal input to the target anti-aliasing filter (which may be the far-end complex signal or the near-end complex signal) is $u(n)$, and the signal output by the target anti-aliasing filter is $v(n)$ (when the input $u(n)$ is the far-end complex signal, the output $v(n)$ is the first reference intermediate far-end complex signal; when the input $u(n)$ is the near-end complex signal, the output $v(n)$ is the first reference intermediate near-end complex signal). The relationship satisfied by the two is:

$$v(n) = \sum_{k=0}^{N-1} h(k)\, u(n-k)$$

wherein $h(k)$ are the coefficients of the target anti-aliasing filter and $N$ is the length of the target anti-aliasing filter.

Then, M-fold extraction is performed on the output $v(n)$ of the target anti-aliasing filter (1 point of data is extracted from every M points of data to construct a signal) to obtain the downsampled signal $v_D(m)$ (when $v(n)$ is the first reference intermediate far-end complex signal, the output $v_D(m)$ is the target far-end complex signal; when $v(n)$ is the first reference intermediate near-end complex signal, the output $v_D(m)$ is the target near-end complex signal). The relationship satisfied by the two is:

$$v_D(m) = v(mM)$$

wherein M is the target multiple, and 1 point is taken as output every M points.
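The filtering relationship and the M-fold extraction described above can be sketched as follows. This is a minimal illustration only: the moving-average taps are a stand-in for a properly designed anti-aliasing filter, and the names `fir_filter` and `decimate`, the test tone and the multiple of 4 are assumptions that do not come from the description.

```python
import cmath
import math


def fir_filter(signal, taps):
    """Each output sample is the sum over k of taps[k] * signal[n - k] (zero before n = 0)."""
    out = []
    for n in range(len(signal)):
        acc = 0j
        for k, coeff in enumerate(taps):
            if n - k >= 0:
                acc += coeff * signal[n - k]
        out.append(acc)
    return out


def decimate(signal, m):
    """M-fold extraction: keep 1 sample out of every m samples."""
    return signal[::m]


M = 4                                  # hypothetical target multiple
taps = [1.0 / 8] * 8                   # crude low-pass stand-in, not a designed filter
complex_input = [cmath.exp(2j * math.pi * 0.01 * n) for n in range(64)]

filtered = fir_filter(complex_input, taps)
downsampled = decimate(filtered, M)
print(len(filtered), len(downsampled))
```

In practice the taps would be designed so that the pass band covers the frequency band of interest and the transition band satisfies the limits stated in the text.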

The frequency shift processing moves the high-frequency signal to the low-frequency region, so that the information carried by the high-frequency signal is transferred to the low-frequency region. The frequency-shifted far-end complex signal and near-end complex signal are then downsampled to obtain the target far-end complex signal and the target near-end complex signal. Because of the frequency shift processing, the downsampled target far-end complex signal and target near-end complex signal still include the information carried by the high-frequency signal, so that this information is effectively retained.
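To illustrate how a frequency shift moves high-frequency content into the low-frequency region, the sketch below multiplies a real tone by a complex exponential. The 3000 Hz tone, the shift amount and the name `frequency_shift` are hypothetical values chosen for the example; the actual shift used by the method is defined in the frequency shift step and is not reproduced here.

```python
import cmath
import math


def frequency_shift(signal, shift_hz, sample_rate_hz):
    """Multiply by exp(-j*2*pi*shift*n/fs), moving content at +shift_hz down to 0 Hz."""
    return [
        s * cmath.exp(-2j * math.pi * shift_hz * n / sample_rate_hz)
        for n, s in enumerate(signal)
    ]


sample_rate_hz = 16000
tone_hz = 3000  # hypothetical high-frequency component

real_tone = [math.cos(2 * math.pi * tone_hz * n / sample_rate_hz) for n in range(160)]
shifted = frequency_shift(real_tone, tone_hz, sample_rate_hz)

# After the shift, half of the tone's amplitude sits at 0 Hz (the mean of the signal).
dc_component = sum(shifted) / len(shifted)
print(abs(dc_component))
```

The shifted signal is complex even though the input is real, which is why the subsequent filtering and extraction operate on complex signals.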

In some embodiments, S102 may further include performing downsampling processing on the far-end complex signal and the near-end complex signal for multiple times, to obtain a target far-end complex signal corresponding to the far-end speech signal and a target near-end complex signal corresponding to the near-end speech signal. Wherein the product of the single downsampling multiple of each downsampling process of the plurality of downsampling processes is the aforementioned target multiple, i.e., the downsampling process is divided into a plurality of downsampling processes.

In the multiple downsampling processes, each downsampling first filters the signal through an anti-aliasing filter and then extracts 1 point of data from every Q points of data (Q being a natural number greater than 1) of the filtered signal to construct a signal, thereby realizing Q-fold downsampling. The multiple of each downsampling may differ, that is, the Q value chosen may differ between downsampling processes. For example, with 3 downsampling processes in which Q is 2 for the first, the second and the third downsampling, 8-fold downsampling is achieved by the 3 downsampling processes.

Wherein the lower limit frequency of the transition band of the anti-aliasing filter of each filtering process is equal to the upper limit frequency of the concerned frequency band, the upper limit frequency of the transition band of the anti-aliasing filter of each filtering process is not less than half of the corresponding reference frequency, and the corresponding reference frequency of each filtering process is equal to the ratio of the highest frequency of the signal before the filtering process to the multiple of the downsampling process.

Compared with one downsampling process, the anti-aliasing filter used in each downsampling process in multiple downsampling processes can be provided with a wider transition zone, so that the data processing capacity of the anti-aliasing filter is reduced, the filtering efficiency of the anti-aliasing filter is improved, and the sampling efficiency of the whole downsampling process is greatly improved.
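A cascade of several filter-and-extract stages can be sketched as below. This is an illustration under stated assumptions: `lowpass_taps` returns simple moving-average taps in place of designed anti-aliasing filters, and the per-stage factors (2, 2, 2) follow the 8-fold example in the text.

```python
from math import prod


def lowpass_taps(num_taps):
    """Moving-average taps: a crude stand-in for a designed anti-aliasing filter."""
    return [1.0 / num_taps] * num_taps


def filter_and_decimate(signal, taps, q):
    """FIR-filter the signal, then keep 1 sample out of every q samples."""
    filtered = []
    for n in range(len(signal)):
        acc = 0j
        for k, coeff in enumerate(taps):
            if n - k >= 0:
                acc += coeff * signal[n - k]
        filtered.append(acc)
    return filtered[::q]


def multistage_downsample(signal, factors):
    """Apply one filter-and-extract stage per factor; the overall multiple is their product."""
    for q in factors:
        signal = filter_and_decimate(signal, lowpass_taps(2 * q), q)
    return signal


factors = (2, 2, 2)
constant_input = [complex(1.0, 0.0)] * 64
output = multistage_downsample(constant_input, factors)
print(prod(factors), len(output))
```

Because each early stage only has to protect the frequency band of interest, its filter can be short, which is the efficiency argument made in the preceding paragraph.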

Optionally, downsampling the far-end complex signal and the near-end complex signal by a first multiple respectively to obtain an intermediate far-end complex signal corresponding to the far-end voice signal and an intermediate near-end complex signal corresponding to the near-end voice signal; then, respectively performing downsampling on the intermediate far-end complex signal and the intermediate near-end complex signal by a second multiple to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal; wherein the first multiple is higher than the second multiple; the product of the first multiple and the second multiple is equal to a target multiple, which is equal to a ratio of a highest frequency of the near-end complex signal to an upper limit frequency of the frequency band of interest.

That is, the filtered far-end complex signal and the filtered near-end complex signal are downsampled by a high multiple to obtain an intermediate far-end complex signal and an intermediate near-end complex signal, and then downsampled by a low multiple to obtain the target far-end complex signal and the target near-end complex signal. For example, when the downsampling factor between the far-end complex signal and the target far-end complex signal is 16 times, the first factor may be 8 times and the second factor may be 2 times.

It can be understood that, in order to avoid aliasing in the downsampling process, downsampling is performed on the far-end complex signal and the near-end complex signal by a first multiple to obtain an intermediate far-end complex signal corresponding to the far-end speech signal and an intermediate near-end complex signal corresponding to the near-end speech signal, including: determining a transition band of the anti-aliasing filter corresponding to the first multiple based on an upper limit frequency of the frequency band of interest, the first multiple, and a highest frequency of the near-end complex signal; the lower limit frequency of the transition band is equal to the upper limit frequency of the concerned frequency band, the upper limit frequency of the transition band is not less than half of the reference frequency, and the reference frequency is equal to the quotient of the highest frequency of the near-end complex signal and the first multiple; filtering the far-end complex signal and the near-end complex signal through an anti-aliasing filter based on the transition zone to obtain a reference intermediate far-end complex signal corresponding to the far-end complex signal and a reference intermediate near-end complex signal corresponding to the near-end complex signal, wherein signals in the transition zone in the far-end complex signal and the near-end complex signal are attenuated after filtering; and respectively extracting the reference intermediate far-end complex signal and the reference intermediate near-end complex signal according to the sampling interval corresponding to the first multiple to obtain an intermediate far-end complex signal corresponding to the far-end voice signal and an intermediate near-end complex signal corresponding to the near-end voice signal.

Similarly, downsampling the intermediate far-end complex signal and the intermediate near-end complex signal by the second multiple, to obtain the target far-end complex signal corresponding to the far-end speech signal and the target near-end complex signal corresponding to the near-end speech signal, includes: determining a transition band of the anti-aliasing filter corresponding to the second multiple (the second anti-aliasing filter) based on the upper limit frequency of the frequency band of interest, the second multiple, and the highest frequency of the intermediate near-end complex signal, where the lower limit frequency of the transition band of the second anti-aliasing filter is equal to the upper limit frequency of the frequency band of interest, the upper limit frequency of the transition band of the second anti-aliasing filter is not less than half of a second reference frequency, and the second reference frequency is equal to the quotient of the highest frequency of the intermediate near-end complex signal and the second multiple; filtering the intermediate far-end complex signal and the intermediate near-end complex signal through the second anti-aliasing filter based on its transition band, to obtain a second reference intermediate far-end complex signal corresponding to the far-end complex signal and a second reference intermediate near-end complex signal corresponding to the near-end complex signal, where signals of the intermediate far-end complex signal and the intermediate near-end complex signal that lie in the transition band of the second anti-aliasing filter are attenuated after filtering; and extracting from the second reference intermediate far-end complex signal and the second reference intermediate near-end complex signal, respectively, at the sampling interval corresponding to the second multiple, to obtain the target far-end complex signal corresponding to the far-end speech signal and the target near-end complex signal corresponding to the near-end speech signal.

The anti-aliasing filters of the two filtering processes can be low-pass filters, so that the far-end complex signals and the near-end complex signals are respectively filtered for two times to filter out high-frequency parts, and aliasing is avoided in the down sampling process. The sampling interval corresponding to the first multiple is equal to the reciprocal of the first sampling frequency, the first sampling frequency is the quotient of the highest frequency of the near-end complex signal and the first multiple, and the highest frequency of the near-end complex signal is the sampling frequency for obtaining the near-end voice signal through sampling; the sampling interval corresponding to the second multiple is equal to the reciprocal of the second sampling frequency, the second sampling frequency is the quotient of the highest frequency of the intermediate near-end complex signal and the second multiple, and the highest frequency of the intermediate near-end complex signal is the quotient of the sampling frequency of the near-end speech signal obtained by sampling and the first multiple.

After the transition band of the anti-aliasing filter is determined from the upper limit frequency of the frequency band of interest and the reference frequency, the far-end complex signal and the near-end complex signal are each filtered by that anti-aliasing filter, which removes signals in its stop band and attenuates signals in its transition band. The filtered far-end complex signal is taken as the reference intermediate far-end complex signal, and the filtered near-end complex signal is taken as the reference intermediate near-end complex signal. One signal point is then taken out of every P signal points (P being the first multiple) of the reference intermediate far-end complex signal to construct a signal, which realizes the extraction and yields the intermediate far-end complex signal corresponding to the far-end voice signal. Similarly, one signal point is taken out of every P signal points of the reference intermediate near-end complex signal, which yields the intermediate near-end complex signal corresponding to the near-end voice signal.

After the transition band of the second anti-aliasing filter is determined from the upper limit frequency of the frequency band of interest and the second reference frequency, the intermediate far-end complex signal and the intermediate near-end complex signal are each filtered by that anti-aliasing filter, which removes signals in its stop band and attenuates signals in its transition band. The filtered intermediate far-end complex signal is taken as the second reference intermediate far-end complex signal, and the filtered intermediate near-end complex signal is taken as the second reference intermediate near-end complex signal. One signal point is then taken out of every R signal points (R being the second multiple) of the second reference intermediate far-end complex signal to construct a signal, which realizes the extraction and yields the target far-end complex signal corresponding to the far-end speech signal. Similarly, one signal point is taken out of every R signal points of the second reference intermediate near-end complex signal, which yields the target near-end complex signal corresponding to the near-end speech signal.

In this embodiment, determining the cross-correlation function mainly uses the information carried by signals in the frequency band of interest. The downsampling process is therefore split into the two downsampling processes described above, with the first multiple of the first downsampling process higher than the second multiple of the second downsampling process, so that the anti-aliasing filter selected for the first downsampling process can have a wider transition band (a certain amount of aliasing is even allowed). This reduces the amount of data the anti-aliasing filter has to process, improves its filtering efficiency, and improves the sampling efficiency of the whole downsampling process.

For example, suppose the far-end voice signal and the near-end voice signal are both obtained by sampling at a reference sampling rate of 16 kHz, so that the signal frequency bands of the resulting far-end complex signal and near-end complex signal are both -8 kHz to 8 kHz. If the downsampling multiple between the far-end complex signal and the target far-end complex signal is 16, and the frequency band of interest mainly used by the cross-correlation function is -500 Hz to 500 Hz, the downsampling process can be split into the two downsampling processes described above, with a first multiple of 8 for the first downsampling process and a second multiple of 2 for the second downsampling process. The anti-aliasing filter selected for the first downsampling can then have a wider transition band (the transition band may start at 500 Hz and end at 1500 Hz), which reduces the amount of data processed during filtering and improves the filtering efficiency.
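The numbers in this example can be checked with a small planning helper. The sketch below is illustrative only: `stage_plan` and its dictionary keys are hypothetical names, and the "alias-free" upper bound (output rate minus the band limit) is a standard decimation argument added here as an assumption; it is consistent with the 1500 Hz figure in the text but is not stated by the description.

```python
def stage_plan(sample_rate_hz, band_upper_hz, factors):
    """List, per stage, the rates and the transition-band limits of its anti-aliasing filter."""
    plan = []
    rate = sample_rate_hz
    for q in factors:
        out_rate = rate / q
        plan.append({
            "in_rate_hz": rate,
            "out_rate_hz": out_rate,
            # Transition band starts at the upper limit of the band of interest.
            "transition_start_hz": band_upper_hz,
            # Text: the transition band must end at no less than half the reference frequency.
            "transition_end_min_hz": out_rate / 2,
            # Assumption: beyond this point, aliases would fold into the band of interest.
            "transition_end_alias_free_hz": out_rate - band_upper_hz,
        })
        rate = out_rate
    return plan


for stage in stage_plan(16000.0, 500.0, (8, 2)):
    print(stage)
```

For the first stage this gives a transition band from 500 Hz that may extend to 1500 Hz, whereas a single 16-fold stage would have to finish its transition at 500 Hz.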

S103, determining a cross-correlation function between the near-end voice signal and the far-end voice signal according to the target far-end complex signal and the target near-end complex signal.

In this embodiment, the target far-end complex signal may be transformed to obtain a far-end spectrum corresponding to the target far-end complex signal, and the target near-end complex signal may be transformed to obtain a near-end spectrum corresponding to the target near-end complex signal; a cross-correlation function between the near-end speech signal and the far-end speech signal is determined from the far-end spectrum and the near-end spectrum.

Specifically, performing Fourier transform on the target far-end complex signal to obtain a far-end frequency spectrum corresponding to the target far-end complex signal; and carrying out Fourier transform on the target near-end complex signal to obtain a near-end frequency spectrum corresponding to the target near-end complex signal.

Then, a cross power spectrum and a weight are determined from the far-end spectrum and the near-end spectrum; the cross power spectrum is weighted based on the weight to obtain a weighted cross power spectrum; and an inverse Fourier transform is performed on the weighted cross power spectrum to obtain the cross-correlation function between the near-end voice signal and the far-end voice signal.

In this embodiment, the cross power spectrum and the weight can be calculated by formula one, as follows:

$$G(k) = X^{*}(k)\,Y(k), \qquad W(k) = \frac{1}{|G(k)|}$$ (one)

wherein $G(k)$ is the cross power spectrum, $W(k)$ is the weight, $X(k)$ is the far-end spectrum, $Y(k)$ is the near-end spectrum, $X^{*}(k)$ is obtained by performing a complex conjugate operation on $X(k)$, and $|G(k)|$ is the energy of the signal.

After the cross power spectrum and the weight are obtained, the product of the cross power spectrum and the weight can be calculated directly, which weights the cross power spectrum based on the weight and yields the weighted cross power spectrum. An inverse Fourier transform is then performed on the weighted cross power spectrum to obtain the cross-correlation function between the near-end voice signal and the far-end voice signal.

The cross power spectrum is weighted based on the weight, and the weighted cross power spectrum is calculated according to formula two, as follows:

$$\Psi(k) = W(k)\,G(k)$$ (two)

wherein $\Psi(k)$ is the weighted cross power spectrum, $G(k)$ is the cross power spectrum, and $W(k)$ is the weight.
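The weighting and inverse transform described above can be sketched with a small, self-contained example. Several points are assumptions of this sketch rather than statements of the description: the cross power spectrum is formed as the conjugate of the far-end spectrum times the near-end spectrum, the weight is the reciprocal of its magnitude (a phase-transform style weighting), a direct O(N^2) transform stands in for a fast Fourier transform, and the test signals are synthetic.

```python
import cmath
import random


def dft(x, inverse=False):
    """Direct discrete Fourier transform; fine for short illustrative signals."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def weighted_cross_correlation(far, near, eps=1e-12):
    """Cross power spectrum -> weight -> weighted spectrum -> inverse transform."""
    far_spectrum = dft(far)
    near_spectrum = dft(near)
    cross = [xf.conjugate() * yn for xf, yn in zip(far_spectrum, near_spectrum)]
    weights = [1.0 / (abs(g) + eps) for g in cross]   # eps avoids division by zero
    weighted = [w * g for w, g in zip(weights, cross)]
    return dft(weighted, inverse=True)


rng = random.Random(0)
far = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(32)]
true_delay = 5
near = far[-true_delay:] + far[:-true_delay]          # far-end signal delayed circularly

correlation = weighted_cross_correlation(far, near)
peak_index = max(range(len(correlation)), key=lambda i: abs(correlation[i]))
print(peak_index)
```

With this sign convention the peak index equals the number of samples by which the near-end signal lags the far-end signal.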

For example, for a voice signal with a sampling rate of 16 kHz, suppose a delay range of 512 ms is to be covered (the echo delay generally does not exceed 512 ms, so the delay range to be covered is set to 512 ms). Considering the instability of the cross-correlation function at the signal edges, a time window of 1024 ms is required, which means a 16384-point Fourier transform of a real signal (a Fourier transform is needed when determining the cross-correlation function), and the amount of calculation is huge. Because the energy of a voice signal is mainly concentrated in the low-frequency part, the far-end voice signal and the near-end voice signal can be downsampled to 2 kHz, which reduces the data volume while still retaining signals with sufficient energy; in this case a 2048-point Fourier transform of a real signal is required.
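The point counts quoted above follow from the sampling rate and the window length. The short sketch below reproduces that arithmetic; `rough_fft_cost` uses the common N log2 N estimate as an assumed proxy for the amount of calculation, which the description itself does not quantify.

```python
import math


def fft_points(sample_rate_hz: int, window_ms: int) -> int:
    """Number of samples in a window of the given length at the given rate."""
    return sample_rate_hz * window_ms // 1000


def rough_fft_cost(points: int) -> float:
    """Assumed cost proxy: N * log2(N)."""
    return points * math.log2(points)


full_rate_points = fft_points(16000, 1024)    # 16384
downsampled_points = fft_points(2000, 1024)   # 2048
cost_ratio = rough_fft_cost(full_rate_points) / rough_fft_cost(downsampled_points)
print(full_rate_points, downsampled_points, round(cost_ratio, 1))
```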

However, the effective bandwidth of a real signal with a sampling rate of 2 kHz is 0 to 1000 Hz, and, owing to the characteristics of the human voice and to low-frequency background noise, the low-frequency band of the near-end voice signal and the far-end voice signal is ineffective or even harmful. The high-frequency signal is therefore moved to the low-frequency region by the frequency shift processing, and the frequency-shifted signals are then downsampled to obtain the target far-end complex signal and the target near-end complex signal, which thus include the information of the high-frequency signal, so that all frequency bands are fully used. The low-frequency background noise is not filtered out of the target far-end complex signal and the target near-end complex signal; rather, the information carried by the high-frequency signal is added to them, which reduces the influence of the low-frequency background noise in the near-end voice signal and the far-end voice signal and improves the accuracy of the determined cross-correlation function.

S104, determining the echo delay of the echo signal according to the cross-correlation function.

The cross-correlation function describes the degree of correlation between the values of two signals at any two different moments. After the cross-correlation function is obtained, peak detection is performed on it. The time T at which the peak is detected indicates that the near-end voice signal, offset by T, is most similar to the far-end voice signal; the time T is therefore the echo delay of the echo signal in the near-end voice signal.

In this embodiment, the process of determining the echo delay according to the target far-end complex signal and the target near-end complex signal is shown in fig. 6, where the target far-end complex signal is fourier transformed to obtain a far-end spectrum, the target near-end complex signal is fourier transformed to obtain a near-end spectrum, then the weight and the cross-power spectrum are determined according to the near-end spectrum and the far-end spectrum, and the cross-power spectrum is weighted according to the weight to obtain a weighted cross-power spectrum. And then, carrying out inverse Fourier transform on the weighted cross power spectrum to obtain a cross-correlation function, and carrying out peak detection on the cross-correlation function to obtain echo delay.
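The final peak-detection step can be sketched as a search for the largest magnitude in the cross-correlation function, followed by a conversion from sample index to time. The wrap-around treatment of negative lags and the 1000 Hz rate of the downsampled signal are assumptions of this sketch, and `echo_delay_ms` is a hypothetical name.

```python
def echo_delay_ms(cross_correlation, sample_rate_hz):
    """Locate the correlation peak and convert its lag to milliseconds."""
    n = len(cross_correlation)
    peak = max(range(n), key=lambda i: abs(cross_correlation[i]))
    # Assumption: indices in the upper half of a circular correlation are negative lags.
    lag = peak if peak <= n // 2 else peak - n
    return 1000.0 * lag / sample_rate_hz


# Synthetic correlation with a single peak at index 200, at an assumed 1000 Hz rate.
synthetic = [0.0] * 1024
synthetic[200] = 1.0
print(echo_delay_ms(synthetic, 1000))
```

Note that the lag is measured at the downsampled rate, so it must be converted back to time (or to samples at the original rate) before the far-end voice signal is delayed.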

S105, echo cancellation is carried out on the near-end voice signal based on echo delay of the echo signal.

After the echo delay is obtained, the far-end voice signal can be subjected to delay processing according to the echo delay of the echo signal, so that a delayed far-end voice signal is obtained; and performing echo cancellation processing on the near-end voice signal according to the delayed far-end voice signal to obtain a target near-end voice signal.

The far-end voice signal may be delayed by a delay time equal to the echo delay, and the resulting signal is taken as the delayed far-end voice signal.

In some embodiments, an inverse delayed far-end voice signal corresponding to the delayed far-end voice signal (a signal with the same frequency and phase as the delayed far-end voice signal but the opposite amplitude) may be determined, and the inverse delayed far-end voice signal may be superimposed on the near-end voice signal to perform echo cancellation processing on the near-end voice signal, so as to obtain the target near-end voice signal.

In other embodiments, the delayed far-end speech signal may be subtracted directly from the near-end speech signal to perform echo cancellation processing on the near-end speech signal to obtain the target near-end speech signal.
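The delay-and-subtract variant can be sketched as below. The example assumes an idealized echo path with unit gain and a pure delay, so that direct subtraction removes the echo exactly; a real echo path also scales and smears the signal, which this sketch ignores. The names `delay_signal` and `cancel_echo` are hypothetical.

```python
def delay_signal(signal, delay_samples):
    """Shift the signal later in time by delay_samples, padding the start with zeros."""
    if delay_samples <= 0:
        return list(signal)
    return ([0.0] * delay_samples + list(signal))[:len(signal)]


def cancel_echo(near, far, delay_samples):
    """Subtract the delayed far-end signal from the near-end signal, sample by sample."""
    delayed_far = delay_signal(far, delay_samples)
    return [y - d for y, d in zip(near, delayed_far)]


far = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
local_speech = [0.5] * 8
echo = delay_signal(far, 3)                      # idealized echo: pure 3-sample delay
near = [s + e for s, e in zip(local_speech, echo)]

target_near = cancel_echo(near, far, 3)
print(target_near)
```

Adding the negated delayed signal, as in the inverse-signal embodiment, produces the same result as the subtraction shown here.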

In a real-time voice call scenario, the echo cancellation process is shown in fig. 7. The far-end application in the far-end device collects the voice of the far-end speaker through a microphone to obtain the far-end voice signal $x(n)$, and then sends the far-end voice signal $x(n)$ over the network to the near-end application in the near-end device. The near-end application controls the loudspeaker of the near-end device, through the operating system of the near-end device, to play the far-end voice signal, and the far-end voice signal propagating through the echo path yields the echo signal $d(n)$ (the echo signal is actually the linear convolution of the far-end voice signal with a finite-length impulse response $h_e(n)$, so the echo signal is determined to be $d(n) = x(n) * h_e(n)$). Meanwhile, the near-end device collects the speech signal $s(n)$ of the near-end speaker. Because of the echo signal, the acquired near-end voice signal is $y(n) = s(n) + d(n)$, that is, the result of superimposing the speech signal $s(n)$ and the echo signal $d(n)$.

Thereafter, following the procedure of S101-S104 above, delay estimation is performed based on the near-end voice signal $y(n)$ and the far-end voice signal $x(n)$ to obtain the echo delay $T$ of the echo signal, and the far-end voice signal $x(n)$ is delayed by the echo delay to obtain the delayed far-end voice signal $x(n-T)$. Echo cancellation is then performed on the near-end voice signal $y(n)$ using the delayed far-end voice signal $x(n-T)$ to obtain the target near-end voice signal $e(n)$, and the target near-end voice signal $e(n)$ is sent to the far-end application, so that the far-end application controls the loudspeaker of the far-end device, through the operating system of the far-end device, to play the target near-end voice signal $e(n)$. At this point the target near-end voice signal has undergone echo cancellation processing and contains no echo signal, or only very little.
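The signal model of this scenario, in which the echo is the linear convolution of the far-end voice signal with a finite-length impulse response and the microphone picks up the local speech plus the echo, can be sketched as follows. The impulse response values and the signals are made up for illustration, and `convolve` is a plain reference implementation rather than an optimized one.

```python
def convolve(signal, impulse_response):
    """Linear convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


far = [1.0, 0.0, 0.0, 0.0, -1.0, 0.0]
echo_path = [0.0, 0.0, 0.6, 0.3]          # hypothetical: 2-sample delay, then decay
local_speech = [0.1, 0.2, 0.1, 0.0, 0.0, 0.0]

echo = convolve(far, echo_path)[:len(far)]               # echo seen during the frame
near = [s + d for s, d in zip(local_speech, echo)]       # microphone signal
print(echo)
```

In this toy model the echo first appears two samples after the far-end sample that caused it, which is the kind of lag the delay estimation of S101-S104 is meant to recover.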

In this embodiment, frequency shift processing is performed on the near-end speech signal and on the far-end speech signal corresponding to the near-end speech signal, to obtain a far-end complex signal corresponding to the far-end speech signal and a near-end complex signal corresponding to the near-end speech signal. Downsampling processing is then performed on the far-end complex signal and the near-end complex signal to obtain a target far-end complex signal corresponding to the far-end speech signal and a target near-end complex signal corresponding to the near-end speech signal, and a cross-correlation function between the near-end speech signal and the far-end speech signal is determined from the target far-end complex signal and the target near-end complex signal. Echo cancellation is then performed on the near-end speech signal according to the echo delay determined from the cross-correlation function, thereby realizing echo cancellation of the near-end speech signal. Meanwhile, the near-end speech signal and the far-end speech signal, which are real signals, are converted into complex signals (namely, the near-end complex signal and the far-end complex signal) by the frequency shift. The near-end complex signal and the far-end complex signal are obtained using all frequency bands of the near-end speech signal and the far-end speech signal respectively, which guarantees the integrity of the information they carry. Because the real near-end and far-end speech signals are converted into the complex domain, the bandwidths of the far-end complex signal and the near-end complex signal are relatively smaller, which reduces the bandwidth of the signals to be processed and the amount of data processed. The near-end complex signal and the far-end complex signal are then downsampled, and the cross-correlation function between the near-end speech signal and the far-end speech signal is determined based on the downsampled signals, which further reduces the amount of data processed. Moreover, owing to the characteristics of the human voice, the information of the low-frequency band is ineffective or even harmful; converting the real signals into complex signals and downsampling them reduces the influence of the low-frequency band, which ensures the accuracy of the determined cross-correlation function and improves the echo cancellation effect, while the reduced amount of data processed improves the echo cancellation efficiency.

In an embodiment, the near-end speech signal is a near-end speech signal segment in a near-end speech signal sequence acquired in real time, and the corresponding far-end speech signal is a far-end speech signal segment in a far-end speech signal sequence, where a far-end speech signal segment corresponds to a near-end speech signal segment, and the duration of the far-end speech signal segment is the same as that of the corresponding near-end speech signal segment. As shown in fig. 8, step S103 includes:

S201, in response to the delay duration after the end time of the near-end voice signal reaching the first duration, transforming the target far-end complex signal to obtain a far-end spectrum corresponding to the target far-end complex signal.

In some embodiments, after the far-end voice signal is played, near-end voice signal segments may be acquired according to a target acquisition duration. The acquired segments (each lasting the target acquisition duration) are arranged in order of acquisition to form the near-end voice signal sequence, and any segment in the sequence may be used as a near-end voice signal. The target acquisition duration may be set as required, for example 1 s.

Similarly, each far-end voice signal is a voice signal segment obtained by dividing the complete acquired initial far-end voice signal according to the target acquisition duration, and the segments are arranged in the order in which they occur in the initial far-end voice signal to obtain the far-end voice signal sequence.

In some embodiments, the far-end voice signal sequence may also be acquired in real time, and a far-end voice signal segment acquired according to the target acquisition duration is taken as a far-end voice signal.

In this embodiment, near-end voice frames may be acquired according to a preset acquisition duration (each near-end voice frame lasts the preset acquisition duration), and the acquired frames are arranged in order of acquisition to form the near-end voice signal sequence. Any K consecutive near-end voice frames may be selected from the sequence to construct a near-end voice signal segment, where K is a positive integer greater than 1. The preset acquisition duration may be 10 ms. Echo delay estimation can thus be performed once every K near-end voice frames rather than once for every near-end voice frame, which reduces the overall amount of computation for echo cancellation.
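As a minimal sketch of this framing scheme, the following Python snippet buffers fixed-length frames and releases a segment for delay estimation only once every K frames. The class name `SegmentTrigger`, the value K = 4 and the non-overlapping grouping are illustrative assumptions; the text only requires K to be a positive integer greater than 1.

```python
from collections import deque

FRAME_MS = 10  # preset acquisition duration given as an example in the text
K = 4          # assumed value; the text only requires K > 1


class SegmentTrigger:
    """Buffers near-end frames and emits one K-frame segment every K frames."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)  # keeps only the latest k frames
        self.count = 0

    def push(self, frame):
        """Add one frame; return a K-frame segment on every K-th call, else None."""
        self.frames.append(frame)
        self.count += 1
        if self.count % self.k == 0:
            return list(self.frames)
        return None


# Hypothetical usage: eight 10 ms frames yield two segments of four frames.
trigger = SegmentTrigger(K)
segments = [s for s in (trigger.push(f"frame{i}") for i in range(8)) if s is not None]
```

With 10 ms frames and K = 4, this arrangement would run one delay estimate per 40 ms of audio instead of one per frame.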

The end time of the near-end voice signal is the time at which acquisition of the near-end voice signal is completed. If near-end voice signal segments are acquired according to the target acquisition duration and one segment is used as one near-end voice signal, the end time is the time at which acquisition of that segment is completed; if voice frames are acquired according to the preset acquisition duration, the end time is the acquisition end time of the chronologically last voice frame in the near-end voice signal.

In some embodiments, the far-end voice signal sequence may also be acquired in real time, and the far-end voice frames are acquired according to a preset acquisition time length, and the acquired far-end voice frames are sequentially arranged according to an acquisition sequence to form the far-end voice signal sequence; any continuous K far-end voice frames can be selected from the far-end voice signal sequence to construct a far-end voice signal segment, and the far-end voice signal segment can be used as a far-end voice signal.

The first duration may be set as required. For example, if near-end voice signal segments are acquired according to the target acquisition duration and one segment is used as one near-end voice signal, the first duration may be no less than a first data processing duration, i.e. the time taken to frequency-shift and downsample the near-end voice signal to obtain the target near-end complex signal. As another example, if near-end voice frames are acquired according to the preset acquisition duration and any K consecutive frames are selected to construct a near-end voice signal, the first duration may be no less than a second data processing duration, i.e. the time taken to frequency-shift and downsample the chronologically last voice frame of the near-end voice signal to obtain the complex signal corresponding to that frame.

S202, in response to the delay since the end time reaching a second duration, the target near-end complex signal is transformed to obtain a near-end frequency spectrum corresponding to the target near-end complex signal.

S203, in response to the delay since the end time reaching a third duration, a cross-correlation function between the near-end voice signal and the far-end voice signal is determined according to the far-end frequency spectrum and the near-end frequency spectrum.

Accordingly, step S104 includes: in response to the delay since the end time reaching a fourth duration, an echo delay of the echo signal is determined according to the cross-correlation function.

The second duration is longer than the first duration, the third duration is longer than the second duration, and the fourth duration is longer than the third duration.

Steps S201 to S203 and S104 are the four computationally intensive steps. Because they are executed at different times, the peak load that would result from executing all four at once is avoided, and load balancing is achieved.

In some embodiments, the near-end voice signal comprises at least one near-end voice frame. The difference between the second duration and the first duration, the difference between the third duration and the second duration, and the difference between the fourth duration and the third duration are equal; this difference is equal to N times a preset acquisition duration, where N is a positive integer and the preset acquisition duration is the duration of each near-end voice frame.

In some embodiments, where the difference is equal to one preset acquisition duration, the moment at which the delay since the end time of the currently processed near-end voice signal reaches the first duration may be the moment at which acquisition of a new near-end voice frame (denoted S1) is completed; the moment at which the delay reaches the second duration is the moment at which acquisition of the next near-end voice frame S2 after S1 is completed; the moment at which the delay reaches the third duration is the moment at which acquisition of the next near-end voice frame S3 after S2 is completed; and, similarly, the moment at which the delay reaches the fourth duration is the moment at which acquisition of the next near-end voice frame S4 after S3 is completed.
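The staggering described above can be sketched as a small schedule that maps each of the four heavy steps to the frame at whose completion it runs. The stage names, the function names and the default of one frame for the first duration are assumptions made for illustration; only the equal spacing of N frames between stages comes from the text.

```python
STAGES = ("far_end_fft", "near_end_fft", "cross_correlation", "peak_detection")


def stage_schedule(end_frame, first_delay_frames=1, n=1):
    """Map each stage to the frame index at whose completion it runs.

    end_frame: index of the last frame of the near-end signal being processed.
    first_delay_frames: the first duration expressed in frames (assumed 1 here).
    n: spacing between consecutive stages in frames (the N of the text).
    """
    return {stage: end_frame + first_delay_frames + i * n
            for i, stage in enumerate(STAGES)}


def stages_due(frame_index, schedule):
    """Return the stages to run when frame `frame_index` finishes acquisition."""
    return [stage for stage, when in schedule.items() if when == frame_index]


# Hypothetical usage: the processed signal ended with frame 10.
schedule = stage_schedule(end_frame=10)
```

In this sketch at most one heavy stage is due per frame, which is the load-balancing property the embodiment aims for.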

In other embodiments, each near-end voice signal segment in the near-end voice signal sequence may be used as a near-end voice signal, and the echo delay corresponding to each near-end voice signal is determined in sequence order. For each near-end voice signal, if the difference between its echo delay and the echo delay of the previous near-end voice signal is smaller than a preset difference, its echo delay is taken as the final target echo delay, and echo cancellation is then performed on the near-end voice signal using the target echo delay. The preset difference may be, for example, 10 ms.
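A minimal sketch of this stability check follows. The function name `pick_target_delay` and the sample delay values are hypothetical; the 10 ms threshold is the example value given in the text.

```python
def pick_target_delay(delays_ms, max_diff_ms=10.0):
    """Return the first delay that differs from its predecessor by less than
    max_diff_ms, or None if no such delay has been observed yet."""
    previous = None
    for delay in delays_ms:
        if previous is not None and abs(delay - previous) < max_diff_ms:
            return delay
        previous = delay
    return None


# Hypothetical per-segment estimates in milliseconds: the third one is
# within 10 ms of the second, so it is taken as the target echo delay.
target = pick_target_delay([120.0, 80.0, 84.0])
```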

It should be noted that, in the real-time call scenario, the near-end voice signal is also transmitted to the far-end in real time for playing, and after the target echo delay is determined, the echo cancellation processing may be performed on the near-end voice signal obtained after the target echo delay is obtained, and the target near-end voice signal after the echo cancellation is sent to the far-end for playing.

In an audio recording scene, the near-end voice signals may not be transmitted to the far-end in real time for playing, and after the target echo delay is determined, echo cancellation processing is performed on all the obtained near-end voice signals, and the target near-end voice signals after echo cancellation are stored.

In some scenarios, near-end voice frames are acquired in real time, and each near-end voice frame is frequency-shifted and downsampled to obtain its corresponding near-end voice frame complex signal, which may be cached in real time (the near-end voice frame complex signals corresponding to all near-end voice frames in the near-end voice signal form the target near-end complex signal). At the same time, the far-end voice frame corresponding to each near-end voice frame in the far-end voice signal segment may be frequency-shifted and downsampled to obtain its corresponding far-end voice frame complex signal, which is likewise cached in real time (the far-end voice frame complex signals corresponding to all far-end voice frames in the far-end voice signal form the target far-end complex signal).

As shown in Fig. 9, when acquisition of any near-end voice frame y1 is completed, that is, at the time t1 at which acquisition of y1 ends, far-end voice frame complex signals are selected from the cached far-end voice frame complex signals as far-end complex signal data q1 (for example, the most recent 1024 points of far-end voice frame complex signal may be selected; if the cached data do not exceed 1024 points, all cached far-end voice frame complex signals are used). The voice signal segment formed by the far-end voice frames corresponding to the selected data is taken as the far-end voice signal c1, and a Fourier transform is then applied to q1 to obtain the far-end frequency spectrum;

when acquisition of the next near-end voice frame y2 after y1 is completed, that is, at the time t2 at which acquisition of y2 ends, the near-end voice signal c2 corresponding to the far-end voice signal c1 is determined, the near-end voice frame complex signals corresponding to all near-end voice frames in c2 are selected from the cached near-end voice frame complex signals as near-end complex signal data q2, and a Fourier transform is then applied to q2 to obtain the near-end frequency spectrum;

when acquisition of the next near-end voice frame y3 after y2 is completed, that is, at the time t3 at which acquisition of y3 ends, a weighted cross-power spectrum is calculated from the far-end frequency spectrum and the near-end frequency spectrum and an inverse Fourier transform is applied to it to obtain the cross-correlation function;

when acquisition of the next near-end voice frame y4 after y3 is completed, that is, at the time t4 at which acquisition of y4 ends, peak detection is performed on the cross-correlation function to obtain the echo delay of the echo signal.
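The spectral steps at t1 to t4 can be illustrated with a small, dependency-free sketch. It uses a naive O(N²) DFT in place of an FFT and a phase-transform (PHAT) weight, 1/|cross-power|, as the weighting; this passage does not specify the weight, so that choice, the function names and the synthetic test signal are all assumptions.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT in this sketch)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def estimate_lag(far, near, eps=1e-12):
    """Return the circular lag (in samples) at which near best matches far."""
    far_spec = dft(far)                      # step at t1: far-end spectrum
    near_spec = dft(near)                    # step at t2: near-end spectrum
    cross = [y * f.conjugate() for y, f in zip(near_spec, far_spec)]
    weighted = [c / (abs(c) + eps) for c in cross]   # assumed PHAT weighting
    xcorr = dft(weighted, inverse=True)      # step at t3: cross-correlation
    return max(range(len(xcorr)), key=lambda i: abs(xcorr[i]))  # step at t4: peak


# Synthetic check: the "echo" is the far-end data delayed by 5 samples.
rng = random.Random(0)
far = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(32)]
near = [0.5 * far[(t - 5) % 32] for t in range(32)]
lag = estimate_lag(far, near)
```

The lag is expressed in samples of the downsampled complex signal, so converting it to a time delay would require dividing by the downsampled rate; because the correlation is circular, lags above half the block length correspond to negative delays.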

In this embodiment, the high-load steps of echo cancellation are distributed over different moments, which avoids the high load and slow data processing that would result from executing several high-load steps simultaneously and thereby achieves load balancing.

Referring to fig. 10, fig. 10 shows a block diagram of an echo cancellation device according to an embodiment of the present application, where a device 1100 includes:

the frequency shift module 1110 is configured to perform frequency shift processing on the near-end voice signal and the far-end voice signal corresponding to the near-end voice signal, respectively, to obtain a far-end complex signal corresponding to the far-end voice signal and a near-end complex signal corresponding to the near-end voice signal, where an echo signal generated by playing the corresponding far-end voice signal is superimposed on the near-end voice signal;

the sampling module 1120 is configured to perform downsampling processing on the far-end complex signal and the near-end complex signal, respectively, to obtain a target far-end complex signal corresponding to the far-end speech signal and a target near-end complex signal corresponding to the near-end speech signal;

the function determining module 1130 is configured to determine a cross-correlation function between the near-end speech signal and the far-end speech signal according to the target far-end complex signal and the target near-end complex signal;

a delay determination module 1140, configured to determine an echo delay of the echo signal according to the cross-correlation function;

the cancellation module 1150 is configured to perform echo cancellation on the near-end speech signal based on the echo delay of the echo signal.

Optionally, the function determining module 1130 is further configured to transform the target far-end complex signal to obtain a far-end spectrum corresponding to the target far-end complex signal; transforming the target near-end complex signal to obtain a near-end frequency spectrum corresponding to the target near-end complex signal; a cross-correlation function between the near-end speech signal and the far-end speech signal is determined from the far-end spectrum and the near-end spectrum.

Optionally, the near-end voice signal is a near-end voice signal segment in a near-end voice signal sequence acquired in real time. The function determining module 1130 is further configured to: in response to the delay since the end time of the near-end voice signal reaching a first duration, transform the target far-end complex signal to obtain a far-end spectrum corresponding to the target far-end complex signal; in response to the delay since the end time reaching a second duration, transform the target near-end complex signal to obtain a near-end spectrum corresponding to the target near-end complex signal, the second duration being longer than the first duration; and in response to the delay since the end time reaching a third duration, determine a cross-correlation function between the near-end voice signal and the far-end voice signal according to the far-end spectrum and the near-end spectrum, the third duration being longer than the second duration. Correspondingly, the delay determination module 1140 is further configured to determine an echo delay of the echo signal according to the cross-correlation function in response to the delay since the end time reaching a fourth duration, the fourth duration being longer than the third duration.

Optionally, the sampling module 1120 is further configured to perform multiple downsampling processing on the far-end complex signal and the near-end complex signal, respectively, to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal.

Optionally, the frequency band of interest is located in a frequency band of the target far-end complex signal and a frequency band of the target near-end complex signal; the sampling module 1120 is further configured to perform downsampling on the far-end complex signal and the near-end complex signal by a first multiple, to obtain an intermediate far-end complex signal corresponding to the far-end speech signal and an intermediate near-end complex signal corresponding to the near-end speech signal; respectively performing downsampling of a second multiple on the middle far-end complex signal and the middle near-end complex signal to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal; wherein the first multiple is higher than the second multiple; the product of the first multiple and the second multiple is equal to a target multiple, which is equal to a ratio of a highest frequency of the near-end complex signal to an upper limit frequency of the frequency band of interest.
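The relationship between the two multiples can be made concrete with a short sketch. The function name, the requirement that both multiples be integers greater than 1, the preference for the most balanced factor pair, and the example frequencies are assumptions; the text only requires that the first multiple exceed the second and that their product equal the target multiple.

```python
import math


def split_downsampling(highest_hz, band_upper_hz):
    """Split the target multiple into (first, second) with first > second > 1."""
    target = highest_hz / band_upper_hz
    if target != int(target):
        raise ValueError("this sketch assumes an integer target multiple")
    target = int(target)
    # Search from the square root downwards to find the most balanced pair.
    for second in range(math.isqrt(target), 1, -1):
        first, remainder = divmod(target, second)
        if remainder == 0 and first > second:
            return first, second
    raise ValueError("no two-stage split with first > second > 1")


# Hypothetical figures: highest frequency 8000 Hz, band of interest up to 500 Hz,
# so the target multiple is 16 and is split as 8 followed by 2.
first_multiple, second_multiple = split_downsampling(8000, 500)
```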

Optionally, the sampling module 1120 is further configured to determine a transition band of the anti-aliasing filter corresponding to the first multiple based on the upper limit frequency of the frequency band of interest, the first multiple, and the highest frequency of the near-end complex signal; the lower limit frequency of the transition band is equal to the upper limit frequency of the concerned frequency band, the upper limit frequency of the transition band is not less than half of the reference frequency, and the reference frequency is equal to the quotient of the highest frequency of the near-end complex signal and the first multiple; filtering the far-end complex signal and the near-end complex signal through an anti-aliasing filter based on the transition zone to obtain a reference intermediate far-end complex signal corresponding to the far-end complex signal and a reference intermediate near-end complex signal corresponding to the near-end complex signal, wherein signals in the transition zone in the far-end complex signal and the near-end complex signal are attenuated after filtering; and respectively extracting the reference intermediate far-end complex signal and the reference intermediate near-end complex signal according to the sampling interval corresponding to the first multiple to obtain an intermediate far-end complex signal corresponding to the far-end voice signal and an intermediate near-end complex signal corresponding to the near-end voice signal.
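The first-stage processing described above (derive the transition band, filter, then keep every M-th sample) might look as follows. This is only a sketch: the function names are hypothetical, the default upper edge is simply the smallest value the stated constraint allows, and the two-tap averaging filter used in the usage line is a placeholder rather than a designed anti-aliasing filter.

```python
def transition_band(band_upper_hz, highest_hz, first_multiple, upper_hz=None):
    """Return (lower, upper) edges of the transition band in Hz.

    lower = upper limit of the band of interest;
    upper >= half of the reference frequency (highest_hz / first_multiple).
    """
    reference_hz = highest_hz / first_multiple
    lower_hz = band_upper_hz
    if upper_hz is None:
        upper_hz = max(reference_hz / 2, lower_hz)  # assumed default
    if upper_hz < reference_hz / 2:
        raise ValueError("upper edge must be at least half the reference frequency")
    return lower_hz, upper_hz


def filter_and_extract(samples, taps, multiple):
    """Apply a causal FIR filter, then keep every `multiple`-th output sample."""
    filtered = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]
        filtered.append(acc)
    return filtered[::multiple]


# Hypothetical usage with placeholder taps and a constant input.
band = transition_band(band_upper_hz=500, highest_hz=8000, first_multiple=4)
decimated = filter_and_extract([1.0] * 8, taps=[0.5, 0.5], multiple=4)
```

The same `filter_and_extract` helper accepts complex samples, so it could be applied to both the far-end and the near-end complex signals.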

Optionally, the function determining module 1130 is further configured to determine a cross-power spectrum and a weight according to the far-end spectrum and the near-end spectrum; weight the cross-power spectrum based on the weight to obtain a weighted cross-power spectrum; and perform inverse Fourier transform processing on the weighted cross-power spectrum to obtain a cross-correlation function between the near-end voice signal and the far-end voice signal.

Optionally, the cancellation module 1150 is further configured to delay the far-end voice signal according to an echo delay of the echo signal, so as to obtain a delayed far-end voice signal; and performing echo cancellation processing on the near-end voice signal according to the delayed far-end voice signal.
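A rough sketch of these two operations is given below. The delay step follows the text directly; the cancellation step uses a normalized least-mean-squares (NLMS) adaptive filter, which is a common choice substituted here for illustration and is not stated in the text. All names, the tap count, the step size and the synthetic signals are assumptions.

```python
import random


def delay_signal(samples, delay_samples):
    """Delay a signal by prepending zeros and trimming to the original length."""
    delay_samples = min(delay_samples, len(samples))
    return [0.0] * delay_samples + list(samples[:len(samples) - delay_samples])


def nlms_cancel(near, far_delayed, taps=4, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of far_delayed from near (NLMS)."""
    weights = [0.0] * taps
    residual = []
    for n in range(len(near)):
        window = [far_delayed[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_estimate = sum(w * x for w, x in zip(weights, window))
        error = near[n] - echo_estimate
        norm = sum(x * x for x in window) + eps
        weights = [w + mu * error * x / norm for w, x in zip(weights, window)]
        residual.append(error)
    return residual


# Synthetic check: the near-end signal is a pure echo (gain 0.6, delay 7 samples).
rng = random.Random(1)
far = [rng.gauss(0, 1) for _ in range(600)]
echo_delay = 7
near = [0.6 * v for v in delay_signal(far, echo_delay)]
residual = nlms_cancel(near, delay_signal(far, echo_delay))
tail_energy = sum(e * e for e in residual[-50:])
```

Aligning the far-end reference with the estimated echo delay first lets the adaptive filter stay short, since it no longer has to span the whole delay.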

Optionally, the frequency shift module 1110 is further configured to determine a target frequency shift amount according to a voice scene of the near-end voice signal; and respectively performing frequency shift processing of target frequency shift quantity on the far-end voice signal and the near-end voice signal to obtain a far-end complex signal corresponding to the far-end voice signal and a near-end complex signal corresponding to the near-end voice signal.
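Frequency shifting a real signal into a complex one can be sketched as multiplication by a complex exponential. The function name, the downward direction of the shift (negative exponent) and the example values are assumptions; how the target frequency shift amount is chosen for a given voice scene is outside this sketch.

```python
import cmath


def frequency_shift(samples, shift_hz, sample_rate_hz):
    """Multiply each sample by exp(-j*2*pi*shift_hz*n/sample_rate_hz)."""
    step = -2j * cmath.pi * shift_hz / sample_rate_hz
    return [x * cmath.exp(step * n) for n, x in enumerate(samples)]


# Hypothetical usage: shifting a constant signal by a quarter of the sample
# rate produces the rotating sequence 1, -j, -1, j.
shifted = frequency_shift([1.0, 1.0, 1.0, 1.0], shift_hz=4000, sample_rate_hz=16000)
```

The same target frequency shift amount would be applied to both the far-end and the near-end voice signals so that their complex signals remain comparable.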

It should be noted that, in the present application, the device embodiment and the foregoing method embodiment correspond to each other, and specific principles in the device embodiment may refer to the content in the foregoing method embodiment, which is not described herein again.

Fig. 10 shows a block diagram of an electronic device for performing an echo cancellation method according to an embodiment of the present application. The electronic device may be the second terminal 10 or the first terminal 20 in fig. 1, and it should be noted that, the computer system 1200 of the electronic device shown in fig. 10 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.

As shown in fig. 10, the computer system 1200 includes a central processing unit (Central Processing Unit, CPU) 1201 which can perform various appropriate actions and processes, such as performing the methods in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a random access Memory (Random Access Memory, RAM) 1203. In the RAM 1203, various programs and data required for the system operation are also stored. The CPU 1201, ROM 1202, and RAM 1203 are connected to each other through a bus 1204. An Input/Output (I/O) interface 1205 is also connected to bus 1204.

The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), a speaker, and the like; a storage section 1208 including a hard disk or the like; and a communication section 1209 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the Internet. The drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 1210 as needed, so that a computer program read out therefrom is installed into the storage section 1208 as needed.

In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication section 1209, and/or installed from the removable medium 1211. When the computer program is executed by the central processing unit (CPU) 1201, the various functions defined in the system of the present application are performed.

It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units involved in the embodiments of the present application may be implemented by means of software, or may be implemented by means of hardware, and the described units may also be provided in a processor. In some cases, the names of the units do not constitute a limitation on the units themselves.

As another aspect, the present application also provides a computer-readable storage medium that may be included in the electronic device described in the above embodiments; or may exist alone without being incorporated into the electronic device. The computer readable storage medium carries computer readable instructions which, when executed by a processor, implement the method of any of the above embodiments.

According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the electronic device to perform the method of any of the embodiments described above.

In the present embodiment, the term "module" or "unit" refers to a computer program, or a part of a computer program, that has a predetermined function and works together with other relevant parts to achieve a predetermined object; it may be implemented in whole or in part by software, hardware (e.g., a processing circuit or a memory), or a combination thereof. Likewise, a processor (or a plurality of processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that includes the functionality of that module or unit.

It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit, in accordance with embodiments of the present application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.

From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause an electronic device (which may be a personal computer, a server, a touch terminal, a network device, etc.) to perform the method according to the embodiments of the present application.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present application and not to limit it. Although the present application has been described in detail with reference to the foregoing embodiments, one of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. An echo cancellation method, the method comprising:

frequency shifting is carried out on a near-end voice signal and a far-end voice signal corresponding to the near-end voice signal respectively to obtain a far-end complex signal corresponding to the far-end voice signal and a near-end complex signal corresponding to the near-end voice signal, and echo signals generated by playing the corresponding far-end voice signals are superimposed in the near-end voice signal;

respectively carrying out downsampling treatment on the far-end complex signal and the near-end complex signal to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal;

determining a cross-correlation function between the near-end speech signal and the far-end speech signal according to the target far-end complex signal and the target near-end complex signal;

determining an echo delay of the echo signal according to the cross-correlation function;

and performing echo cancellation on the near-end voice signal based on the echo delay of the echo signal.

2. The method of claim 1, wherein said determining a cross-correlation function between the near-end speech signal and the far-end speech signal from the target far-end complex signal and the target near-end complex signal comprises:

transforming the target far-end complex signal to obtain a far-end frequency spectrum corresponding to the target far-end complex signal;

transforming the target near-end complex signal to obtain a near-end frequency spectrum corresponding to the target near-end complex signal;

and determining a cross-correlation function between the near-end voice signal and the far-end voice signal according to the far-end frequency spectrum and the near-end frequency spectrum.

3. The method of claim 2, wherein the near-end speech signal is a near-end speech signal segment in a sequence of near-end speech signals acquired in real-time;

the transforming the target far-end complex signal to obtain a far-end spectrum corresponding to the target far-end complex signal includes:

in response to the delay since the end time of the near-end voice signal reaching a first duration, transforming the target far-end complex signal to obtain a far-end frequency spectrum corresponding to the target far-end complex signal;

the transforming the target near-end complex signal to obtain a near-end spectrum corresponding to the target near-end complex signal includes:

in response to the delay duration measured from the end time reaching a second time length, transforming the target near-end complex signal to obtain a near-end frequency spectrum corresponding to the target near-end complex signal, the second time length being longer than the first time length;

the determining a cross-correlation function between the near-end speech signal and the far-end speech signal according to the far-end spectrum and the near-end spectrum comprises:

in response to the delay duration measured from the end time reaching a third time length, determining a cross-correlation function between the near-end speech signal and the far-end speech signal from the far-end spectrum and the near-end spectrum, the third time length being longer than the second time length;

said determining an echo delay of said echo signal from said cross-correlation function comprises:

in response to the delay duration measured from the end time reaching a fourth time length, determining an echo delay of the echo signal according to the cross-correlation function, the fourth time length being longer than the third time length.

4. The method of claim 3, wherein the near-end speech signal comprises at least one near-end speech frame; the time difference between the second time length and the first time length, the time difference between the third time length and the second time length, and the time difference between the fourth time length and the third time length are equal to one another; each time difference is equal to N times a preset acquisition time length, N being a positive integer; and the preset acquisition time length is the duration of each acquired near-end speech frame.
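As one possible reading of the staged timing in claims 3 and 4, the four processing stages could be triggered at equally spaced delays after the end time of a near-end segment, so that the work is spread over successive frames. The sketch below is illustrative only; the stage names, the function names, and the use of milliseconds are assumptions.

```python
def stage_schedule(first_ms, frame_ms, n):
    """Return the delay (ms after the segment's end time) at which each stage fires."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    step = n * frame_ms  # equal spacing between consecutive stages
    names = ["far_spectrum", "near_spectrum", "cross_correlation", "echo_delay"]
    return {name: first_ms + i * step for i, name in enumerate(names)}


def stages_due(schedule, elapsed_ms):
    """List, in trigger order, the stages whose delay has already been reached."""
    return [name for name, at in schedule.items() if elapsed_ms >= at]


# Example: 10 ms frames, first stage 10 ms after the end time, N = 2.
schedule = stage_schedule(first_ms=10, frame_ms=10, n=2)
```

With these example values the stages fire at 10 ms, 30 ms, 50 ms and 70 ms, so each stage runs in a different frame interval.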

5. The method of claim 1, wherein the near-end speech signal comprises K consecutive near-end speech frames, K being a positive integer greater than 1.

6. The method according to claim 1, wherein the down-sampling the far-end complex signal and the near-end complex signal to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal, respectively, includes:

performing downsampling processing a plurality of times on each of the far-end complex signal and the near-end complex signal to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal.

7. The method of claim 6, wherein a frequency band of interest lies within the frequency band of the target far-end complex signal and within the frequency band of the target near-end complex signal;

the performing downsampling processing on the far-end complex signal and the near-end complex signal for multiple times to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal, respectively, including:

respectively performing downsampling on the far-end complex signal and the near-end complex signal by a first multiple to obtain an intermediate far-end complex signal corresponding to the far-end voice signal and an intermediate near-end complex signal corresponding to the near-end voice signal;

performing downsampling by a second multiple on the intermediate far-end complex signal and the intermediate near-end complex signal, respectively, to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal; wherein the first multiple is greater than the second multiple, and the product of the first multiple and the second multiple is equal to a target multiple, the target multiple being equal to the ratio of the highest frequency of the near-end complex signal to the upper limit frequency of the frequency band of interest.
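The arithmetic constraint in claim 7 can be illustrated with a short sketch that enumerates admissible (first multiple, second multiple) pairs. It assumes integer frequencies whose ratio is an integer and a second multiple of at least 2; the function name is hypothetical.

```python
def split_target_multiple(highest_hz, band_upper_hz):
    """Return the target multiple and every (first, second) pair with first > second."""
    target, remainder = divmod(highest_hz, band_upper_hz)
    if remainder:
        raise ValueError("this sketch assumes an integer target multiple")
    pairs = [(target // second, second)
             for second in range(2, target)
             if target % second == 0 and target // second > second]
    return target, pairs


# Example: highest frequency 8000 Hz, band of interest up to 500 Hz.
target, pairs = split_target_multiple(8000, 500)
```

For the example values the target multiple is 16 and the only admissible split is a first multiple of 8 followed by a second multiple of 2; the split 4 × 4 is excluded because the first multiple must be greater than the second.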

8. The method of claim 7, wherein downsampling the far-end complex signal and the near-end complex signal by a first multiple to obtain an intermediate far-end complex signal corresponding to the far-end speech signal and an intermediate near-end complex signal corresponding to the near-end speech signal, respectively, comprises:

determining a transition band of the anti-aliasing filter corresponding to the first multiple based on the upper limit frequency of the frequency band of interest, the first multiple, and the highest frequency of the near-end complex signal; the lower limit frequency of the transition band being equal to the upper limit frequency of the frequency band of interest, the upper limit frequency of the transition band being not less than half of a reference frequency, and the reference frequency being equal to the quotient of the highest frequency of the near-end complex signal and the first multiple;

filtering the far-end complex signal and the near-end complex signal through the anti-aliasing filter based on the transition band to obtain a reference intermediate far-end complex signal corresponding to the far-end complex signal and a reference intermediate near-end complex signal corresponding to the near-end complex signal; wherein signal components of the far-end complex signal that lie within the transition band are attenuated by the filtering;

and respectively extracting the reference intermediate far-end complex signal and the reference intermediate near-end complex signal according to the sampling interval corresponding to the first multiple to obtain an intermediate far-end complex signal corresponding to the far-end voice signal and an intermediate near-end complex signal corresponding to the near-end voice signal.
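The three steps of claim 8 (choosing the transition band, anti-aliasing filtering, and extraction at the sampling interval) might be sketched as below. This is an assumption-laden illustration: the Hamming-windowed sinc design, the tap count, the explicit sampling-rate parameter `fs`, and all function names are hypothetical, and the claim does not prescribe any particular filter design.

```python
import math


def transition_band(band_upper_hz, first_multiple, highest_hz):
    """Lower edge = band upper limit; upper edge = smallest value the claim allows."""
    reference_hz = highest_hz / first_multiple
    lower = band_upper_hz
    upper = max(lower, reference_hz / 2)
    return lower, upper


def lowpass_taps(cutoff_hz, fs, num_taps=63):
    """Hamming-windowed sinc low-pass FIR, normalised to unit gain at 0 Hz."""
    fc = cutoff_hz / fs
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [h / total for h in taps]


def filter_and_decimate(signal, taps, factor):
    """Apply the FIR filter, computing only every `factor`-th output sample."""
    last = len(signal) - len(taps)
    return [sum(h * signal[i + len(taps) - 1 - k] for k, h in enumerate(taps))
            for i in range(0, last + 1, factor)]


# Example: band of interest up to 500 Hz, first multiple 4, highest frequency 8000 Hz.
low, high = transition_band(500, 4, 8000)
taps = lowpass_taps(cutoff_hz=(low + high) / 2, fs=16000)
```

Computing only every `factor`-th output combines the filtering and extraction steps, which avoids evaluating filter outputs that would be discarded anyway.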

9. The method of claim 2, wherein said determining a cross-correlation function between the near-end speech signal and the far-end speech signal from the far-end spectrum and the near-end spectrum comprises:

determining a cross power spectrum and a weight according to the far-end spectrum and the near-end spectrum;

weighting the cross power spectrum based on the weight to obtain a weighted cross power spectrum;

and performing inverse Fourier transform processing on the weighted cross power spectrum to obtain a cross-correlation function between the near-end voice signal and the far-end voice signal.
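For illustration, the weighting of claim 9 could take the form of the well-known phase transform (PHAT), in which each bin of the cross power spectrum is divided by its magnitude. The claim does not name a specific weight, so the PHAT choice here is an assumption, as are the function names and the naive O(N²) transform used to keep the sketch self-contained.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Naive discrete Fourier transform (forward, or inverse with 1/N scaling)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def weighted_cross_correlation(far_c, near_c, eps=1e-12):
    """Cross power spectrum -> PHAT weight -> inverse transform."""
    far_spec = dft(far_c)
    near_spec = dft(near_c)
    cross = [nr * fr.conjugate() for nr, fr in zip(near_spec, far_spec)]
    weights = [1.0 / (abs(c) + eps) for c in cross]  # assumed PHAT weighting
    weighted = [w * c for w, c in zip(weights, cross)]
    return dft(weighted, inverse=True)


# Synthetic check: the near-end signal is the far-end signal circularly delayed by 5.
random.seed(0)
n = 64
far_c = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
near_c = [far_c[(t - 5) % n] for t in range(n)]
r = weighted_cross_correlation(far_c, near_c)
peak_lag = max(range(n), key=lambda t: abs(r[t]))
```

With PHAT weighting only the phase of each bin is retained, so for a pure delay the cross-correlation collapses to a single sharp peak at the delay lag.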

10. The method according to any of claims 1-9, wherein echo cancellation of the near-end speech signal based on an echo delay of the echo signal comprises:

performing delay processing on the far-end voice signal according to the echo delay of the echo signal to obtain a delayed far-end voice signal;

and performing echo cancellation processing on the near-end voice signal according to the delayed far-end voice signal.
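The two steps of claim 10 might be sketched as follows, assuming a normalised least-mean-squares (NLMS) adaptive filter as the echo canceller. The claim does not specify the cancellation algorithm, so NLMS, the tap count, the step size, and the function names are all assumptions made for illustration.

```python
import random


def delay_signal(signal, delay):
    """Delay a signal by `delay` samples, zero-padding the start and keeping the length."""
    return ([0.0] * delay + list(signal))[:len(signal)]


def nlms_cancel(near, far_delayed, taps=4, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo from the near-end signal."""
    w = [0.0] * taps
    buf = [0.0] * taps
    residual = []
    for d, x in zip(near, far_delayed):
        buf = [x] + buf[:-1]                      # most recent far-end sample first
        echo_estimate = sum(wi * bi for wi, bi in zip(w, buf))
        e = d - echo_estimate                     # echo-cancelled output sample
        norm = sum(b * b for b in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        residual.append(e)
    return residual


# Synthetic check: the near-end signal contains only an echo delayed by 40 samples.
random.seed(1)
far = [random.uniform(-1.0, 1.0) for _ in range(2000)]
echo_delay = 40
near = [0.5 * v for v in delay_signal(far, echo_delay)]
residual = nlms_cancel(near, delay_signal(far, echo_delay))
tail_echo = sum(v * v for v in near[-500:])
tail_residual = sum(e * e for e in residual[-500:])
```

Aligning the far-end reference with the estimated delay first means the adaptive filter only has to model the remaining short echo path, which is why a handful of taps suffices in this example.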

11. The method according to any one of claims 1-9, wherein the performing frequency shift processing on the near-end speech signal and the far-end speech signal corresponding to the near-end speech signal to obtain a far-end complex signal corresponding to the far-end speech signal and a near-end complex signal corresponding to the near-end speech signal respectively includes:

determining a target frequency shift amount according to the voice scene of the near-end voice signal;

and performing frequency shift processing by the target frequency shift amount on the far-end voice signal and the near-end voice signal, respectively, to obtain a far-end complex signal corresponding to the far-end voice signal and a near-end complex signal corresponding to the near-end voice signal.
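As a hedged illustration of claim 11, the target frequency shift amount could be looked up from the voice scene and then applied identically to both signals. The scene names and shift values in the table below are placeholders invented for this sketch, not values from the application, and the downward direction of the shift is likewise an assumption.

```python
import cmath
import math

# Hypothetical scene-to-shift table; names and values are placeholders only.
SCENE_SHIFT_HZ = {"voice_call": 1000.0, "music_sharing": 2000.0}


def target_shift_for_scene(scene, default_hz=1000.0):
    """Look up the target frequency shift amount for a voice scene."""
    return SCENE_SHIFT_HZ.get(scene, default_hz)


def frequency_shift(signal, shift_hz, fs):
    """Shift the spectrum down by shift_hz, turning a real signal into a complex one."""
    return [v * cmath.exp(-2j * math.pi * shift_hz * n / fs)
            for n, v in enumerate(signal)]


def shift_pair(far, near, scene, fs):
    """Apply the same scene-dependent shift to the far-end and near-end signals."""
    shift_hz = target_shift_for_scene(scene)
    return frequency_shift(far, shift_hz, fs), frequency_shift(near, shift_hz, fs)


# Example: a 1000 Hz tone shifted by 1000 Hz lands at 0 Hz (plus an image at -2000 Hz).
fs = 8000
tone = [math.cos(2 * math.pi * 1000 * n / fs) for n in range(800)]
far_c, near_c = shift_pair(tone, tone, "voice_call", fs)
dc_component = sum(far_c) / len(far_c)
```

Using the same shift for both signals keeps their relative delay unchanged, so the subsequent cross-correlation is unaffected apart from a constant phase factor.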

12. An echo cancellation device, the device comprising:

a frequency shifting module, configured to perform frequency shift processing on a near-end voice signal and a far-end voice signal corresponding to the near-end voice signal, respectively, to obtain a far-end complex signal corresponding to the far-end voice signal and a near-end complex signal corresponding to the near-end voice signal, wherein an echo signal generated by playing the corresponding far-end voice signal is superimposed in the near-end voice signal;

a sampling module, configured to perform downsampling processing on the far-end complex signal and the near-end complex signal, respectively, to obtain a target far-end complex signal corresponding to the far-end voice signal and a target near-end complex signal corresponding to the near-end voice signal;

a function determining module, configured to determine a cross-correlation function between the near-end voice signal and the far-end voice signal according to the target far-end complex signal and the target near-end complex signal;

a delay determining module, configured to determine an echo delay of the echo signal according to the cross-correlation function;

and a cancellation module, configured to perform echo cancellation on the near-end voice signal based on the echo delay of the echo signal.

13. An electronic device, comprising:

a processor; and

a memory having stored thereon computer readable instructions which, when executed by the processor, implement the method of any of claims 1-11.

14. A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the method of any of claims 1-11.