WO2020125325A1

WO2020125325A1 - Method for eliminating echo and device

Info

Publication number: WO2020125325A1
Application number: PCT/CN2019/120452
Authority: WO
Inventors: 张真赫; 刘安; 熊张亮
Original assignee: 华为技术有限公司
Priority date: 2018-12-17
Filing date: 2019-11-23
Publication date: 2020-06-25
Also published as: CN111402910B; CN111402910A

Abstract

A method for eliminating echo, applied to a terminal device. The method comprises: outputting an audio reference signal (201); acquiring an audio input signal, the audio input signal comprising echo of the audio reference signal (202); determining a time delay and an attenuation coefficient of an echo channel according to the echo of the audio reference signal (203); and eliminating echo of an audio content signal according to the time delay and the attenuation coefficient (204). By means of the method, the interference of echo to voice input of a user is eliminated, and the quality of input voice is improved.

Description

Method and equipment for eliminating echo

This application claims priority to Chinese patent application No. 201811542603.9, filed with the State Intellectual Property Office of China on December 17, 2018 and entitled "A method and equipment for echo cancellation", the entire content of which is incorporated herein by reference.

Technical field

The invention relates to the field of information processing, in particular to a method and device for eliminating echo.

Background

As a current human-computer interaction technology, voice is more and more widely used. At present, there are many terminal devices on the market that interact via voice, such as mobile phones, smart speakers, set-top boxes, smart TVs, and smart remote controls.

When the terminal device communicates with the user through voice, it is necessary to acquire and recognize voice first. In the process of voice interaction with the user, the terminal device often plays audio and video content at the same time. The played sound will generate an echo in the microphone, which affects the user's voice input and thus affects the accuracy of voice recognition.

In the prior art, there are some echo cancellation methods, such as an adaptive filtering algorithm, which can cancel the echo to a certain extent, but the calculation is complicated and the effect is relatively poor.

Summary of the invention

Embodiments of the present invention provide a method and terminal device for eliminating echoes to reduce the interference of echoes on user voice input and improve the quality of input voice.

In a first aspect, an embodiment of the present invention provides an echo cancellation method, which is applied to a terminal device and includes: outputting an audio reference signal; collecting an audio input signal, the audio input signal including an echo of the audio reference signal; determining the delay and attenuation coefficient of the echo channel according to the echo of the audio reference signal; and eliminating the echo of the audio content signal in the audio input signal according to the delay and attenuation coefficient.

The above method uses the audio reference signal to obtain the characteristic parameters of the echo channel, thereby eliminating the echo and improving the voice input quality.

In a possible design, determining the attenuation coefficient of the echo channel includes: calculating the amplitude of the echo signal at the frequency of the audio reference signal by Fourier transform of the audio input signal; the ratio of the amplitude of the echo signal at the frequency of the audio reference signal to the amplitude of the output audio reference signal is the attenuation coefficient of the echo signal.

In another possible design, the above method further includes filtering the audio input signal through a band-pass filter to obtain the echo of the audio reference signal.

In another possible design, determining the attenuation coefficient of the echo channel includes: calculating the amplitude of the echo signal at the frequency of the audio reference signal by means of root mean square; the ratio of the amplitude of the echo signal at the frequency of the audio reference signal to the amplitude of the output audio reference signal is the attenuation coefficient of the echo signal.

In another possible design, determining the delay of the echo channel includes: recording the first time, when the audio reference signal starts to be output, and recording the second time, when the echo of the audio reference signal starts to be detected in the audio input signal; the delay is the time difference between the second time and the first time.

In another possible design, the frequency of the audio reference signal is greater than the frequency range of human ear audible sound.

In another possible design, the output of the audio reference signal is performed when the terminal device is turned on, or periodically.

In a second aspect, an embodiment of the present invention provides a terminal device that has the function of implementing the above method. The function can be realized by hardware, or can also be realized by hardware executing corresponding software. The hardware or software includes one or more units corresponding to the above functions, such as an audio output unit, an audio input unit, and a processing unit.

In a possible design, the structure of the terminal device includes a processor and a memory, where the memory is used to store application program codes that support the above method, and the processor is configured to execute the program stored in the memory.

In a third aspect, an embodiment of the present invention provides a computer storage medium for storing computer software instructions for the above-mentioned terminal device, which includes a program designed to execute the above-mentioned method.

The method and terminal device for echo cancellation provided by the embodiments of the present invention achieve echo cancellation by outputting an audio reference signal and collecting its echo, thereby determining the characteristic parameters of the echo channel. This greatly reduces the interference of echo with the user's voice input and improves the quality of the input voice, which in turn can improve the quality and performance of subsequent speech processing, such as speech recognition.

Brief description of the drawings

FIG. 1 is a schematic diagram of a system architecture for echo cancellation provided by an embodiment of the present invention;

FIG. 2 is a schematic flowchart of an echo cancellation method according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a terminal device according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of another terminal device according to an embodiment of the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, the following describes the embodiments of the present invention in further detail with reference to the accompanying drawings.

When the terminal device interacts with the user by voice, audio and video content may be playing at the same time. The played sound generates an echo in the microphone, so the user's voice input is usually subject to interference from that echo, which reduces the terminal device's ability to recognize the voice input.

The echo cancellation method provided by the embodiment of the present invention is applied to the system shown in FIG. 1, and the system includes: a terminal device 101, a speaker 102, and a microphone 103. The terminal device shown in FIG. 1 may be a personal computer (PC), a mobile phone, a set-top box, a smart speaker, a smart TV, or another device. The terminal device may directly include the speaker 102 and the microphone 103, as a mobile phone does. The terminal device can also be connected to an external speaker and microphone, such as the external speaker and microphone of a personal computer, or a set-top box that uses an external TV as the audio and video playback device.

The terminal device 101 is used to output the audio content signal of the audiovisual program content to the speaker 102, and also output the audio reference signal to the speaker. The audio reference signal is usually a high-frequency signal, the frequency of which is greater than the frequency range of the human ear audible sound. The frequency range of the sound that can be heard by the general human ear is 20 Hz to 20,000 Hz, so the frequency of the audio reference signal can be selected above 20,000 Hz. The terminal device is used to collect the audio input signal of the microphone and process it to eliminate the echo mixed in the audio input signal and restore the user's voice input.

The speaker 102 is used to play audio signals output by the terminal device, including audio content signals or audio reference signals. The sound of the played audio content signal can be listened to by the user, while the sound of the played audio reference signal cannot be heard by the user, which does not affect the user experience. The sound of the audio content signal played by the speaker or the sound of the audio reference signal is propagated into the microphone 103 to generate an echo.

The microphone 103 is used to receive the voice of the user during voice interaction with the terminal device. The sound received by the microphone may be mixed with the echo of the audio content signal played by the speaker or the echo of the audio reference signal.

The sound output from the speaker will generate an echo in the microphone, and the causes include the diffraction and reflection of the sound. The echo signal can be considered as a sound signal after the audio signal passes through the echo channel. The effects of the echo channel on sound include: time delay and energy attenuation. In general, the effect of the echo channel on the audio content signal is similar to the effect on the audio reference signal. Therefore, it is possible to analyze the audio reference signal to obtain the echo channel characteristic parameters, including time delay and attenuation coefficient, and then use these two echo channel characteristic parameters to eliminate the echo of the audio content signal.
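The echo channel model described above (a time delay plus an energy attenuation) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the channel is treated as a single pure delay with one constant gain, and the function name `apply_echo_channel` and its parameters are hypothetical, not taken from the application.

```python
def apply_echo_channel(signal, delay_samples, attenuation):
    """Model the echo channel as a pure delay plus a constant gain.

    Sample n of the result equals attenuation * signal[n - delay_samples];
    samples before the echo arrives are zero.
    """
    if delay_samples < 0:
        raise ValueError("delay_samples must be non-negative")
    echo = [0.0] * len(signal)
    for n in range(delay_samples, len(signal)):
        echo[n] = attenuation * signal[n - delay_samples]
    return echo


# A short played signal, delayed by 2 samples and attenuated by a factor of 0.5.
played = [1.0, 2.0, 3.0, 4.0, 5.0]
echo = apply_echo_channel(played, delay_samples=2, attenuation=0.5)
print(echo)  # [0.0, 0.0, 0.5, 1.0, 1.5]
```

Because the same two parameters are assumed to act on both the reference signal and the content signal, measuring them once on the reference signal is enough to predict the echo of the content signal.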

As shown in FIG. 1, the terminal device 101 outputs an audio signal to the speaker 102: either an audio content signal $X_0(n)$ or an audio reference signal $C_0(n)$. The sound from the speaker propagates to the microphone, where it produces the echo signal $X(n)$ of the audio content signal or the echo signal $C(n)$ of the audio reference signal. When the user interacts with the system, the microphone 103 collects the user's voice input $S_0(n)$; the collected voice signal $S(n)$ contains both the user's voice input $S_0(n)$ and, possibly, the echo signal $X(n)$ of the audio content signal. The terminal device needs to eliminate the echo signal $X(n)$ from the collected voice signal $S(n)$, that is, to evaluate formula (1):

$$S_0(n) = S(n) - X(n) \tag{1}$$

Applied to the system shown in FIG. 1 above, an embodiment of the present invention provides an echo cancellation method. As shown in Fig. 2, it specifically includes the following steps.

201. Output an audio reference signal.

As mentioned above, so as not to disturb the user, the frequency of the audio reference signal $C_0(n)$ is usually chosen in a high frequency band that is inaudible to the human ear; for example, 20 kHz may be selected. If the terminal device is playing audio and video program content, the audio reference signal and the audio content signal can be superimposed and output without affecting the user's listening to the program content. An example of $C_0(n)$ is:

$$C_0(n) = A_0 \sin\left(2\pi \frac{f_0}{f_s} n\right) \tag{2}$$

where $A_0$ is the amplitude of the audio reference signal, $f_0$ is the frequency of the audio reference signal, and $f_s$ is the sampling frequency used by the system for digitization.

The sampling frequency of the system needs to be greater than twice the frequency of the audio reference signal. For example, when the frequency of the audio reference signal is 20 kHz, the commonly used sampling frequency of 44.1 kHz can meet this requirement.
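As an illustration of formula (2) and of the sampling-rate requirement, the following Python sketch generates the reference tone and rejects a sampling frequency that is not more than twice the reference frequency. The function name `make_reference_signal` and the amplitude value 0.1 are assumptions chosen for the example; the application does not prescribe them.

```python
import math


def make_reference_signal(amplitude, f0_hz, fs_hz, num_samples):
    """Return samples of C0(n) = A0 * sin(2*pi*f0/fs * n), as in formula (2)."""
    if fs_hz <= 2 * f0_hz:
        raise ValueError("sampling frequency must exceed twice the reference frequency")
    step = 2.0 * math.pi * f0_hz / fs_hz  # phase advance per sample, in radians
    return [amplitude * math.sin(step * n) for n in range(num_samples)]


# 20 kHz tone sampled at 44.1 kHz: 44.1 kHz > 2 * 20 kHz, so the check passes.
reference = make_reference_signal(amplitude=0.1, f0_hz=20_000, fs_hz=44_100, num_samples=256)
```

With a 40 kHz sampling frequency the same call would raise `ValueError`, since 40 kHz is not greater than twice 20 kHz.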

The audio reference signal can be output when the terminal device is turned on and the characteristic parameters of the echo channel can be determined. After the determination of the characteristic parameters is completed, the output of the audio reference signal can be stopped. Subsequent echo cancellation of voice input is performed according to the determined parameters.

The system can also periodically output audio reference signals and determine the echo channel characteristic parameters, and constantly update the echo channel characteristic parameters to adapt to changes in the possible surrounding environment of the terminal device.

202. Collect audio input signals.

The audio input signal S(n) of the microphone includes the echo C(n) of the audio reference signal through the echo channel in addition to the possible voice input of the terminal device user.

203. Determine the delay and attenuation coefficient of the echo channel according to the echo signal of the audio reference signal.

When the output of the audio reference signal starts in step 201, the output start time $T_1$ is recorded.

Perform a discrete Fourier transform (DFT) on the audio input signal $S(n)$ collected from the microphone. For example, for an audio input signal sampled at 44.1 kHz, a 256-point fast Fourier transform (FFT) can be performed on each 5.8 ms block of collected data. When the FFT result contains a component at the reference signal frequency, the collected audio input signal is considered to contain the echo of the audio reference signal. Because the frequency of the audio reference signal is higher than that of ordinary sound signals, the played audio content signal contains no component at that frequency, so any component at the reference frequency in the collected audio input signal comes from the echo of the audio reference signal.

Record the time $T_2$ at this moment, that is, the time when the microphone begins to receive the echo of the audio reference signal. The delay of the echo channel is:

$$t = T_2 - T_1 \tag{3}$$
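The detection of $T_2$ and the delay of formula (3) can be sketched as follows. This is a minimal illustration, not the method of the application: it evaluates the DFT sum at the single frequency $f_0$ instead of running a full 256-point FFT, it slides a 256-sample frame in steps of 32 samples, and it takes $T_2$ as the end time of the first frame whose amplitude at $f_0$ reaches a threshold. The names `tone_amplitude` and `estimate_delay`, the hop size, and the threshold are all assumptions made for the example.

```python
import cmath
import math


def tone_amplitude(frame, f0_hz, fs_hz):
    """Estimate the amplitude of the f0 component of `frame` (single-frequency DFT sum)."""
    w = 2.0 * math.pi * f0_hz / fs_hz
    acc = sum(x * cmath.exp(-1j * w * n) for n, x in enumerate(frame))
    return 2.0 * abs(acc) / len(frame)


def estimate_delay(mic, f0_hz, fs_hz, t1_s=0.0, frame_len=256, hop=32, threshold=0.01):
    """Return t = T2 - T1 in seconds, or None if the tone is never detected.

    T2 is taken as the end time of the first frame whose amplitude at f0
    reaches `threshold`; mic[0] is assumed to be captured at time 0.
    """
    for start in range(0, len(mic) - frame_len + 1, hop):
        frame = mic[start:start + frame_len]
        if tone_amplitude(frame, f0_hz, fs_hz) >= threshold:
            t2_s = (start + frame_len) / fs_hz
            return t2_s - t1_s
    return None


# Synthetic check: a 20 kHz echo of amplitude 0.05 that starts 10 ms after output begins.
FS, F0 = 44_100, 20_000
onset = 441  # samples, i.e. 10 ms at 44.1 kHz
mic = [0.0] * onset + [0.05 * math.sin(2.0 * math.pi * F0 / FS * n) for n in range(2048)]
delay = estimate_delay(mic, F0, FS)
```

In this sketch the estimate can differ from the true onset by up to one frame length (about 5.8 ms for 256 samples at 44.1 kHz), so a shorter frame or a finer onset search would be needed where tighter timing matters.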

The Fourier transform of the echo of the audio reference signal is a sum of impulse functions in the frequency domain:

$$|C(f)| = \sum_i A_i \, \delta(f - i f_0) \tag{4}$$

where $f_0$ is the frequency of the original audio reference signal, that is, the main frequency after the Fourier transform; $A_1$ is the amplitude at the main frequency $f_0$; and the other terms are sub-frequencies. Owing to the spectral response characteristics of the speaker, the microphone, and the environment, the amplitudes of the sub-frequencies are usually negligible in practical applications.

In this way, the attenuation coefficient $r$ of the echo channel, that is, the ratio of the amplitude of the echo of the audio reference signal to the amplitude of the original reference signal, can be expressed as:

$$r = A_1 / A_0 \tag{5}$$
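A possible way to evaluate formula (5) is sketched below. As an assumption made for simplicity, the amplitude $A_1$ is estimated by evaluating the DFT sum exactly at $f_0$ rather than by reading the nearest bin of a 256-point FFT; the helper names `amplitude_at` and `attenuation_coefficient` are hypothetical.

```python
import cmath
import math


def amplitude_at(signal, f0_hz, fs_hz):
    """Estimate the amplitude of the component of `signal` at frequency f0."""
    w = 2.0 * math.pi * f0_hz / fs_hz
    acc = sum(x * cmath.exp(-1j * w * n) for n, x in enumerate(signal))
    return 2.0 * abs(acc) / len(signal)


def attenuation_coefficient(echo_frame, a0, f0_hz, fs_hz):
    """Return r = A1 / A0, as in formula (5)."""
    if a0 <= 0:
        raise ValueError("a0 must be positive")
    return amplitude_at(echo_frame, f0_hz, fs_hz) / a0


# Synthetic check: the reference is output with amplitude 0.1 and returns with amplitude 0.03.
FS, F0, A0 = 44_100, 20_000, 0.1
echo_frame = [0.03 * math.sin(2.0 * math.pi * F0 / FS * n + 0.4) for n in range(256)]
r = attenuation_coefficient(echo_frame, A0, F0, FS)
```

For this synthetic frame the result is close to 0.3; the small residual error comes from the frame not containing a whole number of periods of the tone.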

204. Eliminate the echo of the audio content signal in the audio input signal according to the time delay and attenuation coefficient.

After determining the delay $t$ and attenuation coefficient $r$ of the echo channel according to the above steps, the terminal device, during subsequent voice interaction with the user, removes the echo of the played audio content signal from the input voice signal of the microphone and thereby recovers the user's voice input.

That is, the echo $X(n)$ of the audio content signal can be expressed as $X(n) = r X_0(n - t f_s)$, and the user's voice input is:

$$S_0(n) = S(n) - r X_0(n - t f_s) \tag{6}$$

where $f_s$ is the sampling frequency of the system. The user's voice input after echo cancellation can be used as input for voice recognition.
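Formula (6) amounts to subtracting a delayed, scaled copy of the played content signal from the microphone signal. The sketch below illustrates this under the assumption that the delay $t f_s$ is rounded to a whole number of samples and that content samples before the start of playback are zero; the function name `cancel_echo` is hypothetical.

```python
def cancel_echo(mic, content, delay_s, attenuation, fs_hz):
    """Return S0(n) = S(n) - r * X0(n - t*fs), as in formula (6).

    The delay t*fs is rounded to the nearest whole sample, and content
    samples outside the played range are treated as zero.
    """
    d = round(delay_s * fs_hz)
    recovered = []
    for n, s in enumerate(mic):
        x = content[n - d] if 0 <= n - d < len(content) else 0.0
        recovered.append(s - attenuation * x)
    return recovered


# Synthetic check: voice plus an echo of the content delayed by 2 samples, scaled by 0.5.
content = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
voice = [0.1, -0.2, 0.3, 0.0, 0.5, -0.1]
mic = [v + 0.5 * (content[n - 2] if n >= 2 else 0.0) for n, v in enumerate(voice)]
recovered = cancel_echo(mic, content, delay_s=0.002, attenuation=0.5, fs_hz=1000)
```

When the delay and attenuation match the channel exactly, as in this synthetic case, `recovered` equals the original `voice` up to floating-point rounding.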

Preferably, the audio input signal collected in step 202 may be band-pass filtered to extract the echo signal of the audio reference signal. The discrete Fourier transform in step 203 then operates only on the echo signal of the audio reference signal, which greatly speeds up the subsequent Fourier transform calculation.

The system can set the bandwidth $f_B$ of the band-pass filter according to the frequency $f_0$ of the audio reference signal. Band-pass filtering can be expressed as:

$$C(n) = \mathrm{bandpass}(S(n), f_0, f_B) \tag{7}$$
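The application does not specify how the band-pass filter of formula (7) is built. One possible realization, shown here purely as an assumption, is a second-order IIR band-pass section with unity gain at $f_0$ and a 3 dB bandwidth of $f_B$. The extra `fs_hz` parameter and the 500 Hz bandwidth in the example are choices made for this sketch.

```python
import math


def bandpass(signal, f0_hz, bandwidth_hz, fs_hz):
    """Second-order IIR band-pass centred on f0 with unity gain at f0."""
    if not 0.0 < f0_hz < fs_hz / 2.0:
        raise ValueError("f0 must lie between 0 and half the sampling frequency")
    w0 = 2.0 * math.pi * f0_hz / fs_hz
    alpha = math.tan(math.pi * bandwidth_hz / fs_hz)  # sets the 3 dB bandwidth
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0  # b1 is zero for this band-pass form
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


# Synthetic check: a weak 20 kHz echo (amplitude 0.05) buried under a strong 1 kHz tone.
FS, F0 = 44_100, 20_000
mix = [0.05 * math.sin(2.0 * math.pi * F0 / FS * n)
       + 0.5 * math.sin(2.0 * math.pi * 1_000 / FS * n) for n in range(4410)]
echo = bandpass(mix, F0, 500.0, FS)
tail = echo[2205:]  # skip the filter's start-up transient
tail_rms = math.sqrt(sum(y * y for y in tail) / len(tail))
```

After the start-up transient, the RMS of the filtered output is close to that of the 20 kHz component alone (0.05 divided by the square root of 2), which indicates that the 1 kHz content has been suppressed.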

Further, for the echo of the audio reference signal output by the band-pass filter, the root-mean-square (RMS) value of the filtered output signal can be calculated directly in the time domain, giving the average energy $E_1$ of the echo of the audio reference signal. In the same way, the RMS value is used in the time domain to calculate the average energy $E_0$ of the original audio reference signal. The attenuation coefficient $r$ of the echo channel, that is, the ratio of the amplitude of the echo of the audio reference signal to the amplitude of the original audio reference signal, can then be expressed as:

$$r = (E_1 / E_0)^{1/2} \tag{8}$$

For the delay of the echo channel, the method of formula (3) can still be used.

In this way, the echo cancellation does not need to perform FFT calculation, which further improves the speed of the system echo cancellation calculation.
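The time-domain alternative of formula (8) can be sketched as follows, assuming that $E_1$ and $E_0$ are the mean squared sample values of the band-pass filtered echo and of the original reference signal over windows of equal length. The function names `mean_energy` and `attenuation_from_rms` are hypothetical.

```python
import math


def mean_energy(signal):
    """Return the average energy (mean of the squared samples) of `signal`."""
    return sum(x * x for x in signal) / len(signal)


def attenuation_from_rms(echo, reference):
    """Return r = (E1 / E0) ** 0.5, as in formula (8)."""
    e0 = mean_energy(reference)
    if e0 == 0.0:
        raise ValueError("reference signal has zero energy")
    return math.sqrt(mean_energy(echo) / e0)


# Synthetic check: reference amplitude 0.1, echo amplitude 0.03 with a phase shift.
FS, F0 = 44_100, 20_000
w = 2.0 * math.pi * F0 / FS
reference = [0.1 * math.sin(w * n) for n in range(4410)]
echo = [0.03 * math.sin(w * n + 0.7) for n in range(4410)]
r = attenuation_from_rms(echo, reference)
```

The ratio does not depend on the phase shift between the two signals, so the echo does not have to be time-aligned with the reference before the energies are compared.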

In the above-mentioned embodiments of the present invention, the echo channel characteristic parameters are determined through the audio reference signal, which achieves echo cancellation, reduces the interference of the echo on the user's voice input, and improves the quality of the input voice.

An embodiment of the present invention also provides a schematic structural diagram of a terminal device, as shown in FIG. 3, including an audio output unit 301, an audio input unit 302, and a processing unit 303; wherein:

The audio output unit is used to output an audio reference signal;

The audio input unit is used to collect audio input signals, and the audio input signals include echoes of audio reference signals;

The processing unit is configured to determine the delay and attenuation coefficient of the echo channel according to the echo of the audio reference signal, and eliminate the echo of the audio content signal in the audio input signal according to the delay and attenuation coefficient.

Further, these units implement related functions in the foregoing method, and will not be described in detail.

In this embodiment, the terminal device is presented in the form of functional units. "Unit" here may refer to an application-specific integrated circuit (ASIC), a circuit, a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or another device that can provide the above functions. In a simple embodiment, those skilled in the art will appreciate that the terminal device can be implemented using a processor, a memory, and a communication interface.

The terminal device in the embodiment of the present invention may also be implemented in the manner of the computer device (or system) in FIG. 4. FIG. 4 is a schematic diagram of a computer device provided by an embodiment of the present invention. The computer device includes at least one processor 401, a communication bus 402, a memory 403, and at least one communication interface 404, and may further include an IO interface 405.

The processor may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the present invention.

The communication bus may include a path for transferring information between the aforementioned components. The communication interface uses any transceiver-like device to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).

The memory may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions; a random access memory (RAM) or another type of dynamic storage device that can store information and instructions; an electrically erasable programmable read-only memory (EEPROM); a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like); a magnetic disk storage medium or other magnetic storage device; or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these. The memory may exist independently and be connected to the processor through a bus. The memory can also be integrated with the processor.

Wherein, the memory is used to store application program code for executing the solution of the present invention, and is controlled and executed by the processor. The processor is used to execute application code stored in the memory.

In a specific implementation, the processor may include one or more CPUs, and each CPU may be a single-core processor or a multi-core processor. The processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).

In a specific implementation, as an embodiment, the computer device may further include an input/output (I/O) interface. For example, the output device may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device may be a mouse, a keyboard, a touch screen device or a sensing device, and at least two imaging sensors.

The aforementioned computer device may be a general-purpose computer device or a dedicated computer device. In a specific implementation, the computer device may be a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, or a device with a structure similar to that shown in FIG. 4. The embodiment of the present invention does not limit the type of computer device.

The terminal device in FIG. 1 may be the device shown in FIG. 4, and one or more software modules are stored in the memory. The terminal device can implement the software module through the processor and the program code in the memory to complete the above method.

An embodiment of the present invention also provides a computer storage medium for storing computer software instructions for the device shown in FIG. 3 or FIG. 4 above, which includes a program designed to execute the above method embodiment. By executing the stored program, the above method can be realized.

Although the present invention has been described herein in conjunction with various embodiments, in the process of implementing the claimed invention, those skilled in the art can understand and implement other variations of the disclosed embodiments by studying the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other components or steps, and "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill several functions recited in the claims. Certain measures are recited in mutually different dependent claims, but this does not mean that these measures cannot be combined to produce good results.

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, an apparatus (device), or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer usable program code. The computer program is stored/distributed in a suitable medium, provided together with other hardware or as a part of the hardware, and may also adopt other distribution forms, such as via the Internet or other wired or wireless telecommunication systems.

The present invention is described with reference to the flowcharts and/or block diagrams of the method, apparatus (device), and computer program product of the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, may be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means, and the instruction means implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operating steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Although the invention has been described in conjunction with specific features and embodiments thereof, it is obvious that various modifications and combinations can be made to it. Accordingly, the specification and drawings are merely exemplary illustrations of the invention as defined by the appended claims, and are deemed to cover any and all modifications, changes, combinations, or equivalents within the scope of the invention. Obviously, those skilled in the art can make various modifications and variations to the present invention without departing from the scope of the present invention. In this way, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and equivalent technologies thereof, the present invention is also intended to include these modifications and variations.

Claims

1. A method for eliminating echo, characterized in that it is applied to a terminal device and comprises:

outputting an audio reference signal;

collecting an audio input signal, the audio input signal including an echo of the audio reference signal;

determining the delay and attenuation coefficient of the echo channel according to the echo of the audio reference signal; and

eliminating the echo of the audio content signal in the audio input signal according to the delay and attenuation coefficient.

2. The method according to claim 1, wherein the determining the attenuation coefficient of the echo channel comprises:

calculating the amplitude of the echo signal at the frequency of the audio reference signal through the Fourier transform of the audio input signal;

wherein the ratio of the amplitude of the echo signal at the frequency of the audio reference signal to the amplitude of the output audio reference signal is the attenuation coefficient of the echo signal.

3. The method according to claim 1, wherein the method further comprises filtering the audio input signal through a band-pass filter to obtain the echo of the audio reference signal.

4. The method according to claim 3, wherein the determining the attenuation coefficient of the echo channel comprises:

calculating the amplitude of the echo signal at the frequency of the audio reference signal by means of root mean square;

wherein the ratio of the amplitude of the echo signal at the frequency of the audio reference signal to the amplitude of the output audio reference signal is the attenuation coefficient of the echo signal.

5. The method according to any one of claims 1 to 4, wherein the determining the delay of the echo channel comprises:

recording the first time, when the audio reference signal starts to be output, and recording the second time, when the echo of the audio reference signal starts to be detected in the audio input signal; the delay is the time difference between the second time and the first time.

6. The method according to any one of claims 1 to 5, wherein the frequency of the audio reference signal is greater than the frequency range of human ear audible sound.

7. The method according to any one of claims 1 to 6, wherein the outputting of the audio reference signal is performed when the terminal device is turned on, or periodically.

8. A terminal device, characterized by comprising: an audio output unit, an audio input unit, and a processing unit; wherein:

the audio output unit is used to output an audio reference signal;

the audio input unit is used to collect an audio input signal, the audio input signal including an echo of the audio reference signal;

the processing unit is configured to determine the delay and attenuation coefficient of the echo channel according to the echo of the audio reference signal, and to eliminate the echo of the audio content signal in the audio input signal according to the delay and attenuation coefficient.

9. The terminal device according to claim 8, wherein the processing unit being configured to determine the attenuation coefficient of the echo channel specifically includes:

the processing unit is further used to calculate the amplitude of the echo signal at the frequency of the audio reference signal by Fourier transform of the audio input signal;

wherein the ratio of the amplitude of the echo signal at the frequency of the audio reference signal to the amplitude of the output audio reference signal is the attenuation coefficient of the echo signal.

10. The terminal device according to claim 8, wherein the processing unit is further configured to filter the audio input signal through a band-pass filter to obtain the echo of the audio reference signal.

11. The terminal device according to claim 10, wherein the processing unit being configured to determine the attenuation coefficient of the echo channel specifically includes:

the processing unit is further used to calculate the amplitude of the echo signal at the frequency of the audio reference signal by means of root mean square;

wherein the ratio of the amplitude of the echo signal at the frequency of the audio reference signal to the amplitude of the output audio reference signal is the attenuation coefficient of the echo signal.

12. The terminal device according to any one of claims 8 to 11, wherein the processing unit being configured to determine the delay of the echo channel includes:

the processing unit is further used to record the first time, when the audio reference signal starts to be output, and to record the second time, when the echo of the audio reference signal starts to be detected in the audio input signal; the delay is the time difference between the second time and the first time.

13. The terminal device according to any one of claims 8 to 12, wherein the frequency of the audio reference signal is greater than the frequency range of human ear audible sound.

14. The terminal device according to any one of claims 8 to 13, wherein the output of the audio reference signal by the audio output unit is performed when the terminal device is turned on, or periodically.