CN111885276A

CN111885276A - Method and system for eliminating echo

Info

Publication number: CN111885276A
Application number: CN202010759863.2A
Authority: CN
Inventors: 陈仁武; 余睿; 王青; 杜艳斌
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-07-31
Filing date: 2020-07-31
Publication date: 2020-11-03
Anticipated expiration: 2040-07-31
Also published as: CN111885276B

Abstract

An embodiment of the present specification provides a method and a system for canceling echo. The system may include a speaker, a unidirectional microphone, and a terminal device. The speaker receives a first voice signal and plays it. The speaker includes an acquisition circuit that acquires the first voice signal before playback to obtain a first acquisition signal. The unidirectional microphone points toward the speaker and away from the sound source position, and captures the sound that the speaker plays for the received first voice signal to obtain a second acquisition signal. The terminal device includes a call microphone and a first filter component. The call microphone captures the user's speech to obtain a second voice signal. The first filter component uses the second acquisition signal as an echo reference signal and performs first filtering processing on the echo component in the second voice signal to obtain a first filtered signal. The parameters of each filter in the first filter component are adjusted according to the first acquisition signal.

Description

Method and system for eliminating echo

Technical Field

One or more embodiments of the present disclosure relate to the field of electronic technologies, and in particular, to a method and system for canceling echo.

Background

Communication devices such as mobile terminals often suffer from echo interference during a call, for example echo that the microphone picks up from the speaker, and such interference directly degrades call quality.

To address this, the prior art proposes an echo cancellation scheme in which the opposite-end voice signal that is sent to the speaker during the call is used as a reference for canceling echo in the local-end voice signal.

Disclosure of Invention

One or more embodiments of the present disclosure describe a method and a system for echo cancellation, which can effectively improve the effect of echo cancellation, and thus can greatly improve the call quality.

In a first aspect, a system for canceling echo is provided, including:

a speaker, configured to receive and play a first voice signal; the speaker includes an acquisition circuit configured to acquire the first voice signal before playback to obtain a first acquisition signal;

a unidirectional microphone, which points toward the speaker and away from a sound source position, configured to capture the sound that the speaker plays for the received first voice signal to obtain a second acquisition signal; the sound source position refers to the position from which the user speaks;

a terminal device; the terminal device includes:

a call microphone, configured to capture the user's speech to obtain a second voice signal;

a first filter component, configured to use the second acquisition signal as an echo reference signal and perform first filtering processing on an echo component in the second voice signal to obtain a first filtered signal; the parameters of each filter in the first filter component are adjusted according to the first acquisition signal.

In a second aspect, a method for canceling echo is provided, including:

capturing, through a unidirectional microphone, the sound that a speaker plays for a received first voice signal to obtain a second acquisition signal; wherein the unidirectional microphone points toward the speaker and away from a sound source position; the sound source position refers to the position from which the user speaks;

using the second acquisition signal as an echo reference signal, and performing first filtering processing on an echo component in a second voice signal captured by a call microphone to obtain a first filtered signal; wherein the parameters of the first filtering processing are adjusted according to a first acquisition signal obtained by acquiring the first voice signal before playback.

In the method and system for canceling echo provided in one or more embodiments of the present specification, the second acquisition signal captured by the unidirectional microphone may be used as an echo reference signal, and first filtering processing may be performed on the echo component in the second voice signal to obtain a first filtered signal. Because the second acquisition signal is closer to the echo component in the second voice signal, using it for echo cancellation cancels echo interference more accurately, which improves the echo cancellation effect and the call quality. In addition, the filtering parameters may be adjusted according to a first acquisition signal obtained by acquiring the first voice signal before playback. Since the first acquisition signal is real raw data, adjusting the parameters of the filtering processing based on it makes the parameter updates more accurate.

Drawings

To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present disclosure, and those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a system for canceling echo as provided herein;

FIG. 2 is a flowchart of a method for canceling echo according to an embodiment of the present disclosure;

FIG. 3 is a diagram illustrating a method of canceling echo in one example.

Detailed Description

The scheme provided by the specification is described below with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of a system for canceling echo according to the present disclosure. In Fig. 1, the system 10 may include a speaker 102, a unidirectional microphone 104, and a terminal device 106. The speaker 102 receives and plays the first voice signal Q, which is the voice signal of the opposite end of the call. The speaker 102 may include an acquisition circuit configured to acquire the first voice signal Q before playback to obtain a first acquisition signal R. Acquiring the signal at this point greatly reduces the delay fluctuation of the acquired signal. The unidirectional microphone 104, which may also be called a pickup microphone (or reference microphone), points toward the speaker 102 and away from the sound source position. The sound source position here refers to the position from which the user speaks. The unidirectional microphone 104 captures the sound that the speaker 102 plays for the received first voice signal Q to obtain a second acquisition signal A. It should be noted that, because the unidirectional microphone 104 does not point at the sound source, the influence of the user's speech on the unidirectional microphone 104 can be avoided, which improves the accuracy of the captured signal.

For the terminal device 106 in Fig. 1, at the software level, a conference system may be deployed, through which the user can talk with the opposite end. At the hardware level, the terminal device 106 may include a call microphone 202 and a first filter component 204. The call microphone 202, which may be an omnidirectional microphone, for example, captures the user's speech to obtain a second voice signal B (also called the local voice signal). The first filter component 204 uses the second acquisition signal A as an echo reference signal and performs first filtering processing on the echo component in the second voice signal B to obtain a first filtered signal C. The parameters of each filter in the first filter component 204 are adjusted according to the first acquisition signal R. For example, the adjustment may be made when harmonics are present in the first acquisition signal R.

In one example, the first filter component 204 described above can include a Kalman (KALMAN) filter and a Normalized Least Mean Square (NLMS) filter.

Specifically, the KALMAN filter may receive the second acquisition signal A and the second voice signal B, and perform the first filtering processing on the echo component in the second voice signal B using the second acquisition signal A as the echo reference signal. Its output is an intermediate voice signal in which echo has been canceled by the KALMAN filter. Likewise, the NLMS filter may receive the second acquisition signal A and the second voice signal B, and perform the first filtering processing on the echo component in the second voice signal B using the second acquisition signal A as the echo reference signal. Its output is an intermediate voice signal in which echo has been canceled by the NLMS filter.

Thereafter, the output of the KALMAN filter and the output of the NLMS filter may be fused by the fusion unit 206 in the terminal device 106, so as to obtain the first filtered signal C. The fusion here may include, but is not limited to, averaging, weighted averaging, or taking a maximum value, etc.
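As a rough illustration of one branch of the first filter component, the NLMS filter can be sketched as below. This is a minimal textbook NLMS echo canceller, not the implementation of this specification: the function name `nlms_cancel`, the tap count, the step size `mu`, and the regularization term `eps` are all assumptions chosen for illustration. The KALMAN branch and the fusion step are omitted.

```python
from typing import List, Sequence


def nlms_cancel(reference: Sequence[float], mic: Sequence[float],
                taps: int = 8, mu: float = 0.5, eps: float = 1e-8) -> List[float]:
    """Textbook NLMS: subtract an adaptive estimate of the echo from `mic`.

    `reference` plays the role of the second acquisition signal A and `mic`
    the role of the second voice signal B. All defaults are assumed values.
    """
    w = [0.0] * taps   # adaptive weights: running estimate of the echo path
    x = [0.0] * taps   # latest reference samples, newest first
    out: List[float] = []
    for r, d in zip(reference, mic):
        x = [r] + x[:-1]
        echo_est = sum(wi * xi for wi, xi in zip(w, x))
        e = d - echo_est                       # echo-reduced output sample
        norm = sum(xi * xi for xi in x) + eps  # input power normalization
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        out.append(e)
    return out
```

With a noise-free echo path shorter than `taps`, the output of this sketch decays toward zero as the weights converge; in a real call the output would retain the local speech.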

Optionally, the terminal device 106 may further include a second filter component 208, configured to perform a second filtering process on an echo component in the second speech signal B by using the first collected signal R as an echo reference signal, so as to obtain a second filtered signal D. It should be noted that the second filter component 208 here may also include two filters, and the two filters are in parallel relationship. The filtering process of each filter is similar to that of the above-mentioned KALMAN filter or NLMS filter, and the description thereof is omitted here.

After obtaining the second filtered signal D, the fusion unit 206 is further configured to fuse the first filtered signal C and the second filtered signal D to obtain the target speech signal E.

Optionally, the terminal device 106 may further include a nonlinear processing unit 210, where the nonlinear processing unit 210 is specifically configured to:

calculate the energy difference between the second acquisition signal A and the second voice signal B; determine the single-double-talk state of the terminal device 106 at least according to the calculated energy difference, where the single-double-talk state is either a single-talk state, in which only the first voice signal Q exists, or a double-talk state, in which the first voice signal Q and the second voice signal B exist simultaneously; determine the cancellation strength for the residual echo in the target voice signal based on the determined state; and, based on the determined cancellation strength, perform nonlinear processing on the target voice signal according to a nonlinear processing algorithm to obtain the final echo-canceled second voice signal.

The residual echo is an echo remaining in the second voice signal B after the first filtering process and the second filtering process are performed on the initial second voice signal B collected by the call microphone 202.

It should be noted that, in practice, after the energy difference between the second acquisition signal A and the second voice signal B is calculated, the energy difference may be normalized, and the single-double-talk state of the terminal device 106 may then be determined from the normalized result. The specific normalization steps are described later.

In addition, determining the cancellation strength for the residual echo in the target voice signal based on the determined single-double-talk state may specifically be as follows. If the determined state is the single-talk state, the cancellation strength for the residual echo is set to a first strength. If the determined state is the double-talk state, the cancellation strength for the residual echo is set to a second strength, where the first strength is greater than the second strength. In other words, if the terminal device 106 is in the single-talk state, the residual echo is strongly suppressed so that as much of it as possible is canceled. If the terminal device 106 is in the double-talk state, the residual echo is weakly suppressed so as to protect the second voice signal B.

The method of canceling echo by the system 10 shown in fig. 1 is described below by way of specific embodiments.

Fig. 2 is a flowchart of a method for canceling echo according to an embodiment of the present disclosure. The method may be executed by any device with processing capability, such as a server, a system, or an apparatus, for example the terminal device 106 in Fig. 1. As shown in Fig. 2, the method may specifically include:

Step 22, the sound that the speaker 102 plays for the received first voice signal Q is captured by the unidirectional microphone 104 to obtain a second acquisition signal A.

It should be noted that, in the embodiments of the present disclosure, the second acquisition signal A captured by the unidirectional microphone 104 contains the nonlinear distortion of the speaker 102. This addresses the problem in the prior art that the nonlinear distortion of the speaker 102 cannot be estimated during echo cancellation.

Step 24, the second acquisition signal A is used as an echo reference signal, and first filtering processing is performed on the echo component in the second voice signal B captured by the call microphone 202 to obtain a first filtered signal C.

It should be understood that, since the second acquisition signal A contains the nonlinear distortion of the speaker 102, the first filtering processing removes the nonlinear echo component, and the first filtered signal C is therefore a signal from which the nonlinear echo has been removed.

It should be noted that, in this specification, the second acquisition signal A, the first acquisition signal R, and the second voice signal B may be acquired simultaneously.

Optionally, before the filtering processing is performed, whether the opposite end is producing sound may be determined according to the first acquisition signal R. For example, when the first acquisition signal R is greater than a predetermined threshold, it may be determined that the opposite end is producing sound; otherwise, it is determined that the opposite end is silent. Step 24 may be executed when the opposite end is determined to be producing sound.
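The threshold check above can be sketched as follows. The specification does not say which quantity of the first acquisition signal R is compared with the threshold; this sketch assumes the mean-square energy of one frame, and both the function name `far_end_active` and the default threshold are hypothetical.

```python
from typing import Sequence


def far_end_active(r_frame: Sequence[float], threshold: float = 1e-4) -> bool:
    """Return True when one frame of R has mean-square energy above `threshold`.

    Using frame energy as the compared quantity is an assumption of this sketch.
    """
    if not r_frame:
        return False
    energy = sum(s * s for s in r_frame) / len(r_frame)
    return energy > threshold
```

A caller would run step 24 only for frames where `far_end_active` returns True.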

In one example, the filtering processing described above may be performed by the first filter component 204. Taking the case where the first filter component 204 includes a KALMAN filter and an NLMS filter as an example, the second acquisition signal A may be input to the KALMAN filter as the echo reference signal, together with the second voice signal B. After receiving the two inputs, the KALMAN filter may perform the first filtering processing on the echo component in the second voice signal B according to the second acquisition signal A, to obtain an intermediate voice signal in which echo has been canceled by the KALMAN filter. Likewise, the second acquisition signal A may be input to the NLMS filter as the echo reference signal, together with the second voice signal B. After receiving the two inputs, the NLMS filter may perform the first filtering processing on the echo component in the second voice signal B according to the second acquisition signal A, to obtain an intermediate voice signal in which echo has been canceled by the NLMS filter. The output of the KALMAN filter may then be fused with the output of the NLMS filter (that is, the two intermediate voice signals are fused) to obtain the first filtered signal C. The fusion here may include, but is not limited to, averaging, weighted averaging, or taking a maximum value.

It should be noted that obtaining the first filtered signal C from two filters can greatly increase the convergence rate of the filtering, thereby avoiding echo leakage and reducing damage to the second voice signal B.

In one example, before each filter processes a frame of the signal, whether the filter satisfies the parameter adjustment condition may be determined according to the first acquisition signal R. Specifically, harmonic analysis may be performed on the first acquisition signal R to determine whether it contains harmonics; if it does, the parameter adjustment condition is satisfied. When the parameter adjustment condition is satisfied, the parameters of each filter may be adjusted according to the input signals (the second acquisition signal A and the second voice signal B).
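The gating described above can be illustrated with a simple periodicity test. The specification does not describe how the harmonic analysis is carried out; the sketch below substitutes a normalized autocorrelation peak as a crude stand-in, and the names `has_harmonics` and `should_adapt`, the lag range, and the threshold are all assumptions.

```python
import math
from typing import Sequence


def has_harmonics(frame: Sequence[float], min_lag: int = 20,
                  max_lag: int = 160, threshold: float = 0.5) -> bool:
    """Crude periodicity test: peak of the normalized autocorrelation."""
    n = len(frame)
    best = 0.0
    for lag in range(min_lag, min(max_lag, n - 1) + 1):
        head = frame[lag:]        # samples lag .. n-1
        tail = frame[:n - lag]    # samples 0 .. n-1-lag, aligned with head
        num = sum(p * q for p, q in zip(head, tail))
        den = math.sqrt(sum(p * p for p in head) * sum(q * q for q in tail))
        if den > 0.0:
            best = max(best, num / den)
    return best > threshold


def should_adapt(r_frame: Sequence[float]) -> bool:
    """Parameter adjustment condition: adapt only if R shows harmonics."""
    return has_harmonics(r_frame)
```

A filter loop would call `should_adapt` on each frame of R and freeze its weights for frames where it returns False.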

In summary, in the embodiments of the present disclosure, the parameters of each filter may be adjusted according to the first acquisition signal R. Further, it is to be understood that the adjustment of each filter parameter is repeatedly performed along with the steps of the filtering process described above.

Finally, after obtaining the first filtered signal C, the first filtered signal C may be directly used as a final output.

In summary, in the embodiments of the present specification, the second acquisition signal A captured by the unidirectional microphone 104 is closer to the echo component in the second voice signal B. Using the second acquisition signal A for echo cancellation therefore cancels echo interference more accurately, which improves the echo cancellation effect and the call quality. In addition, because a larger share of the echo can be estimated, the burden on the subsequent residual echo cancellation is reduced, which in turn reduces damage to the local-end speech.

Optionally, in the embodiment of the present specification, the first acquisition signal R may be further used as an echo reference signal, and the second filtering process is performed on an echo component in the second speech signal B to obtain a second filtered signal D. And fusing the first filtering signal C and the second filtering signal D to obtain a target voice signal E.

It should be understood that the first acquisition signal R has no excessive delay fluctuation and no distortion; it is real raw data. Therefore, the second filtering processing removes the linear echo component, and the second filtered signal D is a signal from which the linear echo has been removed.

The steps of the second filtering process described above may be performed by the second filter component 208. The second filter element 208, similar to the first filter element 204, may also include two filters in a parallel relationship. The filtering process of each filter is similar to that of the above-mentioned KALMAN filter or NLMS filter, and the description thereof is omitted here.

It should be noted that, since the first acquisition signal R is acquired directly by the acquisition circuit disposed in the speaker, it has no excessive delay fluctuation and no distortion, and is real raw data. Therefore, using the first acquisition signal R for echo cancellation can cancel echo interference more accurately.

The fusion may include, but is not limited to, averaging, weighted averaging, or taking a maximum value. That is, a weighted average, average, or maximum operation may be performed on the first filtered signal C and the second filtered signal D to obtain an operation result, and the operation result is taken as the target voice signal E.
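The three fusion options named above can be sketched as one helper. The function name `fuse`, the `mode` strings, and the `weight_c` parameter are hypothetical, and applying the maximum sample by sample is an assumption, since the specification does not say how the maximum is taken.

```python
from typing import List, Sequence


def fuse(c: Sequence[float], d: Sequence[float],
         mode: str = "average", weight_c: float = 0.5) -> List[float]:
    """Fuse two equally long filtered signals sample by sample."""
    if len(c) != len(d):
        raise ValueError("signals must have the same length")
    if mode == "average":
        return [(x + y) / 2.0 for x, y in zip(c, d)]
    if mode == "weighted":
        # weight_c applies to c, the remainder (1 - weight_c) to d
        return [weight_c * x + (1.0 - weight_c) * y for x, y in zip(c, d)]
    if mode == "max":
        return [max(x, y) for x, y in zip(c, d)]
    raise ValueError(f"unknown fusion mode: {mode}")
```

The same helper applies equally to fusing the KALMAN and NLMS outputs and to fusing the first and second filtered signals.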

Furthermore, the above-described step of fusing the first filtered signal C and the second filtered signal D may be performed by the fusing unit 206. It should be understood that when the target speech signal E is also acquired, the target speech signal E may be taken as a final output.

In the embodiments of the present description, performing echo cancellation on the second voice signal B with both the second acquisition signal A and the first acquisition signal R can, on the one hand, increase the share of the linear echo that is estimated and, on the other hand, greatly improve the flexibility of the echo cancellation method.

The process of obtaining the target voice signal E is described below with reference to Fig. 3, which illustrates a method of canceling echo in one example. In Fig. 3, the second voice signal B captured by the call microphone 202 may be input to the KALMAN filter and the NLMS filter in the first filter component 204 and to the second filter component 208. The second acquisition signal A captured by the unidirectional microphone 104 may be input to the KALMAN filter and the NLMS filter. The first acquisition signal R acquired by the acquisition circuit in the speaker 102 may be input to the second filter component 208. The outputs of the KALMAN filter and the NLMS filter are then fused to obtain the first filtered signal C, and the output of the second filter component 208 gives the second filtered signal D. Finally, the first filtered signal C and the second filtered signal D are fused to obtain the target voice signal E.
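The signal routing of Fig. 3 can be summarized in a short wiring sketch. The filters are passed in as callables taking (reference, microphone) and returning a filtered signal; the names `target_signal`, `mean_fuse`, and `Canceller` are hypothetical, and simple averaging is assumed for both fusion steps.

```python
from typing import Callable, List, Sequence

# A canceller takes (echo reference, microphone signal) and returns a signal.
Canceller = Callable[[Sequence[float], Sequence[float]], List[float]]


def mean_fuse(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Sample-wise average, assumed here as the fusion rule."""
    return [(p + q) / 2.0 for p, q in zip(x, y)]


def target_signal(b: Sequence[float], a: Sequence[float], r: Sequence[float],
                  kalman: Canceller, nlms: Canceller,
                  second: Canceller) -> List[float]:
    """Route B, A and R as in Fig. 3 and return the target voice signal E."""
    c = mean_fuse(kalman(a, b), nlms(a, b))  # first filtered signal C (reference A)
    d = second(r, b)                         # second filtered signal D (reference R)
    return mean_fuse(c, d)                   # target voice signal E
```

Any concrete filter with the same call shape can be plugged in for `kalman`, `nlms`, or `second`.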

It should be understood that, although part of the echo is removed by the first filtering processing and the second filtering processing, however well the linear filters perform, some residual echo may remain and annoy the talker at the opposite end. There are many reasons for this, such as nonlinear factors in the structure of the terminal device, changes in the usage environment, and poor convergence of the filters. Therefore, the residual echo in the second voice signal may be canceled as well. The procedure for canceling the residual echo is described below.

The step of canceling the residual echo may specifically include the following. The energy difference between the second acquisition signal A and the second voice signal B is calculated. The single-double-talk state of the terminal device 106 is determined at least according to the calculated energy difference. The single-double-talk state is either a single-talk state or a double-talk state: the single-talk state is a state in which only the first voice signal Q exists, and the double-talk state is a state in which the first voice signal Q and the second voice signal B exist simultaneously. The cancellation strength for the residual echo in the target voice signal E is determined based on the determined state. Based on the determined cancellation strength, nonlinear processing is performed on the target voice signal E according to a nonlinear processing algorithm to obtain the final echo-canceled second voice signal.
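The state decision can be illustrated with a per-frame energy comparison. The specification only says that the decision uses the energy difference between A and B; the decision rule below (single-talk when the energy of A exceeds that of B by more than a margin), the `margin` parameter, and the function names are assumptions made for this sketch.

```python
from typing import Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean-square energy of one frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def talk_state(a_frame: Sequence[float], b_frame: Sequence[float],
               margin: float = 0.0) -> str:
    """Return "single" or "double" from the energy difference of A and B.

    Assumed rule: local speech adds energy to B, so a small or negative
    difference (A minus B) is read as double talk.
    """
    diff = frame_energy(a_frame) - frame_energy(b_frame)
    return "single" if diff > margin else "double"
```

In practice the margin would have to be tuned to the gains of the two microphones.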

In this way, the existing problem of echo double-talk clipping can be addressed. Echo double-talk clipping here means that, when both parties of a call speak at the same time, the other party or a third party clearly perceives the sound as choppy or inaudible, or even suppressed to silence.

It should be noted that, in practice, after the energy difference between the second acquisition signal A and the second voice signal B is calculated, the energy difference may be normalized, and the single-double-talk state of the terminal device 106 may then be determined from the normalized result.

In one example, the above formula for the normalization process may be as follows:

(formula 1)

Here, A is the second acquisition signal, B is the second voice signal, (A-B) is the energy difference, and F is the normalization result.

In addition, determining the cancellation strength for the residual echo in the target voice signal E based on the determined single-double-talk state may specifically include the following. If the determined state is the single-talk state, the cancellation strength for the residual echo is set to a first strength. If the determined state is the double-talk state, the cancellation strength for the residual echo is set to a second strength. The first strength is greater than the second strength. That is, in this specification, if the terminal device 106 is in the single-talk state, the residual echo is strongly suppressed so that as much of it as possible is canceled. If the terminal device 106 is in the double-talk state, the residual echo is weakly suppressed so as to protect the second voice signal B.

It should be further noted that the general idea of performing nonlinear processing on the target voice signal E according to the nonlinear processing algorithm is as follows: the residual echo is estimated, and the estimated residual echo is suppressed based on the determined cancellation strength; a signal-to-noise ratio is determined from the residual echo after suppression; the final algorithm gain is determined from the signal-to-noise ratio; and the final echo-canceled second voice signal is determined from the final algorithm gain and the target voice signal E.
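One way to picture this chain for a single frame is sketched below. The specification does not give the formulas, so everything specific here is assumed: the strength values, the use of the strength as a scaling factor on the estimated residual echo power, the Wiener-style gain, the gain floor, and the names `nlp_frame` and `STRENGTHS`. The residual echo estimate is taken as an input rather than computed.

```python
from typing import Dict, List, Sequence

# Assumed values; the specification only requires first strength > second strength.
STRENGTHS: Dict[str, float] = {"single": 4.0, "double": 1.0}


def _power(frame: Sequence[float]) -> float:
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def nlp_frame(e_frame: Sequence[float], residual_est: Sequence[float],
              state: str, floor: float = 0.05) -> List[float]:
    """Apply a gain to one frame of E derived from a residual-echo estimate."""
    strength = STRENGTHS[state]                   # cancellation strength
    echo_pow = strength * _power(residual_est)    # weighted residual echo power
    sig_pow = _power(e_frame)
    # Ratio of the remaining signal power to the weighted residual echo power.
    snr = max(sig_pow - echo_pow, 0.0) / (echo_pow + 1e-12)
    gain = max(snr / (1.0 + snr), floor)          # Wiener-style gain (assumed)
    return [gain * s for s in e_frame]
```

With these assumed values, a frame judged single-talk is attenuated more heavily than the same frame judged double-talk, which matches the intent of protecting local speech during double talk.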

Since the nonlinear processing of signals according to a nonlinear processing algorithm is a conventional technique, the detailed nonlinear processing process is not repeated here. It should be emphasized that the processing of the residual echo is guided by the judgment result of the single-talk and double-talk states, so that the echo leakage can be avoided during single talk, and the echo double-talk effect can be improved.

In summary, the method for canceling echo provided in the embodiments of the present disclosure can effectively improve the effect of echo cancellation, and thus can greatly improve the call quality.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware or may be embodied in software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in a server. Of course, the processor and the storage medium may reside as discrete components in a server.

Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The specific embodiments above describe the objects, technical solutions, and advantages of the present specification in further detail. It should be understood that they are only specific embodiments of the present specification and are not intended to limit its scope; any modification, equivalent substitution, or improvement made on the basis of the technical solutions of the present specification shall fall within the scope of the present specification.

Claims

1. A system for canceling echo, comprising:

a terminal device; the terminal device includes:

2. The system of claim 1, the first filter component comprising a Kalman filter and a normalized least mean square adaptive filter;

the first filtered signal is obtained by inputting the second acquisition signal and the second voice signal to the Kalman filter and the normalized least mean square adaptive filter, respectively, and fusing the output of the Kalman filter with the output of the normalized least mean square adaptive filter.

3. The system of claim 1, the parameters of each filter in the first filter component being adjusted in particular when harmonics are present in the first acquisition signal.

4. The system of claim 1, the terminal device further comprising:

the second filter component is used for taking the first acquisition signal as an echo reference signal and carrying out second filtering processing on an echo component in the second voice signal to obtain a second filtering signal;

and the fusion unit is used for fusing the first filtering signal and the second filtering signal to obtain a target voice signal.

5. The system of claim 4, the fusion unit being specifically configured to:

performing a weighted average, average, or maximum operation on the first filtered signal and the second filtered signal to obtain an operation result;

and taking the operation result as the target voice signal.

6. The system of claim 4, the terminal device further comprising:

the nonlinear processing unit is used for calculating the energy difference between the second acquisition signal and the second voice signal;

judging the single-double-talk state at least according to the calculated energy difference; the single-double-talk state comprises a single-talk state and a double-talk state; the single-speaking state refers to a state in which only the first voice signal exists, and the double-speaking state refers to a state in which the first voice signal and the second voice signal exist simultaneously;

determining the elimination intensity of residual echo in the target voice signal based on the judgment result of the single-double-talk state;

and based on the determined elimination intensity, carrying out nonlinear processing on the target voice signal according to a nonlinear processing algorithm to obtain a final second voice signal after echo elimination.

7. The system of claim 6, the non-linear processing unit to:

if the determined single-double-talk state is the single-talk state, determining the cancellation strength for the residual echo to be a first strength;

if the determined single-double-talk state is the double-talk state, determining the cancellation strength for the residual echo to be a second strength; wherein the first strength is greater than the second strength.

8. A method of canceling echo, comprising:

9. The method of claim 8, further comprising:

taking the first acquisition signal as an echo reference signal, and performing second filtering processing on an echo component in the second voice signal to obtain a second filtering signal;

and fusing the first filtering signal and the second filtering signal to obtain a target voice signal.

10. The method of claim 9, said fusing said first filtered signal and said second filtered signal to obtain a target speech signal, comprising:

and taking the operation result as the target voice signal.

11. The method of claim 9, further comprising:

calculating the energy difference between the second acquisition signal and the second voice signal;

12. The method of claim 11, wherein determining the cancellation strength of the residual echo in the target speech signal based on the determination result of the one-talk and two-talk states comprises: