CN115512712A

CN115512712A - Echo cancellation method, device and equipment

Info

Publication number: CN115512712A
Application number: CN202210995357.2A
Authority: CN
Inventors: 黄伟隆; 冯津伟
Original assignee: Dingtalk China Information Technology Co Ltd
Current assignee: Dingtalk China Information Technology Co Ltd
Priority date: 2022-03-22
Filing date: 2022-08-18
Publication date: 2022-12-23

Abstract

The application discloses an echo cancellation method and device and a conference terminal. The echo cancellation method uses the electrical reference signal of conventional echo cancellation and combines it with microphone array beamforming and an acoustic reference microphone, thereby realizing multi-reference echo cancellation. With this processing, the echo signal is estimated by the microphone array, which captures not only the linear part of the echo signal along the transmission path but also the nonlinear part. The echo signal estimated by the microphone array can therefore serve as a new reference signal for linear adaptive filtering, which reduces the influence of nonlinearities in actual products on the echo cancellation system, effectively filters out the nonlinear components of the echo signal, and improves the echo cancellation effect.

Description

Echo cancellation method, device and equipment

The present application claims priority to Chinese patent application No. 2022102847248, entitled "Echo cancellation method, apparatus and device", filed with the China Patent Office on March 22, 2022, the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of speech processing technologies, and in particular, to an echo cancellation method and apparatus, and a conference terminal.

Background

Internet technology has changed the way people communicate, and audio and video conference systems based on cloud computing are becoming increasingly popular. An audio and video conference terminal may generate echo during use, so that a talker hears his or her own voice, which degrades the conference experience. Echo cancellation in video conference environments has therefore long been a research hotspot.

Disclosure of Invention

The application provides an echo cancellation method to solve the problem that non-linear echo signals cannot be cancelled in the prior art. The application further provides an echo cancellation device and a conference terminal.

The application provides an echo cancellation method, which is used for conference equipment, wherein the conference equipment comprises: a first microphone array, a second microphone, a loudspeaker;

the method comprises the following steps:

acquiring multi-path microphone signals through the first microphone array, wherein the microphone signals comprise sound source signals and echo signals; acquiring a microphone signal through the second microphone as an acoustic reference microphone signal; and acquiring a loopback reference signal from the loudspeaker;

enhancing the echo signal and suppressing the sound source signal by a first beamforming algorithm directed at the loudspeaker to obtain a first sound signal, the first sound signal comprising a linear echo signal and a nonlinear echo signal;

and performing linear adaptive filtering processing according to the first sound signal, the acoustic reference microphone signal, the loopback reference signal and the microphone signals to obtain an echo cancellation signal.

Optionally, the method further includes:

suppressing the echo signal and enhancing the sound source signal by a second beam forming algorithm pointing to a target sound source to obtain a second sound signal;

and performing linear adaptive filtering processing according to the first sound signal, the second sound signal, the acoustic reference microphone signal and the loopback reference signal to obtain an echo cancellation signal.

Optionally, the suppressing the echo signal and enhancing the sound source signal by a second beamforming algorithm pointing to a target sound source to obtain a second sound signal includes:

determining a suppression coefficient according to an acoustic propagation function between the loudspeaker and the microphone array and each microphone weight vector calculated in a frequency domain by the second beam forming algorithm;

acquiring echo signals after suppression according to the suppression coefficient;

and taking the sum of the suppressed echo signal and the sound source signal as the second sound signal.

Optionally, the performing linear adaptive filtering processing according to the first sound signal, the acoustic reference microphone signal, the loopback reference signal and the microphone signals to obtain an echo cancellation signal includes:

acquiring a third sound signal according to the first sound signal, the acoustic reference microphone signal and the loopback reference signal;

and performing linear adaptive filtering processing according to the second sound signal and the third sound signal to obtain an echo cancellation signal.

Optionally, the performing linear adaptive filtering processing to obtain an echo cancellation signal includes:

mapping the first sound signal, the acoustic reference microphone signal and the loopback reference signal into a fourth sound signal, a fifth sound signal and a sixth sound signal;

according to the second sound signal and the fourth sound signal, linear self-adaptive filtering processing is carried out to obtain a first echo cancellation signal;

according to the first echo cancellation signal and the fifth sound signal, performing linear adaptive filtering processing to obtain a second echo cancellation signal;

and performing linear adaptive filtering processing according to the second echo cancellation signal and the sixth sound signal to obtain a third echo cancellation signal.
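The three cascaded filtering steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the filter of the application: each stage is a single-tap normalized LMS filter, the loudspeaker distortion is a hypothetical cubic term, the first sound signal is idealized as the loudspeaker signal itself, and the names `cancel_stage` and `echo_power` as well as all gains and step sizes are illustrative.

```python
import random

def cancel_stage(primary, reference, mu=0.01, alpha=0.9, eps=1e-6):
    """One single-tap NLMS stage: remove the part of `primary` that is linear in `reference`."""
    w, power, out = 0.0, eps, []
    for d, r in zip(primary, reference):
        power = alpha * power + (1.0 - alpha) * r * r   # smoothed reference power
        e = d - w * r                                    # output of this stage
        w += mu * e * r / (power + eps)                  # normalized LMS update
        out.append(e)
    return out

random.seed(2)
n = 8000
loopback = [random.uniform(-1.0, 1.0) for _ in range(n)]        # loopback reference signal
loudspeaker = [x - 0.3 * x ** 3 for x in loopback]              # assumed nonlinear playback
near_end = [0.5 * random.uniform(-1.0, 1.0) for _ in range(n)]  # near-end sound source
first = loudspeaker                                             # first sound signal, idealized
acoustic_ref = [0.8 * u for u in loudspeaker]                   # acoustic reference microphone signal
second = [s + 0.5 * u for s, u in zip(near_end, loudspeaker)]   # second sound signal with residual echo

# One of the possible mappings of the three references onto the three stages.
fourth, fifth, sixth = first, acoustic_ref, loopback
out1 = cancel_stage(second, fourth)   # first echo cancellation signal
out2 = cancel_stage(out1, fifth)      # second echo cancellation signal
out3 = cancel_stage(out2, sixth)      # third echo cancellation signal

def echo_power(signal, tail=2000):
    """Mean power of (signal - near_end) over the last `tail` samples."""
    pairs = list(zip(signal, near_end))[-tail:]
    return sum((x - s) ** 2 for x, s in pairs) / tail

echo_before, echo_after = echo_power(second), echo_power(out3)
print(f"residual echo power: before {echo_before:.4f}, after {echo_after:.4f}")
```

In this idealized setup the first stage already removes most of the echo because its reference contains the nonlinear component; the later stages then have little left to cancel.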

Optionally, the mapping of the first sound signal, the acoustic reference microphone signal and the loopback reference signal into a fourth sound signal, a fifth sound signal and a sixth sound signal is performed in one of the following manners:

taking the first sound signal as the fourth sound signal, the loopback reference signal as the fifth sound signal, and the acoustic reference microphone signal as the sixth sound signal;

taking the loopback reference signal as the fourth sound signal, the first sound signal as the fifth sound signal, and the acoustic reference microphone signal as the sixth sound signal;

taking the first sound signal as the fourth sound signal, the acoustic reference microphone signal as the fifth sound signal, and the loopback reference signal as the sixth sound signal;

taking the acoustic reference microphone signal as the fourth sound signal, the first sound signal as the fifth sound signal, and the loopback reference signal as the sixth sound signal;

taking the loopback reference signal as the fourth sound signal, the acoustic reference microphone signal as the fifth sound signal, and the first sound signal as the sixth sound signal;

and taking the acoustic reference microphone signal as the fourth sound signal, the loopback reference signal as the fifth sound signal, and the first sound signal as the sixth sound signal.
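The six manners listed above are exactly the 3! = 6 orderings of the three reference signals. A short sketch that enumerates them (the string labels are only illustrative names for the signals):

```python
from itertools import permutations

references = (
    "first sound signal",
    "acoustic reference microphone signal",
    "loopback reference signal",
)

# Each ordering assigns the three references to the fourth, fifth and sixth sound signals.
mappings = list(permutations(references))
for fourth, fifth, sixth in mappings:
    print(f"fourth = {fourth}; fifth = {fifth}; sixth = {sixth}")
```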

Optionally, the method further includes:

and converting the multi-path microphone signals into time-frequency domain sound signals.

Optionally, the microphone array comprises a linear array, the microphones comprise omnidirectional microphones, and the microphone array collects sound source signals of a near field.

Optionally, the acoustic reference microphone comprises a low sensitivity microphone.

The present application also provides an echo cancellation device, for use in a conferencing apparatus, characterized in that,

the conference device includes: a first microphone array, a second microphone, a loudspeaker;

the device comprises:

a signal acquisition unit, configured to acquire multi-path microphone signals through the first microphone array, wherein the microphone signals comprise sound source signals and echo signals; to acquire a microphone signal through the second microphone as an acoustic reference microphone signal; and to acquire a loopback reference signal from the loudspeaker;

a first beamforming unit, configured to enhance the echo signal and suppress the sound source signal by a first beamforming algorithm directed at the loudspeaker to obtain a first sound signal, wherein the first sound signal includes a linear echo signal and a nonlinear echo signal;

and a filtering unit, configured to perform linear adaptive filtering processing according to the first sound signal, the acoustic reference microphone signal, the loopback reference signal and the microphone signals to obtain an echo cancellation signal.

Optionally, the device further includes:

a second beamforming unit, configured to suppress the echo signal and enhance the sound source signal by a second beamforming algorithm directed at a target sound source to obtain a second sound signal;

and the filtering unit is specifically configured to perform linear adaptive filtering processing according to the first sound signal, the second sound signal, the acoustic reference microphone signal and the loopback reference signal to obtain an echo cancellation signal.

Optionally, the filtering unit is specifically configured to:

obtain a third sound signal according to the first sound signal, the acoustic reference microphone signal and the loopback reference signal; and perform linear adaptive filtering processing according to the second sound signal and the third sound signal to obtain an echo cancellation signal.

Optionally, the filtering unit includes:

a data mapping subunit, configured to map the first sound signal, the acoustic reference microphone signal and the loopback reference signal into a fourth sound signal, a fifth sound signal and a sixth sound signal;

the first filtering subunit is used for executing linear adaptive filtering processing according to the second sound signal and the fourth sound signal to obtain a first echo cancellation signal;

the second filtering subunit is configured to perform linear adaptive filtering processing according to the first echo cancellation signal and the fifth sound signal to obtain a second echo cancellation signal;

a third filtering subunit, configured to perform linear adaptive filtering processing according to the second echo cancellation signal and the sixth sound signal, so as to obtain a third echo cancellation signal.

The present application further provides a conference device, comprising:

a speaker;

a first microphone array;

a second microphone;

a processor; and

a memory for storing a program implementing the above method, wherein the method is executed when the device is powered on and the program is run by the processor.

The present application also provides a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the various methods described above.

The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the various methods described above.

Compared with the prior art, the method has the following advantages:

The echo cancellation method provided by the embodiment of the application uses the electrical reference signal of conventional echo cancellation and combines it with microphone array beamforming and an acoustic reference microphone to realize multi-reference echo cancellation. With this processing, the echo signal is estimated by the microphone array, which captures not only the linear part of the echo signal along the transmission path but also the nonlinear part. The echo signal estimated by the microphone array can therefore serve as a new reference signal for linear adaptive filtering, which reduces the influence of nonlinearities in actual products on the echo cancellation system, effectively filters out the nonlinear components of the echo signal, and improves the echo cancellation effect.

Drawings

Fig. 1 is a schematic flowchart of an embodiment of an echo cancellation method provided in the present application;

Fig. 2 is a schematic diagram of an apparatus structure of an embodiment of an echo cancellation method provided in the present application;

Fig. 3 is a schematic sound direction diagram of an embodiment of an echo cancellation method provided in the present application;

Fig. 4 is a beam diagram of an embodiment of an echo cancellation method provided in the present application;

Fig. 5 is a schematic flowchart of an embodiment of an echo cancellation method provided in the present application;

Fig. 6 is a schematic flowchart of another embodiment of an echo cancellation method provided in the present application.

Detailed Description

In the following description, numerous specific details are set forth to provide a thorough understanding of the present application. The application can, however, be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the application is therefore not limited to the specific implementations disclosed below.

The application provides an echo cancellation method and device and a conference terminal. Each of the schemes is described in detail in the following examples.

A typical echo cancellation method is based on a loopback reference signal and proceeds as follows. The electrical signal sent to the loudspeaker for playback is captured at the loudspeaker playback end and sent to the microphone acquisition end as the loopback reference signal. Linear adaptive filtering is performed on the loopback reference signal and the signal acquired by the microphone; after the adaptive filter converges, its coefficients fit the acoustic propagation function of the echo signal from the loopback acquisition point to the microphone acquisition point. The echo component in the microphone signal is estimated by filtering the loopback reference signal with the fitted coefficients, and the acoustic echo played by the loudspeaker is finally cancelled by subtracting the estimated echo component from the microphone signal. The design of this method rests on the premise that the acoustic propagation function to be fitted is a linear system, which is why linear adaptive filtering is generally adopted.
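The conventional single-reference scheme just described can be sketched in a few lines. The following is a minimal time-domain sketch, assuming a hypothetical three-tap linear echo path, a noise-free microphone signal and no near-end talker; the function name `nlms_echo_cancel`, the step size and the tap count are illustrative choices, not values from the application.

```python
import random

def nlms_echo_cancel(reference, mic, taps=4, mu=0.5, eps=1e-8):
    """Cancel the linear echo of `reference` contained in `mic` with an NLMS filter."""
    w = [0.0] * taps      # adaptive filter coefficients (fit the echo path)
    buf = [0.0] * taps    # most recent reference samples, newest first
    out = []
    for r, d in zip(reference, mic):
        buf = [r] + buf[:-1]
        echo_estimate = sum(wi * bi for wi, bi in zip(w, buf))
        e = d - echo_estimate                     # microphone minus estimated echo
        norm = sum(b * b for b in buf) + eps      # reference energy for normalization
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        out.append(e)
    return out, w

random.seed(0)
far_end = [random.uniform(-1.0, 1.0) for _ in range(4000)]   # loopback reference signal
echo_path = [0.6, -0.3, 0.1]                                  # assumed linear echo path
mic = [sum(h * far_end[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
       for n in range(len(far_end))]

residual, w = nlms_echo_cancel(far_end, mic)
print("fitted coefficients:", [round(x, 3) for x in w])
```

Because the assumed echo path is purely linear, the converged coefficients reproduce it and the residual echo vanishes; this is the premise on which the conventional scheme relies.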

However, in the process of implementing the invention, the inventors found that the existing scheme has at least the following problem: in practical application scenarios, especially when a real loudspeaker plays sound, nonlinear components are inevitably introduced. Because these nonlinear components are not present in the loopback reference signal, they cannot be cancelled by linear filtering, and they may even prevent the linear filter from converging, so that the echo cancellation system fails to work as expected.
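The limitation can be illustrated with a small experiment. This is a sketch under stated assumptions: the loudspeaker distortion is modeled as a hypothetical memoryless cubic term, the room response is a two-tap filter, and `nlms_residual_power` is an illustrative helper. The same linear NLMS filter is run once with the electrical loopback signal as reference and once with a reference that already contains the distortion.

```python
import random

def nlms_residual_power(reference, mic, taps=4, mu=0.5, eps=1e-8, tail=1000):
    """Run NLMS and return the mean residual power over the last `tail` samples."""
    w = [0.0] * taps
    buf = [0.0] * taps
    errors = []
    for r, d in zip(reference, mic):
        buf = [r] + buf[:-1]
        e = d - sum(wi * bi for wi, bi in zip(w, buf))
        norm = sum(b * b for b in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        errors.append(e)
    return sum(e * e for e in errors[-tail:]) / tail

random.seed(1)
electrical = [random.uniform(-1.0, 1.0) for _ in range(6000)]   # loopback reference signal
played = [x - 0.3 * x ** 3 for x in electrical]                 # assumed loudspeaker nonlinearity
echo_path = [0.5, 0.2]                                          # assumed room response
mic = [sum(h * played[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
       for n in range(len(played))]

power_electrical_ref = nlms_residual_power(electrical, mic)  # reference lacks the distortion
power_acoustic_ref = nlms_residual_power(played, mic)        # reference contains the distortion
print(f"residual power, electrical reference: {power_electrical_ref:.2e}")
print(f"residual power, acoustic reference:   {power_acoustic_ref:.2e}")
```

With the electrical reference a nonlinear residual remains, whereas a reference that contains the distortion lets the same linear filter cancel the echo. This is the motivation for adding acoustically derived references.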

First embodiment

Please refer to fig. 1, which is a flowchart illustrating an echo cancellation method according to an embodiment of the present application. In this embodiment, the method may include the steps of:

Step S101: acquiring multi-path microphone signals through the first microphone array, wherein the microphone signals comprise sound source signals and echo signals; acquiring a microphone signal through the second microphone as an acoustic reference microphone signal; and acquiring a loopback reference signal from the loudspeaker.

The method can be applied to echo cancellation in conference terminal equipment of an audio and video conference system. An audio and video conference system transmits sound, images and file data between parties through transmission lines, conference terminals and other devices, enabling real-time, interactive communication so that a conference can be held simultaneously at different sites; it is therefore a typical real-time communication system.

In this embodiment, the method is applied to a conference terminal device. In an echo cancellation application scenario, an echo cancellation device (including an adaptive filter) on the local conference terminal performs echo cancellation processing on the sound signal of the local conference room and sends the echo cancellation signal to the far-end conference room. The conference terminal may be a speakerphone or a video conference terminal including a display and a camera. The conference terminal comprises at least a microphone array and a loudspeaker. The microphone array collects the sound signals of the local conference room, which include the voice of the talker in the local conference room and the voice of the far-end talker played by the loudspeaker; that is, the microphone signals include a sound source signal and an echo signal.

Fig. 2 shows an audio-video all-in-one device based on a linear microphone array; in a specific implementation, the device may also be a speakerphone. In general, any audio communication device that uses a microphone array as its sound pickup unit and a loudspeaker as its sound reproduction unit encounters the echo cancellation problem in practice. In this embodiment, the all-in-one device has M omnidirectional microphones arranged in a linear array as the sound pickup unit. In the device shown in Fig. 2, the loudspeaker sound source is located at one side of the linear microphone array, and a further microphone, referred to in this embodiment as the acoustic reference microphone, is placed beside the loudspeaker. Each microphone of the array picks up the sound of the near-end sound source and the sound of the loudspeaker at the same time. For a communication system, the only signal that the device should ultimately pick up and transmit is the voice of the near-end sound source; the loudspeaker signal received by the microphones, called acoustic echo, must be cancelled and must not be transmitted to the far end. Effective cancellation of acoustic echo is therefore crucial for any communication device. In a specific implementation, the microphone array collects sound simultaneously with multiple microphones arranged in a linear, annular, spherical or other geometry. The microphone array picks up the near-end sound source, which can itself further suppress echo to a certain degree. The microphones may be omnidirectional or directional.

As shown in Fig. 3, the near-end sound source is generally in the broadside direction of the linear microphone array, which may be defined as 90 degrees or 270 degrees, and the loudspeaker sound source is in the end-fire direction of the linear microphone array, which may be defined as 0 degrees. In this embodiment, the all-in-one audio/video communication device has M omnidirectional microphones arranged in a linear array as the sound pickup unit, and a further microphone located near the loudspeaker unit serves as the acoustic reference microphone. Because this microphone is close to the loudspeaker, a loud loudspeaker signal can produce a sound pressure at the acoustic reference microphone that exceeds the upper limit of the normal operating range of an ordinary microphone; a microphone with lower sensitivity can therefore be selected as the acoustic reference microphone.

The loopback reference signal may be the source signal of the loudspeaker, i.e., the sound signal from the far-end conference room, including the voice of the far-end talker. The local conference terminal deployed in the local conference room receives, through the communication network, the sound signal collected by the remote conference terminal deployed in the far-end conference room, and plays it through its loudspeaker.

Step S103: enhancing the echo signal and suppressing the sound source signal by a first beam forming algorithm directed to a loudspeaker to obtain a first sound signal, the first sound signal comprising a linear echo signal and a non-linear echo signal.

The core idea of the method provided by the embodiment of the application is to use the microphone array to obtain beamforming directed at the loudspeaker. A beamforming algorithm is a spatial filtering algorithm implemented on a microphone array: a target direction is set, signals within the target direction range are picked up, and signals outside it are suppressed. Based on a beamforming algorithm, the microphone array can thus pick up sound within a specific direction range and suppress sound from other directions.
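As a generic illustration of spatial filtering, the sketch below evaluates the response of a delay-and-sum beamformer steered to the end-fire direction. It is not the beamformer of the application: delay-and-sum is used only because it is the simplest spatial filter, and the far-field plane-wave model and the speed of sound of 343 m/s are assumptions. The array values (16 microphones, 2 cm spacing, 1 kHz) are the example values used later in this description.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def steering_vector(angle_deg, num_mics=16, spacing=0.02, freq=1000.0):
    """Far-field steering vector of a uniform linear array (0 degrees = end-fire)."""
    delay_per_mic = spacing * math.cos(math.radians(angle_deg)) / SPEED_OF_SOUND
    return [cmath.exp(-2j * math.pi * freq * m * delay_per_mic) for m in range(num_mics)]

def delay_and_sum_weights(target_deg):
    """Weights that time-align and average the microphone signals for `target_deg`."""
    a = steering_vector(target_deg)
    return [x / len(a) for x in a]

def response(weights, angle_deg):
    """Beam response w^H a(angle) of the spatial filter toward `angle_deg`."""
    a = steering_vector(angle_deg)
    return sum(w.conjugate() * x for w, x in zip(weights, a))

w_endfire = delay_and_sum_weights(0.0)
for angle in (0, 30, 60, 90):
    gain_db = 20 * math.log10(abs(response(w_endfire, angle)))
    print(f"{angle:3d} deg: {gain_db:6.1f} dB")
```

The response is 1 (0 dB) in the target direction and smaller elsewhere, which is the pick-up and suppression behaviour described above.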

Beamforming directed at the loudspeaker produces a corresponding beam, such as the 3D beam pattern shown in Fig. 4: it picks up the sound of the loudspeaker and suppresses sound from outside the pick-up direction of the loudspeaker-directed beamforming algorithm.

In a specific implementation, the loudspeaker-directed beamforming or the talker-directed beamforming may be designed based on different theories, such as differential beamforming theory or superdirective beamforming theory, and is not limited to these theories.

In specific implementation, the method can further comprise the following steps: and converting the multi-path microphone signals into time-frequency domain sound signals.

For M microphone inputs, the microphone signals are transformed to the time-frequency domain by a Fourier transform as follows:

$$\mathbf{x}(\omega,n) = [x_1(\omega,n),\ x_2(\omega,n),\ \dots,\ x_M(\omega,n)]^T$$

where $[\cdot]^T$ is the linear-algebra transpose operation, $\omega$ denotes the frequency-domain subband, and $n$ denotes the time frame index. The microphone array signal mainly consists of the signal from the near-end sound source, $\mathbf{x}_s(\omega,n)$, and the acoustic echo signal from the loudspeaker, $\mathbf{x}_u(\omega,n)$, as expressed by the following formula:

$$\mathbf{x}(\omega,n) = \mathbf{x}_s(\omega,n) + \mathbf{x}_u(\omega,n)$$

The above formula is further expressed as:

$$\mathbf{x}(\omega,n) = \mathbf{g}_s(\omega)\, s(\omega,n) + \mathbf{g}_u(\omega)\, u(\omega,n)$$

where $s(\omega,n)$ represents the signal of the near-end sound source at its sounding point, $\mathbf{g}_s(\omega)$ represents the acoustic propagation function between the near-end sound source sounding point and the microphone array, $u(\omega,n)$ represents the signal at the loudspeaker sounding point, and $\mathbf{g}_u(\omega)$ represents the acoustic propagation function between the loudspeaker sounding point and the microphone array.

It should be emphasized that, in the method provided by the present application, the signal $z_{\mathrm{endfire}}(\omega,n)$ obtained after end-fire beamforming contains not only the linear playback signal of the loudspeaker but also the nonlinear components generated during loudspeaker playback, and it is therefore subsequently used as a reference signal in the echo cancellation system.

Step S105: performing linear adaptive filtering processing according to the first sound signal, the acoustic reference microphone signal, the loopback reference signal and the microphone signals to obtain an echo cancellation signal.

Adaptive filtering is performed on the multi-path microphone signals collected by the microphone array, the first sound signal $z_{\mathrm{endfire}}(\omega,n)$, the electrical reference signal looped back from the circuit ($e_{\mathrm{ref}}$), and the acoustic reference signal of the reference microphone ($m_{\mathrm{ref}}$) to obtain an echo cancellation signal. The adaptive filtering algorithm includes, but is not limited to, normalized least mean squares (NLMS), recursive least squares (RLS), and the like.

In one example, the method may further include the steps of:

step S201: suppressing the echo signal and enhancing the sound source signal by a second beamforming algorithm directed at a target sound source to obtain a second sound signal.

The microphone array is used to obtain both beamforming directed at the loudspeaker (end-fire direction) and beamforming directed at the talker (broadside direction). Fig. 4 shows an example 3D beam pattern of the broadside beamforming, which picks up the talker's sound and suppresses the echo.

When a two-direction beamforming design is carried out with an annular microphone array, the pick-up direction of one beamformer (first beamforming algorithm) can be set to point at the loudspeaker inside the array, and the pick-up direction of the other beamformer (second beamforming algorithm) can be set to point at the talker outside the annular array. The beamforming directed at the talker outside the annular array suppresses signals from the array end-fire direction as interference noise and mainly picks up the near-end sound source; the beamforming directed at the loudspeaker inside the array suppresses the near-end talker's signal from outside the array as interference noise and mainly picks up the end-fire loudspeaker sound source.

When a two-direction beamforming design is carried out with a linear microphone array, the pick-up direction of one beamformer (second beamforming algorithm) can be set to the broadside direction, and the pick-up direction of the other beamformer (first beamforming algorithm) can be set to the end-fire direction. The broadside beamforming suppresses signals from the array end-fire direction as interference noise and mainly picks up the near-end sound source; the end-fire beamforming suppresses signals from the array broadside direction as interference noise and mainly picks up the end-fire loudspeaker sound source.

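In this embodiment, the first beamforming algorithm directed at the loudspeaker (end-fire beamforming algorithm) calculates a complex weight vector in each frequency-domain subband as follows:

$$\mathbf{h}_{\mathrm{endfire}}(\omega) = [h_{\mathrm{endfire},1}(\omega),\ h_{\mathrm{endfire},2}(\omega),\ \dots,\ h_{\mathrm{endfire},M}(\omega)]^T$$

Each microphone corresponds to one complex weight, and the complex weights of the M microphones form the complex weight vector.

The second beamforming algorithm directed at the talker (broadside beamforming algorithm) calculates a complex weight vector in each frequency-domain subband as follows:

$$\mathbf{h}_{\mathrm{broadside}}(\omega) = [h_{\mathrm{broadside},1}(\omega),\ h_{\mathrm{broadside},2}(\omega),\ \dots,\ h_{\mathrm{broadside},M}(\omega)]^T$$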

To better describe the method provided by the embodiment of the present application, the following analysis takes as an example the linear microphone array shown in Fig. 2, composed of 16 microphones with a uniform spacing of 2 cm.

Generally, the most important frequency band of speech is around 1 kHz, which is taken as the example analysis band in this embodiment. For convenience of analysis and explanation, the end-fire direction is assumed to be the 0-degree direction and the broadside direction the 90-degree direction. In general, a beam pattern describes the response of the beamforming algorithm in all spatial directions: for example, 0 dB means that the beamforming response is 1, i.e., the signal is picked up without distortion, and -10 dB means that the beam suppresses the signal by 10 dB.
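The relation between a beam-pattern level in dB and the linear response can be made concrete with a small conversion sketch. It assumes the usual amplitude convention of 20·log10; the function names are illustrative.

```python
import math

def db_to_amplitude(level_db):
    """Convert a beam-pattern level in dB to a linear amplitude response."""
    return 10.0 ** (level_db / 20.0)

def amplitude_to_db(response):
    """Convert a linear amplitude response (> 0) to a level in dB."""
    return 20.0 * math.log10(response)

for level in (0.0, -10.0, -20.0):
    print(f"{level:6.1f} dB -> amplitude response {db_to_amplitude(level):.3f}")
```

Under this convention, a 10 dB suppression scales the signal amplitude by about 0.316, which corresponds to one tenth of the power.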

Beamforming may be designed based on different theories, such as differential beamforming theory or superdirective beamforming theory, and is not limited to any one of them. In this embodiment, taking the differential beamforming principle as an example, the beamforming algorithms corresponding to Fig. 4 can be designed for the end-fire direction and the broadside direction, respectively. As can be seen from Fig. 4, the end-fire beamforming theoretically picks up the signal in the end-fire direction (0 degrees) without distortion and suppresses the signal in the broadside direction to the maximum extent; the broadside beamforming suppresses the signal in the end-fire direction to the maximum extent and picks up the signal in the broadside direction without distortion.
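The two design goals (undistorted pick-up in one direction, maximum suppression in the other) can be illustrated with a simple constrained design. The sketch below is a minimum-norm, linearly constrained beamformer swapped in purely for illustration; it is not the differential beamforming design of the embodiment, whose coefficients are not given here. The free-field plane-wave steering model, the speed of sound and all function names are assumptions.

```python
import cmath
import math

def steering_vector(angle_deg, num_mics=16, spacing=0.02, freq=1000.0, c=343.0):
    """Far-field, free-field steering vector of a uniform linear array (0 degrees = end-fire)."""
    tau = spacing * math.cos(math.radians(angle_deg)) / c
    return [cmath.exp(-2j * math.pi * freq * m * tau) for m in range(num_mics)]

def inner(a, b):
    """Hermitian inner product a^H b."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

def constrained_weights(pass_deg, null_deg):
    """Minimum-norm weights h with h^H g_pass = 1 and h^H g_null = 0."""
    g_pass, g_null = steering_vector(pass_deg), steering_vector(null_deg)
    # Gram matrix G = C^H C of the constraint matrix C = [g_pass, g_null].
    g11, g12 = inner(g_pass, g_pass), inner(g_pass, g_null)
    g21, g22 = inner(g_null, g_pass), inner(g_null, g_null)
    det = g11 * g22 - g12 * g21
    # Solve G a = [1, 0]^T in closed form, then form h = C a.
    a1, a2 = g22 / det, -g21 / det
    return [a1 * p + a2 * q for p, q in zip(g_pass, g_null)]

h_endfire = constrained_weights(pass_deg=0.0, null_deg=90.0)    # directed at the loudspeaker
h_broadside = constrained_weights(pass_deg=90.0, null_deg=0.0)  # directed at the talker

for name, h in (("end-fire", h_endfire), ("broadside", h_broadside)):
    r0 = abs(inner(h, steering_vector(0.0)))
    r90 = abs(inner(h, steering_vector(90.0)))
    print(f"{name:9s} beam: |response| at 0 deg = {r0:.3f}, at 90 deg = {r90:.3f}")
```

Under the idealized steering model, each beam has unit response in its own direction and a null in the other, mirroring the behaviour read from Fig. 4.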

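Based on the above idea, under the theoretical assumptions of a free field and plane-wave propagation, the end-fire beamforming (first beamforming algorithm), with weight vector $\mathbf{h}_{\mathrm{endfire}}(\omega)$, satisfies the following relationships with the transfer function $\mathbf{g}_s(\omega)$ in the near-end sound source direction and the transfer function $\mathbf{g}_u(\omega)$ in the loudspeaker direction:

$$\mathbf{h}_{\mathrm{endfire}}^H(\omega)\, \mathbf{g}_s(\omega) = 0, \qquad \mathbf{h}_{\mathrm{endfire}}^H(\omega)\, \mathbf{g}_u(\omega) = 1$$

Meanwhile, the broadside beamforming (second beamforming algorithm), with weight vector $\mathbf{h}_{\mathrm{broadside}}(\omega)$, satisfies the following relationships with the same two transfer functions:

$$\mathbf{h}_{\mathrm{broadside}}^H(\omega)\, \mathbf{g}_s(\omega) = 1, \qquad \mathbf{h}_{\mathrm{broadside}}^H(\omega)\, \mathbf{g}_u(\omega) = 0$$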

As shown in Fig. 5, end-fire beamforming is applied to the microphone array input signal according to the following formula, giving the output:

$$z_{\mathrm{endfire}}(\omega,n) = \mathbf{h}_{\mathrm{endfire}}^H(\omega)\, \mathbf{x}(\omega,n)$$

The formula represents: the first sound signal $z_{\mathrm{endfire}}(\omega,n)$, in which the talker's sound source is suppressed, is the product of the time-frequency-domain multi-path microphone signal $\mathbf{x}(\omega,n)$ and the first beamforming algorithm (first beamforming function), where $\mathbf{h}_{\mathrm{endfire}}(\omega)$ denotes the first beamforming algorithm.

Based on the above analysis, the following relationship is obtained theoretically for the end-fire direction under free-field and plane-wave conditions:

$$z_{\mathrm{endfire}}(\omega,n) = u(\omega,n)$$

The formula shows that the first sound signal, in which the talker's sound source direction is suppressed, equals the signal at the loudspeaker sounding point.

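In a real environment, however, the free-field assumption no longer holds, so the relationship

$$\mathbf{h}_{\mathrm{endfire}}^H(\omega)\, \mathbf{g}_s(\omega) = 0$$

between the end-fire weight vector $\mathbf{h}_{\mathrm{endfire}}(\omega)$ and the near-end propagation function $\mathbf{g}_s(\omega)$ is no longer exactly true. Nevertheless, the value of $\mathbf{h}_{\mathrm{endfire}}^H(\omega)\, \mathbf{g}_s(\omega)$ remains relatively small. In particular, in the application scenario of this embodiment, the loudspeaker is generally spatially close to the microphones, and the shorter the distance, the more signal energy is preserved after acoustic propagation. Thus, even without the theoretical plane-wave and free-field assumptions, the following condition holds, where $\mathbf{g}_u(\omega)$ is the propagation function from the loudspeaker to the array:

$$\left|\mathbf{h}_{\mathrm{endfire}}^H(\omega)\, \mathbf{g}_u(\omega)\, u(\omega,n)\right| \gg \left|\mathbf{h}_{\mathrm{endfire}}^H(\omega)\, \mathbf{g}_s(\omega)\, s(\omega,n)\right|$$

The formula can be expressed as: after the first beamforming processing, the loudspeaker-direction signal is far larger than the signal from the talker's sound source direction.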

Thus, in a practical environment, the following relationship holds for the output of the end-fire beamforming:

$$z_{\mathrm{endfire}}(\omega,n) \approx u(\omega,n)$$

The formula can be expressed as: the first sound signal is approximately equal to the loudspeaker-direction signal.

After broadside beamforming is applied to the microphone array input signals according to the following formula, the output is:

$$z_{\mathrm{broadside}}(\omega,n) = \mathbf{h}_{\mathrm{broadside}}^H(\omega)\, \mathbf{x}(\omega,n)$$

The formula represents: the second sound signal $z_{\mathrm{broadside}}(\omega,n)$, in which the loudspeaker-direction signal is suppressed and the sound source signal is enhanced, is the product of the time-frequency-domain multi-path microphone signal $\mathbf{x}(\omega,n)$ and the second beamforming algorithm (second beamforming function), where $\mathbf{h}_{\mathrm{broadside}}(\omega)$ denotes the second beamforming algorithm.

Based on the above analysis, the following relationship is obtained theoretically under free-field and plane-wave conditions:

$$z_{\mathrm{broadside}}(\omega,n) = s(\omega,n)$$

The formula represents: the second sound signal, in which the loudspeaker-direction signal is suppressed, equals the signal of the near-end sound source at its sounding point.

In one example, step S201 may include the following sub-steps: 1) determining a suppression coefficient according to the acoustic propagation function between the loudspeaker and the microphone array and the per-microphone weight vector computed in the frequency domain by the second beamforming algorithm; 2) obtaining the suppressed echo signal according to the suppression coefficient; 3) taking the sum of the suppressed echo signal and the sound source signal as the second sound signal.

In a real environment, the free-field assumption is no longer satisfied, so this relationship no longer holds exactly. Moreover, because the loudspeaker is close to the microphones, the signal emitted at the loudspeaker still has high energy when it reaches the microphones. The broadside beamforming algorithm can therefore suppress the loudspeaker echo only to a certain extent. The method provided in this embodiment accordingly introduces a suppression coefficient β satisfying β < 1, which gives the following relation:

$z_{\mathrm{broadside}}(\omega,n) \approx s(\omega,n) + \beta\,u(\omega,n)$

This formula states that the second sound signal after the second beamforming process is approximately equal to the sum of the near-end sound source signal at its sound emission point and the loudspeaker-direction signal attenuated to a certain degree.
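
The three sub-steps of this example can be sketched in a few lines of Python for a single frequency bin. This is a minimal illustration under stated assumptions, not the application's implementation: the function names, the 4-microphone delay-and-sum weights, and the propagation vector `h_u` are hypothetical, the suppression coefficient is taken as the magnitude of the beamformer response to the loudspeaker path, and the beamformer gain toward the near-end source is assumed to be 1.

```python
import numpy as np

def suppression_coefficient(w, h_u):
    """Assumed definition: beta = |w^H h_u| for one frequency bin."""
    return float(np.abs(np.vdot(w, h_u)))  # np.vdot conjugates its first argument

def broadside_output(s, u, beta):
    """Second sound signal modelled as near-end source plus attenuated echo."""
    return s + beta * u

# Hypothetical 4-microphone example (values are illustrative, not measured).
M = 4
w_broadside = np.ones(M, dtype=complex) / M        # delay-and-sum toward broadside
phase = 0.3 * np.pi * np.arange(M)                 # assumed inter-microphone phase steps
h_u = 0.8 * np.exp(-1j * phase)                    # assumed loudspeaker propagation vector

beta = suppression_coefficient(w_broadside, h_u)   # sub-step 1
s = np.array([1.0 + 0.0j, 0.5j])                   # near-end source frames
u = np.array([2.0 + 0.0j, 1.0 + 1.0j])             # loudspeaker frames
echo_suppressed = beta * u                         # sub-step 2
z_broadside = broadside_output(s, u, beta)         # sub-step 3
```

With these assumed values the coefficient falls between 0 and 1, so the echo is attenuated but not removed, which is why a later adaptive filtering stage is still needed.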

In this embodiment, step S105 can be implemented as follows: linear adaptive filtering is performed according to the first sound signal, the second sound signal, the acoustic reference microphone signal, and the extraction reference signal to obtain an echo cancellation signal.

Based on the above analysis of the signal model, as shown in Fig. 5, the second sound signal $z_{\mathrm{broadside}}(\omega,n)$, the first sound signal $z_{\mathrm{endfire}}(\omega,n)$, the electrical reference signal extracted from the circuit ($e_{\mathrm{ref}}$), and the acoustic reference signal of the reference microphone ($m_{\mathrm{ref}}$) are used for adaptive filtering to obtain the echo cancellation signal $\mathrm{output1}(\omega,n)$. Taking the normalized least mean square (NLMS) algorithm commonly used in the industry as an example, the echo cancellation signal (output1) can be calculated as follows:

where the filter term is an NLMS adaptive filter with tap length N, and the input vector consists of the current time frame data $z_{\mathrm{mix}}(\omega,n)$ and the previous N-1 frames of historical data $[z_{\mathrm{mix}}(\omega,n-1),\ldots,z_{\mathrm{mix}}(\omega,n-N+1)]$. The NLMS adaptive filter can be updated using the following formula:

where μ is the adaptive filtering step size. Given the characteristics of the NLMS filter, the step size is typically set to a fixed value, such as μ = 0.1, when only the loudspeaker is emitting sound, and to μ = 0 when the loudspeaker and the near-end sound source are emitting sound simultaneously.
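
The NLMS procedure described here can be sketched for a single frequency bin as follows. This is the textbook complex NLMS recursion written under assumptions, not a reproduction of the application's formulas: the function name `nlms_bin`, the regularization constant `eps`, and the `double_talk` flag array (which stands in for an external double-talk detector) are hypothetical.

```python
import numpy as np

def nlms_bin(d, x, N=4, mu=0.1, double_talk=None, eps=1e-8):
    """Textbook NLMS for one frequency bin.

    d: complex frames of the signal to clean (e.g. z_broadside at one omega)
    x: complex frames of the reference (e.g. z_mix at the same omega)
    double_talk: optional boolean array; where True, the step size is 0
    Returns the residual (echo cancellation output) for every frame.
    """
    d = np.asarray(d, dtype=complex)
    x = np.asarray(x, dtype=complex)
    w = np.zeros(N, dtype=complex)      # filter taps, tap length N
    buf = np.zeros(N, dtype=complex)    # [x(n), x(n-1), ..., x(n-N+1)]
    out = np.empty_like(d)
    for n in range(len(d)):
        buf = np.roll(buf, 1)           # shift history by one frame
        buf[0] = x[n]
        e = d[n] - np.vdot(w, buf)      # residual: d(n) - w^H x(n)
        out[n] = e
        # Freeze adaptation (mu = 0) while the near-end source is active.
        step = 0.0 if (double_talk is not None and double_talk[n]) else mu
        w = w + step * buf * np.conj(e) / (np.vdot(buf, buf).real + eps)
    return out

# Illustrative use with synthetic data: the "echo" is a 2-tap filtered reference.
rng = np.random.default_rng(0)
ref = rng.standard_normal(500) + 1j * rng.standard_normal(500)
mic = 0.5 * ref + 0.25 * np.concatenate(([0], ref[:-1]))
residual = nlms_bin(mic, ref, N=4, mu=0.1)
```

Setting the step size to zero during double talk leaves the taps unchanged, so the near-end speech does not pull the filter away from the echo path it has already learned.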

To obtain $z_{\mathrm{mix}}(\omega,n)$, the fusion module of Fig. 5 fuses the array reference signal $z_{\mathrm{endfire}}(\omega,n)$, the electrical reference signal extracted from the circuit ($e_{\mathrm{ref}}$), and the acoustic reference signal of the reference microphone ($m_{\mathrm{ref}}$) into the fused signal $z_{\mathrm{mix}}(\omega,n)$. One example of a fusion method (the method is not limited to this) is:

$z_{\mathrm{mix}}(\omega,n) = \alpha\,z_{\mathrm{endfire}}(\omega,n) + \beta\,m_{\mathrm{ref}}(\omega,n) + \rho\,e_{\mathrm{ref}}$

where α, β, and ρ may be fixed constants; setting a coefficient to zero indicates that the corresponding reference is not used.
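
The weighted fusion is a plain linear combination, as the short sketch below shows. The function name `fuse_references` and the example weights are assumptions chosen for illustration; the source only states that the coefficients may be fixed constants and may be zero.

```python
import numpy as np

def fuse_references(z_endfire, m_ref, e_ref, alpha=1.0, beta=1.0, rho=1.0):
    """Return alpha * z_endfire + beta * m_ref + rho * e_ref, frame by frame."""
    z_endfire, m_ref, e_ref = (
        np.asarray(a, dtype=complex) for a in (z_endfire, m_ref, e_ref)
    )
    return alpha * z_endfire + beta * m_ref + rho * e_ref

# Illustrative frames for one frequency bin (made-up values).
z_endfire = np.array([1.0 + 1.0j, 2.0 + 0.0j])
m_ref = np.array([0.5 + 0.0j, 0.0 + 1.0j])
e_ref = np.array([2.0 + 0.0j, 2.0 + 2.0j])

# beta = 0 drops the acoustic reference microphone from the fused signal.
z_mix = fuse_references(z_endfire, m_ref, e_ref, alpha=0.5, beta=0.0, rho=0.5)
```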

It should be noted that, in the implementation, the following formula may also be used to determine the echo cancellation signal:

that is, the beamforming process directed at the near-end speaker may be omitted.

In one example, step S105 can be implemented as follows: 1) mapping the first sound signal, the acoustic reference microphone signal, and the extraction reference signal into a fourth sound signal, a fifth sound signal, and a sixth sound signal; 2) performing linear adaptive filtering according to the second sound signal and the fourth sound signal to obtain a first echo cancellation signal; performing linear adaptive filtering according to the first echo cancellation signal and the fifth sound signal to obtain a second echo cancellation signal; and performing linear adaptive filtering according to the second echo cancellation signal and the sixth sound signal to obtain a third echo cancellation signal. This processing mode can further improve the echo cancellation effect.

As shown in Fig. 6, unlike the flow of Fig. 5 described above, this approach does not fuse all reference signals before adaptive filtering. Instead, the three reference signals (the array reference signal $z_{\mathrm{endfire}}(\omega,n)$, the electrical reference signal extracted from the circuit $e_{\mathrm{ref}}$, and the acoustic reference signal of the reference microphone $m_{\mathrm{ref}}$) are fed into the mapping stage of Fig. 6, which again outputs three reference signals: A (the fourth sound signal), B (the fifth sound signal), and C (the sixth sound signal). Starting from $z_{\mathrm{broadside}}(\omega,n)$, adaptive filtering is then performed three times in sequence using the reference signals A, B, and C; the three adaptive filters may be the same filter or different filters. Adaptive filtering algorithms include, but are not limited to, normalized least mean square (NLMS), Kalman filtering, and recursive least squares (RLS).
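
The cascaded flow of Fig. 6 can be sketched as three adaptive filters applied in sequence, each one cleaning the output of the previous stage with a different reference. The code below is an illustration under assumptions: `nlms` is a compact textbook NLMS for one frequency bin, `cascade_cancel` is a hypothetical name, and the same filter type is used for all three stages although the text allows different ones.

```python
import numpy as np

def nlms(d, x, N=4, mu=0.1, eps=1e-8):
    """Compact single-bin NLMS; returns the residual d(n) - w^H x(n) per frame."""
    w = np.zeros(N, dtype=complex)
    buf = np.zeros(N, dtype=complex)
    out = np.empty(len(d), dtype=complex)
    for n in range(len(d)):
        buf = np.roll(buf, 1)
        buf[0] = x[n]
        out[n] = d[n] - np.vdot(w, buf)
        w = w + mu * buf * np.conj(out[n]) / (np.vdot(buf, buf).real + eps)
    return out

def cascade_cancel(z_broadside, ref_a, ref_b, ref_c, adaptive_filter=nlms):
    """Three sequential stages using references A, B and C in that order."""
    out1 = adaptive_filter(z_broadside, ref_a)   # first echo cancellation signal
    out2 = adaptive_filter(out1, ref_b)          # second echo cancellation signal
    out3 = adaptive_filter(out2, ref_c)          # third echo cancellation signal
    return out1, out2, out3
```

Passing the filter as an argument keeps the cascade independent of the algorithm, so a Kalman or RLS implementation with the same call signature could be substituted for any stage.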

In specific implementation, the mapping of the first sound signal, the acoustic reference microphone signal, and the extraction reference signal to the fourth, fifth, and sixth sound signals may employ various mapping mechanisms. For example, one of the following modes may be used: 1) taking the first sound signal as the fourth sound signal, the extraction reference signal as the fifth sound signal, and the acoustic reference microphone signal as the sixth sound signal; 2) taking the extraction reference signal as the fourth sound signal, the first sound signal as the fifth sound signal, and the acoustic reference microphone signal as the sixth sound signal; 3) taking the first sound signal as the fourth sound signal, the acoustic reference microphone signal as the fifth sound signal, and the extraction reference signal as the sixth sound signal; 4) taking the acoustic reference microphone signal as the fourth sound signal, the first sound signal as the fifth sound signal, and the extraction reference signal as the sixth sound signal; 5) taking the extraction reference signal as the fourth sound signal, the acoustic reference microphone signal as the fifth sound signal, and the first sound signal as the sixth sound signal; 6) taking the acoustic reference microphone signal as the fourth sound signal, the extraction reference signal as the fifth sound signal, and the first sound signal as the sixth sound signal. The mapping between the array reference signal $z_{\mathrm{endfire}}(\omega,n)$, the electrical reference signal extracted from the circuit ($e_{\mathrm{ref}}$), the acoustic reference signal of the reference microphone ($m_{\mathrm{ref}}$), and the three reference signals A, B, and C is shown in the following table:

Mode | A (fourth sound signal) | B (fifth sound signal) | C (sixth sound signal)
1 | $z_{\mathrm{endfire}}$ | $e_{\mathrm{ref}}$ | $m_{\mathrm{ref}}$
2 | $e_{\mathrm{ref}}$ | $z_{\mathrm{endfire}}$ | $m_{\mathrm{ref}}$
3 | $z_{\mathrm{endfire}}$ | $m_{\mathrm{ref}}$ | $e_{\mathrm{ref}}$
4 | $m_{\mathrm{ref}}$ | $z_{\mathrm{endfire}}$ | $e_{\mathrm{ref}}$
5 | $e_{\mathrm{ref}}$ | $m_{\mathrm{ref}}$ | $z_{\mathrm{endfire}}$
6 | $m_{\mathrm{ref}}$ | $e_{\mathrm{ref}}$ | $z_{\mathrm{endfire}}$
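
The six modes listed are exactly the possible orderings of the three reference signals, so a mapping stage can be expressed as choosing one permutation. The sketch below is illustrative only; the names `REFERENCES`, `all_mappings`, and `map_references` are hypothetical, and the order in which `itertools.permutations` lists the modes need not match the numbering used in the text.

```python
from itertools import permutations

REFERENCES = ("z_endfire", "e_ref", "m_ref")

def all_mappings(references=REFERENCES):
    """Every assignment of the three references to the slots A, B and C."""
    return [dict(zip(("A", "B", "C"), order)) for order in permutations(references)]

def map_references(signals, order):
    """Return (A, B, C) from a dict of named signals and a chosen ordering.

    signals: dict mapping each reference name to its signal
    order:   tuple naming which reference feeds A, B and C respectively
    """
    if sorted(order) != sorted(signals):
        raise ValueError("order must use each reference exactly once")
    return tuple(signals[name] for name in order)

# Mode 1 of the text: A = first sound signal, B = extraction reference,
# C = acoustic reference microphone signal (placeholder signal values).
signals = {"z_endfire": [1, 2], "e_ref": [3, 4], "m_ref": [5, 6]}
A, B, C = map_references(signals, ("z_endfire", "e_ref", "m_ref"))
```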

As can be seen from the foregoing embodiments, the echo cancellation method provided in the embodiments of the present application uses the electrical reference signal of conventional echo cancellation and combines it with microphone array beamforming and an acoustic reference microphone, thereby implementing multi-reference echo cancellation. With this processing mode, the echo signal is estimated using the microphone array, so that not only the linear part of the echo signal along the propagation path but also the nonlinear part can be estimated. The echo signal estimated by the microphone array can therefore serve as a new reference signal for linear adaptive filtering, which reduces the influence of nonlinearity in actual products on the echo cancellation system, effectively filters out nonlinear components of the echo signal, and improves the echo cancellation effect.

Second embodiment

In the foregoing embodiments, an echo cancellation method is provided, and correspondingly, an echo cancellation device is also provided in the present application. The device corresponds to the embodiment of the method. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.

The present application further provides an echo cancellation device for a conference apparatus, where the conference apparatus includes: a first microphone array, a second microphone, a loudspeaker;

the device comprises:

the signal acquisition unit is used for acquiring multi-path microphone signals through a first microphone array, wherein the microphone signals comprise sound source signals and echo signals; and, acquiring the microphone signal by the second microphone as an acoustic reference microphone signal; acquiring an extraction reference signal through a loudspeaker;

a first beamforming unit configured to enhance the echo signal and suppress the sound source signal by a first beamforming algorithm directed to a speaker to obtain a first sound signal, the first sound signal comprising a linear echo signal and a nonlinear echo signal;

Optionally, the method further includes:

Optionally, the filtering unit includes:

a data mapping subunit for mapping the first sound signal, the acoustic reference microphone signal and the extraction reference signal into a fourth sound signal, a fifth sound signal and a sixth sound signal;

and the third filtering subunit is configured to perform linear adaptive filtering processing according to the second echo cancellation signal and the sixth sound signal to obtain a third echo cancellation signal.

Third embodiment

In the foregoing embodiment, an echo cancellation method is provided, and correspondingly, the present application further provides an electronic device. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.

The present application additionally provides an electronic device comprising: a speaker; a microphone array; a processor; and a memory. The memory stores a program implementing the above echo cancellation method; after the device is powered on, the processor runs the program of the method.

The electronic equipment can be an audio and video conference terminal and can also be sound pickup equipment.

Although the present application has been described with reference to preferred embodiments, these are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; the scope of protection should therefore be determined by the appended claims.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

1. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

2. As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims

1. An echo cancellation method for a conference device, characterized in that,

the method comprises the following steps:

enhancing the echo signal and suppressing the sound source signal by a first beam forming algorithm directed to a loudspeaker to obtain a first sound signal, the first sound signal comprising a linear echo signal and a non-linear echo signal;

2. The method of claim 1, further comprising:

3. The method of claim 2, wherein suppressing the echo signal and enhancing the acoustic source signal by a second beamforming algorithm directed to a target acoustic source to obtain a second acoustic signal comprises:

4. The method of claim 2, wherein performing a linear adaptive filtering process based on the first sound signal, the acoustic reference microphone signal, the extraction reference signal, and the microphone signal to obtain an echo cancellation signal comprises:

5. The method of claim 2, wherein said performing a linear adaptive filtering process based on said first sound signal, said acoustic reference microphone signal, said extraction reference signal, and said microphone signal to obtain an echo cancellation signal comprises:

according to the first echo cancellation signal and the fifth sound signal, linear adaptive filtering processing is carried out to obtain a second echo cancellation signal;

6. The method according to any of claims 1 to 4, wherein the mapping of the first sound signal, the acoustic reference microphone signal and the extraction reference signal into a fourth sound signal, a fifth sound signal and a sixth sound signal is performed in one of the following ways:

taking the first sound signal as a fourth sound signal, taking the extraction reference signal as a fifth sound signal, and taking the acoustic reference microphone signal as a sixth sound signal;

taking the first sound signal as a fourth sound signal, taking the acoustic reference microphone signal as a fifth sound signal, and taking the extraction reference signal as a sixth sound signal;

taking the extraction reference signal as a fourth sound signal, taking the acoustic reference microphone signal as a fifth sound signal, and taking the first sound signal as a sixth sound signal;

and taking the acoustic reference microphone signal as a fourth sound signal, taking the extraction reference signal as a fifth sound signal, and taking the first sound signal as a sixth sound signal.

7. The method of any of claims 1 to 4, further comprising:

8. The method of any of claims 1-4, wherein the microphone array comprises a linear array, wherein the microphones comprise omni-directional microphones, and wherein the microphone array collects near-field sound source signals.

9. The method of any of claims 1 to 4, wherein the acoustic reference microphone comprises a low sensitivity microphone.

10. An echo cancellation device, for use in a conferencing apparatus,

the device comprises:

11. The apparatus of claim 10, further comprising:

and the filtering unit is specifically configured to perform linear adaptive filtering processing according to the first sound signal, the second sound signal, the acoustic reference microphone signal, and the extraction reference signal to obtain an echo cancellation signal.

12. The apparatus of claim 11, further comprising:

13. The apparatus of claim 11, wherein the filtering unit comprises:

a second filtering subunit, configured to perform linear adaptive filtering processing according to the first echo cancellation signal and the fifth sound signal to obtain a second echo cancellation signal;

14. A conferencing device, comprising:

a speaker;

a first microphone array;

a second microphone;

a processor; and

a memory for storing a program for implementing the method according to any one of claims 1 to 9, wherein, after the device is powered on, the processor runs the program of the method.