CN111883153A - Microphone array-based double-talk state detection method and device

Info

Publication number: CN111883153A
Application number: CN202010600751.2A
Authority: CN
Inventors: 陈浩磊; 毕永建
Original assignee: Xiamen Yealink Network Technology Co Ltd
Current assignee: Xiamen Yealink Network Technology Co Ltd
Priority date: 2020-06-28
Filing date: 2020-06-28
Publication date: 2020-11-03
Anticipated expiration: 2040-06-28
Also published as: CN111883153B

Abstract

The invention discloses a microphone array-based double-talk state detection method and device. The method comprises the following steps: acquiring analog voice signals corresponding to different channels through microphones, and converting them into digital voice signals to obtain a first channel signal and a second channel signal; performing linear echo cancellation on the first channel signal and the second channel signal using an NLMS algorithm; performing a GCC operation on the first channel signal and the second channel signal, calculating a time delay value through the generalized cross-correlation function of the two channel signals, and then performing peak detection to obtain a maximum peak value; and comparing the maximum peak value with a preset first distance threshold to determine the double-talk detection state of the current microphone. For a hardware terminal in which the relative position of the microphone and the loudspeaker is fixed, the invention can effectively determine the double-talk state, improving the accuracy and adaptability of the determination.

Description

Microphone array-based double-talk state detection method and device

Technical Field

The invention relates to the technical field of communication audio detection, in particular to a method and a device for detecting a double-talk state based on a microphone array.

Background

In real-time telephony, speech scenarios can generally be divided into near-end speech, far-end speech, and double-talk. In a video conference call, the pickup distance is long (3-5 m) and the loudspeaker is much closer to the microphone than the talker is, so the far-end signal picked up by the microphone is much larger than the near-end signal; the difference can exceed 20 dB. This greatly increases the difficulty of echo cancellation, and in double-talk scenarios in particular, severe near-end suppression and incomplete echo cancellation readily occur.

In the double-talk scenario, near-end speech needs to be retained, so the requirements on the accuracy and robustness of the double-talk detection algorithm are very high. If double-talk detection is wrong, the convergence speed of the adaptive filtering algorithm is affected, and in severe cases the filter may even diverge; in addition, double-talk detection is also used to guide the subsequent nonlinear residual suppression algorithm. Existing double-talk detection techniques include energy detection methods and detection methods based on the near-end signal and the signal after linear echo removal.

However, in the course of research and practice on the prior art, the inventors found that it has the following disadvantages: energy-based detection has a high misjudgment rate because the energy threshold is relatively fixed and coarse; detection based on the near-end signal and the signal after linear echo removal requires normalized cross-correlation processing to obtain the detection parameters used for double-talk detection. A double-talk state detection method and apparatus that overcome these drawbacks of the prior art are therefore desired.

Disclosure of Invention

The technical problem to be solved by the embodiments of the present invention is to provide a method and an apparatus for detecting a double-talk state based on a microphone array, which can effectively identify the current double-talk states of a microphone and a speaker.

To solve the above problem, an embodiment of the present invention provides a method for detecting a double talk state based on a microphone array, which at least includes the following steps:

acquiring voice analog signals corresponding to different channels through a microphone, and converting the voice analog signals into voice digital signals to obtain first channel signals and second channel signals;

performing linear echo cancellation processing on the first channel signal and the second channel signal by adopting an NLMS algorithm;

performing GCC operation on the first channel signal and the second channel signal, calculating a time delay value through the generalized cross-correlation function of the two channel signals, and then performing peak value detection to obtain a maximum peak value;

and comparing the maximum peak value with a preset first distance threshold value to judge the double-talk detection state of the current microphone.

As a preferred scheme, the comparing the maximum peak value with a preset first distance threshold value to determine the double-talk detection state of the current microphone specifically comprises:

judging whether the maximum peak value is larger than a preset first distance threshold value or not, and if not, judging that the current microphone is in a far-end speaking state; if yes, judging that the current microphone is in a near-end speaking state or a double-end speaking state, and carrying out next judgment;

judging whether a loudspeaker plays a signal or not; if yes, judging that the current microphone is in a double-talk state; if not, the current microphone is judged to be in the near-end speaking state.

Preferably, the method for detecting a double talk state based on a microphone array further includes:

and carrying out corresponding nonlinear residual error suppression treatment according to the double-talk detection state of the current microphone.

Preferably, the linear echo cancellation processing includes linear echo part processing, nonlinear residual part processing, and a filter update strategy.

As a preferred scheme, the GCC operation is performed on the first channel signal and the second channel signal, and peak detection is performed after a time delay value is calculated through a generalized cross-correlation function of the two channel signals to obtain a maximum peak value, specifically:

respectively carrying out windowing framing processing and short-time Fourier transform on the time domain data of the first channel signal and the second channel signal;

after the frequency domain is converted, calculating to obtain a cross-correlation function of the first channel signal and the second channel signal;

emphasizing the peak value of the cross-correlation function with a weighting function and then performing an inverse Fourier transform to obtain a generalized cross-correlation function;

and selecting a maximum value from the generalized cross-correlation function, and marking the maximum value as a maximum peak value.

Preferably, the selecting a maximum peak value in the generalized cross-correlation function further includes:

calculating the corresponding maximum time delay value according to the distance between the two microphones;

calculating the corresponding number of sampling points from the maximum time delay value and the sampling rate, screening out the corresponding points from the generalized cross-correlation function according to that number, and recording the points as a first sampling set;

and selecting a maximum value from the first sampling set, and recording the maximum value as a maximum peak value.

Preferably, the first distance threshold is set according to the actual array arrangement of the microphones and the distance between the two microphones.

One embodiment of the present invention provides a double talk state detection apparatus based on a microphone array, including:

the signal acquisition module is used for acquiring voice analog signals corresponding to different channels through a microphone and converting the voice analog signals into voice digital signals to obtain a first channel signal and a second channel signal;

the echo cancellation module is used for performing linear echo cancellation processing on the first channel signal and the second channel signal by adopting an NLMS algorithm;

the peak detection module is used for carrying out GCC operation on the first channel signal and the second channel signal, calculating a time delay value through the generalized cross-correlation function of the two channel signals and then carrying out peak detection to obtain a maximum peak value;

and the state judgment module is used for comparing the maximum peak value with a preset first distance threshold value to judge the double-talk detection state of the current microphone.

An embodiment of the present invention provides a terminal device for a microphone array based doubletalk state detection, comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the microphone array based doubletalk state detection method as described above when executing the computer program.

An embodiment of the invention provides a computer-readable storage medium comprising a stored computer program, wherein the computer program, when executed, controls a device on which the computer-readable storage medium is located to perform the method for detecting a double talk state based on a microphone array as described above.

The embodiment of the invention has the following beneficial effects:

the embodiment of the invention provides a double-talk state detection method and a device based on a microphone array, wherein the method comprises the following steps: acquiring voice analog signals corresponding to different channels through a microphone, and converting the voice analog signals into voice digital signals to obtain first channel signals and second channel signals; performing linear echo cancellation processing on the first channel signal and the second channel signal by adopting an NLMS algorithm; performing GCC operation on the first channel signal and the second channel signal, calculating a time delay value through the generalized cross-correlation function of the two channel signals, and then performing peak value detection to obtain a maximum peak value; and comparing the maximum peak value with a preset first distance threshold value to judge the double-talk detection state of the current microphone.

Compared with the prior art, the method and the device for detecting the double-talk state based on the microphone array can overcome the problem of poor adaptability of a DTD algorithm due to large acoustic environment difference in the prior art, can effectively judge the double-talk state of a hardware terminal with a fixed relative position of a microphone and a loudspeaker, improve the accuracy and the adaptability of judging the double-talk state, and improve the processing performance of an echo algorithm.

Drawings

Fig. 1 is a schematic flowchart of a double-talk state detection method based on a microphone array according to a first embodiment of the present invention;

Fig. 2 is a schematic flowchart of another method for detecting a double-talk state based on a microphone array according to a first embodiment of the present invention;

Fig. 3 is a schematic flowchart of the cross-correlation function computation according to a first embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a double-talk state detection apparatus based on a microphone array according to a second embodiment of the present invention;

Fig. 5 is a schematic structural diagram of another double-talk state detection apparatus based on a microphone array according to a second embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In the description of the present application, it is to be understood that the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless otherwise specified.

First, an application scenario that can be provided by the present invention is introduced, such as effective determination of a double-talk state for a hardware terminal with a fixed relative position of a microphone and a speaker.

The first embodiment of the present invention:

Please refer to Figs. 1-3.

As shown in fig. 1, the present embodiment provides a method for detecting a double talk state based on a microphone array, which at least includes the following steps:

S101, acquiring voice analog signals corresponding to different channels through a microphone, and converting the voice analog signals into voice digital signals to obtain first channel signals and second channel signals;

Specifically, for step S101, the raw voice data of the different channels is collected mainly by the microphones, and the analog voice signal of the raw voice data is converted into a digital voice signal.

S102, linear echo cancellation processing is carried out on the first channel signal and the second channel signal by adopting an NLMS algorithm;

Specifically, in step S102, an NLMS (Normalized Least Mean Square) adaptive filter is used to perform linear echo cancellation on the acquired digital voice signals. Throughout the call, the loudspeaker playback signal (referred to as the reference data) and the signal received by the microphone (referred to as the near-end acquisition signal) are captured; the near-end acquisition signal is processed by the algorithm to cancel the far-end party's voice and is then transmitted to the far end.

S103, carrying out GCC operation on the first channel signal and the second channel signal, calculating a time delay value through the generalized cross-correlation function of the two channel signals, and then carrying out peak value detection to obtain a maximum peak value;

Specifically, for step S103, in the DTD (Double Talk Detection) step, a GCC (generalized cross-correlation) operation is performed on the raw data of the first channel and the residual signal of the second channel to obtain a peak value. The raw channel data is the signal captured by the microphone, converted from analog to digital and obtained from the system's driver module. For example, if the device in use has 8 microphones, 8 channels of raw data can be obtained from the system driver. The 1st channel of raw data is selected as the raw data of the first channel, and linear echo cancellation is performed on the 8th channel of raw data to obtain the residual signal of the 8th channel (that is, the residual signal of the second channel).

In this embodiment, the GCC algorithm essentially estimates the time difference from the generalized cross-correlation function of the two signals and, on that basis, normalizes the signals with a frequency-domain weighting function, so as to reduce the influence of noise and reverberation as far as possible before detecting the peak.

To calculate the time difference between the first channel signal and the second channel signal, the cross-correlation function of the two signals is computed and the lag that maximizes it is found; that lag is the time difference between the two signals:

$$R(\tau) = E[x_1(m)\,x_2(m+\tau)]$$

The time difference is $D = \arg\max_{\tau} R(\tau)$, that is, the lag at which the cross-correlation function reaches its maximum.
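The delay estimate described above can be illustrated with a minimal time-domain sketch. It is not the embodiment's implementation (which works in the frequency domain); the function names `cross_correlation` and `estimate_delay`, the `max_lag` search range, and the test signal are all assumptions made for illustration, and the expectation E[·] is approximated by a sample mean.

```python
def cross_correlation(x1, x2, max_lag):
    """Return {lag: R(lag)}, where R(lag) is the mean of x1[m] * x2[m + lag]."""
    n = min(len(x1), len(x2))
    r = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        count = 0
        for m in range(n):
            k = m + lag
            if 0 <= k < n:  # only use sample pairs that both exist
                total += x1[m] * x2[k]
                count += 1
        r[lag] = total / count if count else 0.0
    return r


def estimate_delay(x1, x2, max_lag):
    """Return the lag D = argmax R(lag)."""
    r = cross_correlation(x1, x2, max_lag)
    return max(r, key=r.get)


# Illustrative signals: x2 is x1 delayed by 3 samples.
x1 = [0.0, 1.0, 0.5, -0.3, 0.8, -1.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]
x2 = [0.0, 0.0, 0.0] + x1[:-3]
delay = estimate_delay(x1, x2, max_lag=5)
print(delay)
```

The double loop makes the cost grow with the product of signal length and lag range, which is the motivation for the frequency-domain computation the embodiment uses.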

And S104, comparing the maximum peak value with a preset first distance threshold value, and judging the double-talk detection state of the current microphone.

In a preferred embodiment, the comparing the maximum peak value with a preset first distance threshold value to determine the double-talk detection state of the current microphone specifically includes:

Specifically, for step S104, speech signals can be divided into 3 types by scenario: near-end, far-end, and double-talk. With the DTD algorithm and a suitable threshold, the far-end scenario can be separated out; what remains is the near-end and double-talk scenarios (B), which the threshold alone does not distinguish well. The captured playback signal (a known quantity) is then examined: put simply, if the loudspeaker is playing, the scenario is double-talk or far-end (C); if it is not playing, the scenario is near-end, so the near-end scenario can be separated out by this check. The intersection of B and C is the double-talk scenario.
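The two-stage decision can be sketched as follows. The names `TalkState` and `classify_talk_state` and the boolean `speaker_active` flag are hypothetical; how playback activity is detected is not specified in the text and is simply assumed to be available.

```python
from enum import Enum


class TalkState(Enum):
    FAR_END = "far-end"
    NEAR_END = "near-end"
    DOUBLE_TALK = "double-talk"


def classify_talk_state(max_peak, threshold, speaker_active):
    """Classify one frame from the GCC peak and the loudspeaker activity."""
    # Stage 1: a peak that does not exceed the threshold means far-end only.
    if max_peak <= threshold:
        return TalkState.FAR_END
    # Stage 2: near-end speech is present; playback decides between
    # double-talk (loudspeaker playing) and near-end only.
    if speaker_active:
        return TalkState.DOUBLE_TALK
    return TalkState.NEAR_END


state = classify_talk_state(max_peak=0.8, threshold=0.5, speaker_active=True)
print(state.value)
```

The sketch covers only the three states named in the text; a frame in which nobody speaks and nothing is played is not treated separately here.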

In a preferred embodiment, as shown in fig. 2, the method for detecting a double talk state based on a microphone array further includes:

and S105, performing corresponding nonlinear residual error suppression processing according to the double-talk detection state of the current microphone.

Specifically, for step S105, different processing strategies are applied to the far-end and double-talk scenarios. For example, suppression is increased in the far-end scenario (or the output is set directly to 0), while suppression in the double-talk scenario can be lighter (for example, multiplying by 0.5 or a similar operation). Nonlinear residual suppression suppresses the far-end components that remain after linear echo cancellation, so that the far-end party does not hear their own voice.
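A minimal sketch of state-dependent suppression, using the example factors from the text (0 for far-end, 0.5 for double-talk). The function name `suppress_residual`, the string state labels, and the pass-through gain of 1.0 for the near-end state are assumptions; the text does not state a near-end gain, and a practical suppressor would typically apply gains per frequency band rather than one scalar per frame.

```python
# Example gains: 0.0 and 0.5 come from the text; 1.0 for near-end is assumed.
SUPPRESSION_GAIN = {
    "far-end": 0.0,
    "double-talk": 0.5,
    "near-end": 1.0,
}


def suppress_residual(frame, state):
    """Scale a frame of residual samples by the gain for the detected state."""
    gain = SUPPRESSION_GAIN[state]
    return [gain * sample for sample in frame]


residual_frame = [0.4, 0.2, 0.8]
print(suppress_residual(residual_frame, "double-talk"))
```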

In a preferred embodiment, the linear echo cancellation processing comprises linear echo part processing, nonlinear residual part processing and a filter update strategy.

Specifically, in the linear echo cancellation link, the method adopted in this embodiment is NLMS:

linear echo part: $y(n) = \mathbf{w}^{T}(n)\,\mathbf{x}(n)$;

nonlinear residual part: $e(n) = d(n) - y(n) = d(n) - \mathbf{w}^{T}(n)\,\mathbf{x}(n)$;

filter update strategy: $\mathbf{w}(n+1) = \mathbf{w}(n) + \dfrac{\eta\,e(n)\,\mathbf{x}(n)}{\delta + \mathbf{x}^{T}(n)\,\mathbf{x}(n)}$;

where bold symbols denote vectors and the rest are scalars.

In these formulas, y is the linear echo estimated at the current point; e is the remaining nonlinear part at the current point and is the output value; d is the current near-end acquisition signal; x is the reference signal; w is the filter coefficient vector, for example, with an order of 1600, w is a vector of 1600 points, and it is the key parameter to be solved for in linear echo cancellation; n is a discrete point in the time series; y(n) is the linear echo at time n, calculated by the first formula; x(n) is the vector of reference data at time n, a known quantity, for example, with an order of 1600, it consists of the 1600 points counted back from time n; η is a constant which, by its position in the update formula, acts as the step factor; and δ (symbol assumed, since the denominator term is missing from the source formula) is a small constant introduced to keep the divisor from becoming too small.
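The three NLMS steps can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the embodiment's code: the function names, the parameter names `eta` (step factor) and `delta` (small constant in the denominator), their default values, the 3-tap echo path, and the synthetic reference signal are all chosen for the example. A real canceller would use a much longer filter (the text mentions an order of 1600).

```python
import random


def nlms_step(w, x, d, eta=0.5, delta=1e-6):
    """One NLMS iteration; x holds the newest len(w) reference samples."""
    y = sum(wi * xi for wi, xi in zip(w, x))   # linear echo estimate y(n)
    e = d - y                                  # residual e(n) = d(n) - y(n)
    norm = delta + sum(xi * xi for xi in x)    # regularized input energy
    w_next = [wi + eta * e * xi / norm for wi, xi in zip(w, x)]
    return w_next, y, e


def nlms_filter(reference, mic, order, eta=0.5, delta=1e-6):
    """Run NLMS over whole signals; return final coefficients and residual."""
    w = [0.0] * order
    history = [0.0] * order   # x(n) = [ref[n], ref[n-1], ..., ref[n-order+1]]
    residual = []
    for r, d in zip(reference, mic):
        history = [r] + history[:-1]
        w, _, e = nlms_step(w, history, d, eta, delta)
        residual.append(e)
    return w, residual


# Synthetic far-end-only example: the microphone hears the reference signal
# through a hypothetical 3-tap echo path, with no near-end speech or noise.
rng = random.Random(0)
echo_path = [0.6, -0.3, 0.1]
reference = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [
    sum(h * reference[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
    for n in range(len(reference))
]
w, residual = nlms_filter(reference, mic, order=3)
print([round(c, 3) for c in w])
```

In this noise-free case the coefficients settle on the echo path and the residual decays toward zero. With near-end speech present, `e(n)` also contains that speech, which is why the update is usually slowed or frozen when double-talk is detected.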

In a preferred embodiment, the GCC operation is performed on the first channel signal and the second channel signal, and peak detection is performed after a time delay value is calculated by using a generalized cross-correlation function of the two channel signals to obtain a maximum peak value, specifically:

In particular, because applying the cross-correlation function as a time-domain convolution is computationally expensive, the computation is carried out in the frequency domain (via the Fourier transform and inverse Fourier transform), where it equals the conjugate of the frequency-domain x signal multiplied by the frequency-domain y signal. As shown in Fig. 3, the two channels of time-domain data (d₁, e₂) are first windowed and framed and a short-time Fourier transform is performed; after conversion to the frequency domain, the cross-correlation function is calculated, its peak is emphasized with a weighting function, and an inverse Fourier transform yields the generalized cross-correlation function, denoted R_de(τ). The maximum value is then found from R_de(τ).

Here Φ(ω) is a weighting function; a phase transform (PHAT) weighting function can be used, whose expression is:
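The expression itself does not appear in this text. As a hedged reconstruction, the widely used GCC-PHAT formulation is given below; it is assumed, not confirmed, to match the omitted formula. D₁(ω) and E₂(ω) denote the spectra of d₁ and e₂, the asterisk denotes complex conjugation, and the auxiliary symbol G(ω) for the cross-spectrum is introduced here for convenience.

```latex
G(\omega) = D_1^{*}(\omega)\,E_2(\omega), \qquad
\Phi(\omega) = \frac{1}{\lvert G(\omega) \rvert}, \qquad
R_{de}(\tau) = \mathcal{F}^{-1}\!\left\{ \Phi(\omega)\,G(\omega) \right\}(\tau)
```

With this weighting every frequency bin contributes only its phase, which sharpens the correlation peak.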

Time-domain analysis represents the relationships of a dynamic signal with the time axis as the coordinate; frequency-domain analysis transforms the signal into a representation with the frequency axis as the coordinate. The transformation from the time domain to the frequency domain is achieved mainly by Fourier series and the Fourier transform, that is, the FFT operation; dedicated libraries can be called directly for this, so it is not described in detail here.

d₁ and e₂ are time-domain data; after the FFT they become D₁ and E₂:

D₁ = FFT(d₁)

E₂ = FFT(e₂)

When calculating the cross-correlation function, x₁ and x₂ in E[x₁(m)·x₂(m+τ)] are replaced by D₁ and E₂.
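The frequency-domain computation can be sketched for a single frame as follows, assuming the standard PHAT weighting. To stay self-contained the sketch uses a naive DFT from the standard library in place of an FFT library, and it omits windowing; the names `dft`, `gcc_phat`, `index_to_lag`, the `eps` guard, and the test signal are all assumptions made for illustration.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT library)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for m in range(n):
            acc += x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(d1, e2, eps=1e-12):
    """Return R_de for one frame; d1 and e2 must have the same length."""
    n = len(d1)
    # Zero-pad to twice the frame length so circular lags do not overlap.
    spec_d1 = dft(list(d1) + [0.0] * n)
    spec_e2 = dft(list(e2) + [0.0] * n)
    cross = [a.conjugate() * b for a, b in zip(spec_d1, spec_e2)]
    # PHAT weighting: keep only the phase of each cross-spectrum bin.
    weighted = [g / (abs(g) + eps) for g in cross]
    return [v.real for v in dft(weighted, inverse=True)]


def index_to_lag(index, length):
    """Map a circular index of R_de to a signed lag in samples."""
    return index if index < length // 2 else index - length


# Illustrative frame: e2 is d1 delayed by 2 samples.
rng = random.Random(1)
d1 = [rng.uniform(-1.0, 1.0) for _ in range(14)] + [0.0, 0.0]
e2 = [0.0, 0.0] + d1[:-2]
r_de = gcc_phat(d1, e2)
peak_index = max(range(len(r_de)), key=r_de.__getitem__)
peak_lag = index_to_lag(peak_index, len(r_de))
print(peak_lag)
```

For a pure delay the weighted spectrum is a unit-magnitude phase ramp, so the peak of R_de is close to 1; noise, reverberation, or an uncorrelated second signal lower it, and that is the quantity the threshold comparison relies on.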

In a preferred embodiment, the selecting a maximum peak in the generalized cross-correlation function further includes:

Specifically, considering the actual time delay with which sound from the source reaches the two microphones, the corresponding points must first be screened out from R_de(τ). The points are taken as follows:

The maximum delay is Δt = dx / vs, where dx is the distance between the two microphones and vs is the speed of sound, i.e. 340 m/s.

The number of points is Nd = Fs × Δt, where Fs is the sampling rate.

The values screened out from R_de(τ) are denoted C_de(τ), and the maximum peak point C_MAX is then found from C_de(τ).
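The screening step can be sketched as below. Several details are assumptions not stated in the text: Nd is rounded down to an integer, lags on both sides of zero (from −Nd to +Nd) are kept, and R_de is indexed circularly, as produced by an inverse FFT. The function names and the example values are hypothetical.

```python
import math

SPEED_OF_SOUND = 340.0  # m/s, the value given in the text


def max_lag_samples(mic_distance_m, sample_rate_hz):
    """Nd = Fs * (dx / vs), rounded down to a whole number of samples."""
    delta_t = mic_distance_m / SPEED_OF_SOUND
    return math.floor(sample_rate_hz * delta_t)


def screen_peak(r_de, nd):
    """Keep lags in [-nd, nd] (the set C_de) and return (lag, C_MAX)."""
    n = len(r_de)
    c_de = {lag: r_de[lag % n] for lag in range(-nd, nd + 1)}
    best_lag = max(c_de, key=c_de.get)
    return best_lag, c_de[best_lag]


# Two microphones 0.1 m apart sampled at 16 kHz (illustrative values).
nd = max_lag_samples(0.1, 16000)
print(nd)

# Hypothetical R_de: the 0.95 at index 5 lies outside the physically
# possible lag range for nd = 2 and is ignored.
r_example = [0.1, 0.2, 0.9, 0.1, 0.0, 0.95, 0.0, 0.3]
print(screen_peak(r_example, 2))
```

Restricting the search to physically possible lags discards spurious peaks that cannot come from a source in front of the array.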

If the two microphones are close together and the value of Nd is small, the data can be upsampled; the upsampling is implemented by interpolation. For example, N-times upsampling inserts N-1 zero-valued points at equal spacing between every two signal points.
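A minimal sketch of N-times upsampling: zeros are inserted as described, and the inserted points are then filled by a smoothing filter. The text does not name the filter, so the triangular (linear interpolation) kernel used here is an assumption, as are the function names; a production resampler would normally use a proper low-pass filter.

```python
def zero_stuff(x, factor):
    """Insert factor - 1 zeros after every sample."""
    out = []
    for sample in x:
        out.append(sample)
        out.extend([0.0] * (factor - 1))
    return out


def linear_interpolate(stuffed, factor):
    """Fill the inserted zeros by convolving with a triangular kernel."""
    half = factor - 1
    kernel = [1.0 - abs(k) / factor for k in range(-half, half + 1)]
    n = len(stuffed)
    out = []
    for i in range(n):
        acc = 0.0
        for j, h in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < n:  # samples beyond the ends count as zero
                acc += h * stuffed[idx]
        out.append(acc)
    return out


stuffed = zero_stuff([1.0, 3.0], 2)
upsampled = linear_interpolate(stuffed, 2)
print(stuffed, upsampled)
```

After upsampling by N the effective sampling rate is N × Fs, so Nd grows by the same factor and the delay is resolved on a finer grid.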

In a preferred embodiment, the first distance threshold is set according to the actual array arrangement of the microphones and the distance between the two microphones.

Specifically, the maximum peak C_MAX obtained in the near-end and double-talk states is significantly larger than the C_MAX obtained in the far-end state. A suitable threshold thr can therefore be set according to the actual array arrangement and the distance between the two microphones, so that the near-end and double-talk states can be separated from the far-end state.

Common microphone array arrangements at present are linear arrays and circular arrays. In a linear array the microphones lie on the same straight line; in a circular array several microphones are arranged in a circle. A circular array with an additional microphone at the center is also common; for example, a "6 + 1" array has 6 microphones spaced around a circle plus one microphone at the center.

Algorithm simulation is therefore carried out according to the actual array arrangement and spacing to obtain the maximum peak C_MAX. Because C_MAX differs with the spacing, a suitable value is selected accordingly as the first distance threshold.

The embodiment provides a double-talk state detection method based on a microphone array, which comprises the following steps: acquiring voice analog signals corresponding to different channels through a microphone, and converting the voice analog signals into voice digital signals to obtain first channel signals and second channel signals; performing linear echo cancellation processing on the first channel signal and the second channel signal by adopting an NLMS algorithm; performing GCC operation on the first channel signal and the second channel signal, calculating a time delay value through the generalized cross-correlation function of the two channel signals, and then performing peak value detection to obtain a maximum peak value; and comparing the maximum peak value with a preset first distance threshold value to judge the double-talk detection state of the current microphone.

Compared with the prior art, according to the double-talk state detection method based on the microphone array, after linear echo cancellation, the maximum peak value of relative time delay is found through the generalized cross-correlation function of the two signals, and the double-talk state can be effectively detected by combining whether the loudspeaker produces sound or not, so that the problem that in the prior art, due to the fact that the difference of acoustic environments is large, the adaptability of a DTD algorithm is poor is solved. And the double-talk state can be effectively judged for the hardware terminal with the relative fixed positions of the microphone and the loudspeaker, and the accuracy and the adaptability of judging the double-talk state are improved.

Second embodiment of the invention:

Please refer to Figs. 4-5.

As shown in fig. 4, the present embodiment provides a double talk state detection apparatus based on a microphone array, including:

the signal acquisition module 100 is configured to acquire voice analog signals corresponding to different channels through a microphone, and convert the voice analog signals into voice digital signals to obtain a first channel signal and a second channel signal;

an echo cancellation module 200, configured to perform linear echo cancellation processing on the first channel signal and the second channel signal by using an NLMS algorithm;

the peak detection module 300 is configured to perform GCC operation on the first channel signal and the second channel signal, calculate a delay value through a generalized cross-correlation function of the two channel signals, and then perform peak detection to obtain a maximum peak value;

and the state judgment module 400 is configured to judge the double-talk detection state of the current microphone by comparing the maximum peak value with a preset first distance threshold.

In a preferred embodiment, the state determining module 400 specifically includes:

In a preferred embodiment, the microphone array based doubletalk state detection apparatus further comprises:

a nonlinear residual suppression module 500, configured to perform corresponding nonlinear residual suppression processing according to the double-talk detection state of the current microphone.

In a preferred embodiment, the linear echo cancellation processing comprises linear echo part processing, nonlinear residual part processing and a filter update strategy.

In a preferred embodiment, the peak detection module 300 specifically includes:

In a preferred embodiment, the peak detection module 300 further includes:

The embodiment provides a double talk state detection device based on a microphone array, which comprises: the signal acquisition module 100 is configured to acquire voice analog signals corresponding to different channels through a microphone, and convert the voice analog signals into voice digital signals to obtain a first channel signal and a second channel signal; an echo cancellation module 200, configured to perform linear echo cancellation processing on the first channel signal and the second channel signal by using an NLMS algorithm; the peak detection module 300 is configured to perform GCC operation on the first channel signal and the second channel signal, calculate a delay value through a generalized cross-correlation function of the two channel signals, and then perform peak detection to obtain a maximum peak value; and the state judgment module 400 is configured to judge the double-talk detection state of the current microphone by comparing the maximum peak value with a preset first distance threshold.

According to the double-talk state detection device based on the microphone array, after linear echo cancellation, the maximum peak value of relative time delay is found through the generalized cross-correlation function of two signals, whether a loudspeaker sounds or not is combined, the double-talk state can be effectively detected, and the problems that in the prior art, the difference of acoustic environments is large, and the adaptability of a DTD algorithm is poor are solved. And the double-talk state can be effectively judged for the hardware terminal with the relative fixed positions of the microphone and the loudspeaker, and the accuracy and the adaptability of judging the double-talk state are improved.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the modules may be a logical division, and in actual implementation, there may be another division, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.

The foregoing is directed to the preferred embodiment of the present invention, and it is understood that various changes and modifications may be made by one skilled in the art without departing from the spirit of the invention, and it is intended that such changes and modifications be considered as within the scope of the invention.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

Claims

1. A double-talk state detection method based on a microphone array is characterized by at least comprising the following steps:

2. The method as claimed in claim 1, wherein the determining the double talk detection status of the current microphone by comparing the maximum peak value with a preset first distance threshold is specifically:

3. The microphone array based doubletalk state detection method as claimed in claims 1 and 2, further comprising:

4. The microphone array based doubletalk state detection method of claim 1, wherein the linear echo cancellation process comprises a linear echo part process, a nonlinear residual part process and a filter update strategy.

5. The method for detecting the double-talk state based on the microphone array as claimed in claim 1, wherein the GCC operation is performed on the first channel signal and the second channel signal, and a peak detection is performed after a time delay value is calculated through a generalized cross-correlation function of the two channel signals, so as to obtain a maximum peak value, specifically:

6. The microphone array based doubletalk state detection method of claim 1, wherein the selecting a maximum peak in the generalized cross-correlation function further comprises:

7. The method as claimed in claim 1, wherein the first distance threshold is set according to actual microphone formation and the distance between two microphones.

8. A double-talk state detection apparatus based on a microphone array, comprising:

9. A terminal device for microphone array based double talk state detection, comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor when executing the computer program implementing a microphone array based double talk state detection method as claimed in any of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program, wherein the computer program, when running, controls a device on which the computer-readable storage medium is located to perform the method for detecting double talk state based on a microphone array according to any of claims 1 to 7.