CN111883153A - Microphone array-based double-talk state detection method and device - Google Patents

Info

Publication number
CN111883153A
Authority
CN
China
Prior art keywords
channel signal
double
microphone
state
talk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010600751.2A
Other languages
Chinese (zh)
Other versions
CN111883153B (en)
Inventor
陈浩磊
毕永建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Yealink Network Technology Co Ltd
Original Assignee
Xiamen Yealink Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Yealink Network Technology Co Ltd filed Critical Xiamen Yealink Network Technology Co Ltd
Priority to CN202010600751.2A priority Critical patent/CN111883153B/en
Publication of CN111883153A publication Critical patent/CN111883153A/en
Application granted granted Critical
Publication of CN111883153B publication Critical patent/CN111883153B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephone Function (AREA)

Abstract

The invention discloses a microphone array-based double-talk state detection method and device. The method comprises the following steps: acquiring voice analog signals corresponding to different channels through a microphone, and converting the voice analog signals into voice digital signals to obtain a first channel signal and a second channel signal; performing linear echo cancellation processing on the first channel signal and the second channel signal by adopting an NLMS algorithm; performing a GCC operation on the first channel signal and the second channel signal, calculating a time delay value through the generalized cross-correlation function of the two channel signals, and then performing peak detection to obtain a maximum peak value; and comparing the maximum peak value with a preset first distance threshold to determine the double-talk detection state of the current microphone. The invention can effectively determine the double-talk state for hardware terminals in which the relative position of the microphone and the loudspeaker is fixed, and improves the accuracy and adaptability of double-talk state determination.

Description

Microphone array-based double-talk state detection method and device
Technical Field
The invention relates to the technical field of communication audio detection, in particular to a method and a device for detecting a double-talk state based on a microphone array.
Background
In real-time telephony, speech scenarios can generally be divided into near-end speech, far-end speech, and double-talk. In a video conference call, the pickup distance is long (3-5 m), and the loudspeaker is much closer to the microphone than the talker is. As a result, the far-end signal picked up by the microphone is far stronger than the near-end signal, and the difference can exceed 20 dB. This greatly increases the difficulty of echo cancellation; in double-talk scenarios in particular, severe near-end suppression and incomplete echo cancellation easily occur.
In the double-talk scenario, near-end speech must be retained, so the accuracy and robustness required of the double-talk detection algorithm are very high. If double-talk detection is wrong, the convergence speed of the adaptive filtering algorithm is affected, and in severe cases the filter may even diverge; in addition, double-talk detection is also used to guide the subsequent nonlinear residual suppression algorithm. Current prior-art double-talk detection includes energy detection methods and detection methods based on the near-end signal and the signal after linear echo removal.
However, in the course of research and practice on the prior art, the inventors found that the prior art has the following disadvantages: methods based on energy detection suffer from a high misjudgment rate because the energy threshold is relatively fixed and coarse; methods based on the near-end signal and the signal after linear echo removal require normalized cross-correlation processing to obtain the detection parameters used for double-talk detection. Therefore, a double-talk state detection method and device that overcome these drawbacks of the prior art are desired.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is to provide a method and an apparatus for detecting a double-talk state based on a microphone array, which can effectively identify the current double-talk states of a microphone and a speaker.
To solve the above problem, an embodiment of the present invention provides a method for detecting a double talk state based on a microphone array, which at least includes the following steps:
acquiring voice analog signals corresponding to different channels through a microphone, and converting the voice analog signals into voice digital signals to obtain first channel signals and second channel signals;
performing linear echo cancellation processing on the first channel signal and the second channel signal by adopting an NLMS algorithm;
performing GCC operation on the first channel signal and the second channel signal, calculating a time delay value through the generalized cross-correlation function of the two channel signals, and then performing peak value detection to obtain a maximum peak value;
and comparing the maximum peak value with a preset first distance threshold value to judge the double-talk detection state of the current microphone.
As a preferred scheme, the comparing the maximum peak value with a preset first distance threshold value to determine the double-talk detection state of the current microphone specifically comprises:
judging whether the maximum peak value is larger than a preset first distance threshold value or not, and if not, judging that the current microphone is in a far-end speaking state; if yes, judging that the current microphone is in a near-end speaking state or a double-end speaking state, and carrying out next judgment;
judging whether a loudspeaker plays a signal or not; if yes, judging that the current microphone is in a double-talk state; if not, the current microphone is judged to be in the near-end speaking state.
Preferably, the method for detecting a double talk state based on a microphone array further includes:
and carrying out corresponding nonlinear residual error suppression treatment according to the double-talk detection state of the current microphone.
Preferably, the linear echo cancellation processing includes linear echo part processing, nonlinear residual part processing, and a filter update strategy.
As a preferred scheme, the GCC operation is performed on the first channel signal and the second channel signal, and peak detection is performed after a time delay value is calculated through a generalized cross-correlation function of the two channel signals to obtain a maximum peak value, specifically:
respectively carrying out windowing framing processing and short-time Fourier transform on the time domain data of the first channel signal and the second channel signal;
after the frequency domain is converted, calculating to obtain a cross-correlation function of the first channel signal and the second channel signal;
performing inverse Fourier transform after highlighting the peak value of the cross-correlation function by adopting a weighting function to obtain a generalized cross-correlation function;
and selecting a maximum value from the generalized cross-correlation function, and marking the maximum value as a maximum peak value.
Preferably, the selecting a maximum peak value in the generalized cross-correlation function further includes:
calculating the corresponding maximum time delay value according to the distance between the two microphones,
calculating a corresponding number of sampling points according to the maximum time delay value and the sampling rate, screening out the corresponding points from the generalized cross-correlation function according to that number, and recording the points as a first sampling set;
and selecting a maximum value from the first sampling set, and recording the maximum value as a maximum peak value.
Preferably, the first distance threshold is specifically set according to the actual array arrangement of the microphones and the distance between the two microphones.
One embodiment of the present invention provides a double talk state detection apparatus based on a microphone array, including:
the signal acquisition module is used for acquiring voice analog signals corresponding to different channels through a microphone and converting the voice analog signals into voice digital signals to obtain a first channel signal and a second channel signal;
the echo cancellation module is used for performing linear echo cancellation processing on the first channel signal and the second channel signal by adopting an NLMS algorithm;
the peak detection module is used for carrying out GCC operation on the first channel signal and the second channel signal, calculating a time delay value through the generalized cross-correlation function of the two channel signals and then carrying out peak detection to obtain a maximum peak value;
and the state judgment module is used for comparing the maximum peak value with a preset first distance threshold value to judge the double-talk detection state of the current microphone.
An embodiment of the present invention provides a terminal device for a microphone array based doubletalk state detection, comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the microphone array based doubletalk state detection method as described above when executing the computer program.
An embodiment of the invention provides a computer-readable storage medium comprising a stored computer program, wherein the computer program, when executed, controls a device on which the computer-readable storage medium is located to perform the method for detecting a double talk state based on a microphone array as described above.
The embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides a double-talk state detection method and a device based on a microphone array, wherein the method comprises the following steps: acquiring voice analog signals corresponding to different channels through a microphone, and converting the voice analog signals into voice digital signals to obtain first channel signals and second channel signals; performing linear echo cancellation processing on the first channel signal and the second channel signal by adopting an NLMS algorithm; performing GCC operation on the first channel signal and the second channel signal, calculating a time delay value through the generalized cross-correlation function of the two channel signals, and then performing peak value detection to obtain a maximum peak value; and comparing the maximum peak value with a preset first distance threshold value to judge the double-talk detection state of the current microphone.
Compared with the prior art, the method and the device for detecting the double-talk state based on the microphone array can overcome the problem of poor adaptability of a DTD algorithm due to large acoustic environment difference in the prior art, can effectively judge the double-talk state of a hardware terminal with a fixed relative position of a microphone and a loudspeaker, improve the accuracy and the adaptability of judging the double-talk state, and improve the processing performance of an echo algorithm.
Drawings
Fig. 1 is a schematic flowchart of a double-talk state detection method based on a microphone array according to a first embodiment of the present invention;
Fig. 2 is a schematic flowchart of another method for detecting a double-talk state based on a microphone array according to a first embodiment of the present invention;
Fig. 3 is a schematic flowchart of the cross-correlation function calculation according to a first embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a double-talk state detection apparatus based on a microphone array according to a second embodiment of the present invention;
Fig. 5 is a schematic structural diagram of another double-talk state detection apparatus based on a microphone array according to a second embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description of the present application, it is to be understood that the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless otherwise specified.
First, an application scenario to which the present invention can be applied is introduced: effective determination of the double-talk state for a hardware terminal in which the relative position of the microphone and the speaker is fixed.
The first embodiment of the present invention:
Please refer to Figs. 1 to 3.
As shown in fig. 1, the present embodiment provides a method for detecting a double talk state based on a microphone array, which at least includes the following steps:
S101, acquiring voice analog signals corresponding to different channels through a microphone, and converting the voice analog signals into voice digital signals to obtain a first channel signal and a second channel signal;
Specifically, for step S101, the original voice data of the different channels are collected by the microphones, and the voice analog signal of the original voice data is converted into a voice digital signal.
S102, performing linear echo cancellation processing on the first channel signal and the second channel signal by adopting an NLMS algorithm;
Specifically, in step S102, an NLMS (Normalized Least Mean Square) adaptive filter is used to perform linear echo cancellation processing on the acquired digital voice signals. Throughout the call, the signal played by the speaker (referred to as the reference data) and the signal received by the microphone (referred to as the near-end acquisition signal) are acquired; the near-end acquisition signal is processed by the algorithm to cancel the voice of the opposite end and is then transmitted to the opposite end.
S103, carrying out GCC operation on the first channel signal and the second channel signal, calculating a time delay value through the generalized cross-correlation function of the two channel signals, and then carrying out peak value detection to obtain a maximum peak value;
Specifically, for step S103, in the DTD (Double Talk Detection) algorithm step, a GCC (generalized cross-correlation) operation is carried out on the raw data of the first channel and the residual signal of the second channel to obtain a peak value. The raw channel data is the signal collected by the microphone; the analog signal is converted into a digital signal, which is obtained from the driver module of the system. For example, the device currently used has 8 microphones, so 8 channels of raw data can be obtained from the system driver. The 1st channel of raw data is selected as the raw data of the first channel, and linear echo cancellation is performed on the 8th channel of raw data to obtain the residual signal of the 8th channel (that is, the residual signal of the second channel).
In this embodiment, the GCC algorithm essentially estimates the time difference from the generalized cross-correlation function of the two signals and, on that basis, normalizes the signals with a frequency-domain weighting function to reduce the influence of noise and reverberation as much as possible, and then detects the peak value.
To calculate the time difference between the first channel signal and the second channel signal, the cross-correlation function of the two signals is computed, and the value of τ that maximizes it is taken as the time difference between the two signals:
$R(\tau) = E[x_1(m) \cdot x_2(m+\tau)]$
The time difference $D = \arg\max_{\tau} R(\tau)$ is the lag at which the cross-correlation function reaches its maximum.
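The delay estimate above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the expectation is approximated by a sample average over the overlapping part of the two sequences, and the function names and the `max_lag` search range are assumptions introduced here.

```python
def cross_correlation(x1, x2, max_lag):
    """Estimate R(tau) = E[x1(m) * x2(m + tau)] for tau in [-max_lag, max_lag].

    The expectation is approximated by averaging over the samples for which
    both x1[m] and x2[m + tau] exist (an assumption of this sketch).
    """
    n = min(len(x1), len(x2))
    r = {}
    for tau in range(-max_lag, max_lag + 1):
        total, count = 0.0, 0
        for m in range(n):
            k = m + tau
            if 0 <= k < n:
                total += x1[m] * x2[k]
                count += 1
        r[tau] = total / count if count else 0.0
    return r


def estimate_delay(x1, x2, max_lag):
    """Return D = argmax over tau of R(tau), together with the peak value R(D)."""
    r = cross_correlation(x1, x2, max_lag)
    d = max(r, key=r.get)
    return d, r[d]


# Usage: x2 is x1 delayed by 3 samples, so the estimated delay should be 3.
x1 = [0.0, 1.0, 0.0, -1.0, 2.0, 0.5, 0.0, 0.0, 0.0, 0.0]
x2 = [0.0, 0.0, 0.0] + x1[:-3]
delay, peak = estimate_delay(x1, x2, max_lag=5)
```

A direct time-domain evaluation like this costs on the order of N operations per lag, which is why the description moves the computation to the frequency domain.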
S104, comparing the maximum peak value with a preset first distance threshold value, and judging the double-talk detection state of the current microphone.
In a preferred embodiment, the comparing the maximum peak value with a preset first distance threshold value to determine the double-talk detection state of the current microphone specifically includes:
judging whether the maximum peak value is larger than a preset first distance threshold value or not, and if not, judging that the current microphone is in a far-end speaking state; if yes, judging that the current microphone is in a near-end speaking state or a double-end speaking state, and carrying out next judgment;
judging whether a loudspeaker plays a signal or not; if yes, judging that the current microphone is in a double-talk state; if not, the current microphone is judged to be in the near-end speaking state.
Specifically, for step S104, the speech signals can be classified into 3 types according to the scene: near-end, far-end, and double-talk. With the DTD algorithm and a suitably chosen threshold, the far-end scene can be separated out; the remaining part, denoted B, contains the near-end and double-talk scenes, which are not easily distinguished from each other. The collected playback signal (a known quantity) is then examined: put simply, if the loudspeaker is playing, the scene is either double-talk or far-end (denoted C); if nothing is being played, the scene is near-end, so the near-end scene can be separated by this check. The intersection of B and C is the double-talk scene.
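The two-stage decision described for step S104 can be written as a small function. This is only a sketch under stated assumptions: the names `TalkState`, `classify_state`, and `speaker_playing` are introduced here for illustration, and how playback activity is detected is left outside the sketch.

```python
from enum import Enum


class TalkState(Enum):
    FAR_END = "far-end"
    NEAR_END = "near-end"
    DOUBLE_TALK = "double-talk"


def classify_state(c_max, threshold, speaker_playing):
    """Classify the current frame from the GCC peak and the playback flag.

    c_max:           maximum peak value of the generalized cross-correlation
    threshold:       the preset first distance threshold
    speaker_playing: True if the loudspeaker is currently playing a signal
    """
    # Stage 1: a peak that does not exceed the threshold means far-end speech.
    if c_max <= threshold:
        return TalkState.FAR_END
    # Stage 2: the peak exceeds the threshold (set B: near-end or double-talk);
    # playback activity (set C) separates the two.
    if speaker_playing:
        return TalkState.DOUBLE_TALK
    return TalkState.NEAR_END
```

For example, `classify_state(0.8, 0.5, True)` yields `TalkState.DOUBLE_TALK`, while the same peak with no playback yields `TalkState.NEAR_END`.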
In a preferred embodiment, as shown in fig. 2, the method for detecting a double talk state based on a microphone array further includes:
and S105, performing corresponding nonlinear residual error suppression processing according to the double-talk detection state of the current microphone.
Specifically, for step S105, different processing strategies are applied to the far-end scene and the double-talk scene. For example, suppression is increased in the far-end scene (or the output is set directly to 0), while suppression in the double-talk scene can be lighter (for example, multiplying by 0.5 or a similar operation). Nonlinear residual suppression suppresses the far-end components remaining in the result of linear echo cancellation, so that the far-end party does not hear their own voice.
In a preferred embodiment, the linear echo cancellation processing comprises linear echo part processing, nonlinear residual part processing and a filter update strategy.
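A state-dependent suppression gain of the kind described for step S105 might look like the sketch below. The gains 0 (far-end) and 0.5 (double-talk) are the examples given in the text; the pass-through gain of 1.0 for the near-end state and the function names are assumptions made for this illustration.

```python
# Per-state gains applied to the residual after linear echo cancellation.
# 0.0 and 0.5 follow the examples in the text; 1.0 for near-end is assumed.
SUPPRESSION_GAIN = {
    "far-end": 0.0,      # only echo remains, so mute the residual
    "double-talk": 0.5,  # lighter suppression to preserve near-end speech
    "near-end": 1.0,     # assumed: no far-end component to suppress
}


def apply_suppression(residual_frame, state):
    """Scale one frame of residual samples by the gain for the detected state."""
    gain = SUPPRESSION_GAIN[state]
    return [gain * sample for sample in residual_frame]
```

A practical suppressor would usually apply frequency-dependent, smoothed gains rather than one scalar per frame; the scalar form is kept here only because it mirrors the example in the text.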
Specifically, in the linear echo cancellation stage, the method adopted in this embodiment is NLMS:
Linear echo part: $y(n) = \mathbf{w}^{T}(n)\,\mathbf{x}(n)$;
Nonlinear residual part: $e(n) = d(n) - y(n) = d(n) - \mathbf{w}^{T}(n)\,\mathbf{x}(n)$;
Filter update strategy: $\mathbf{w}(n+1) = \mathbf{w}(n) + \dfrac{e(n)\,\mathbf{x}(n)}{\eta + \mathbf{x}^{T}(n)\,\mathbf{x}(n)}$;
where bold symbols denote vectors and the rest are scalars.
In these formulas, y is the linear echo estimated at the current point; e is the remaining nonlinear (residual) part at the current point, which is the output value; d is the current near-end acquisition signal; x is the reference signal; w is the filter coefficient vector, the key parameter to be solved for in linear echo cancellation (for example, with an order of 1600, w is a vector of 1600 points); n is a discrete point in the time series; y(n) is the linear echo at time n, calculated by the first formula; x(n) is the vector of reference data at time n, a known quantity (for example, with an order of 1600, it consists of the 1600 points counted back from time n); η is a small constant introduced to keep the divisor from becoming too small.
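The three NLMS steps can be sketched as a sample-by-sample loop. This is a minimal illustration under stated assumptions, not the embodiment's code: `eta` is the divisor-protecting constant named in the text, `mu` is a hypothetical step-size parameter that the source does not name (1.0 leaves the normalized update unscaled), and a small filter order is used instead of the 1600 taps mentioned as an example.

```python
def nlms_echo_cancel(reference, mic, order=4, mu=1.0, eta=1e-6):
    """Run NLMS over two equal-length sample lists.

    reference: loudspeaker (reference) samples x
    mic:       near-end acquisition samples d
    Returns the residual e(n) for every sample and the final filter w.
    """
    w = [0.0] * order   # filter coefficients
    x = [0.0] * order   # last `order` reference samples, newest first
    residual = []
    for n in range(len(mic)):
        x = [reference[n]] + x[:-1]
        y = sum(wi * xi for wi, xi in zip(w, x))      # linear echo estimate
        e = mic[n] - y                                # residual (output)
        norm = eta + sum(xi * xi for xi in x)         # protected divisor
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        residual.append(e)
    return residual, w
```

With a noise-free echo path shorter than the filter order, the coefficients converge toward the path's impulse response and the residual decays toward zero. During double-talk the near-end speech acts as a disturbance on this update, which is why the detected state is also used to guide adaptation.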
In a preferred embodiment, the GCC operation is performed on the first channel signal and the second channel signal, and peak detection is performed after a time delay value is calculated by using a generalized cross-correlation function of the two channel signals to obtain a maximum peak value, specifically:
respectively carrying out windowing framing processing and short-time Fourier transform on the time domain data of the first channel signal and the second channel signal;
after the frequency domain is converted, calculating to obtain a cross-correlation function of the first channel signal and the second channel signal;
performing inverse Fourier transform after highlighting the peak value of the cross-correlation function by adopting a weighting function to obtain a generalized cross-correlation function;
and selecting a maximum value from the generalized cross-correlation function, and marking the maximum value as a maximum peak value.
Specifically, because computing the cross-correlation function directly as a time-domain convolution is computationally expensive, it is computed in the frequency domain (via the Fourier transform and the inverse Fourier transform), where it equals the conjugate of the spectrum of signal x multiplied by the spectrum of signal y. As shown in Fig. 3, the two channels of time-domain data ($d_1$, $e_2$) are first windowed and framed and then short-time Fourier transformed; once in the frequency domain, the cross-correlation function is calculated, its peak is highlighted with a weighting function, and an inverse Fourier transform yields the generalized cross-correlation function, denoted $R_{de}(\tau)$. The maximum value is then found in $R_{de}(\tau)$.
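Here $\Phi(\omega)$ is the weighting function; a phase transform (PHAT) weighting function can be used. Its expression appears in the original publication as an equation image (Figure BDA0002558552050000091) that is not reproduced in this text.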
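For orientation, the phase transform weighting as it is commonly defined in the GCC literature is sketched below. It is an assumption that the original equation image showed this standard form. The symbol $G_{de}(\omega)$ for the cross-power spectrum is introduced here for illustration, and $D_1(\omega)$ and $E_2(\omega)$ denote the spectra of the two channels.

```latex
G_{de}(\omega) = D_1^{*}(\omega)\,E_2(\omega), \qquad
\Phi(\omega) = \frac{1}{\lvert G_{de}(\omega) \rvert}, \qquad
R_{de}(\tau) = \mathcal{F}^{-1}\!\left[\,\Phi(\omega)\,G_{de}(\omega)\,\right]
```

Dividing by the magnitude keeps only the phase of the cross-power spectrum, which sharpens the correlation peak and reduces the influence of the signal's spectral shape.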
Time-domain analysis represents the relationship of dynamic signals with the time axis as the coordinate; frequency-domain analysis represents the signal with the frequency axis as the coordinate. The transformation from the time domain to the frequency domain is mainly achieved by the Fourier series and the Fourier transform, that is, by an FFT operation. Dedicated libraries that can be called directly exist for this, so it is not described in detail here.
$d_1$ and $e_2$ are time-domain data; after the FFT they become $D_1$ and $E_2$:
$D_1 = \mathrm{FFT}(d_1)$
$E_2 = \mathrm{FFT}(e_2)$
When calculating the cross-correlation function, $x_1$ and $x_2$ in $E[x_1(m) \cdot x_2(m+\tau)]$ are replaced by $D_1$ and $E_2$.
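The frequency-domain computation can be sketched for a single frame as follows. Several assumptions apply: a naive O(N²) DFT stands in for an FFT library so that the example needs only the standard library, windowing and framing are omitted, the standard PHAT weighting (division by the cross-spectrum magnitude) is used, and the function names and the `eps` guard are introduced here for illustration.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT library call)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for m in range(n):
            acc += x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(d1, e2, eps=1e-12):
    """Generalized cross-correlation of one frame with PHAT weighting.

    Index tau of the result holds R_de(tau); negative lags wrap around to the
    end of the list, as with any inverse DFT output.
    """
    big_d1 = dft(d1)
    big_e2 = dft(e2)
    # Cross-power spectrum: conjugate of the first spectrum times the second.
    cross = [a.conjugate() * b for a, b in zip(big_d1, big_e2)]
    # PHAT weighting: keep the phase only (eps avoids division by zero).
    weighted = [g / (abs(g) + eps) for g in cross]
    return [v.real for v in dft(weighted, inverse=True)]


def peak_lag(r):
    """Return (lag, value) of the largest entry, mapping wrapped indices to negative lags."""
    n = len(r)
    k = max(range(n), key=lambda i: r[i])
    lag = k if k <= n // 2 else k - n
    return lag, r[k]
```

If the second frame is a circularly shifted copy of the first, the PHAT-weighted correlation is close to a unit impulse at the shift, so the peak stands out clearly from the rest of the function.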
In a preferred embodiment, the selecting a maximum peak in the generalized cross-correlation function further includes:
calculating the corresponding maximum time delay value according to the distance between the two microphones,
calculating a corresponding number of sampling points according to the maximum time delay value and the sampling rate, screening out the corresponding points from the generalized cross-correlation function according to that number, and recording the points as a first sampling set;
and selecting a maximum value from the first sampling set, and recording the maximum value as a maximum peak value.
Specifically, considering the actual time delay with which a sound source reaches the two microphones, the corresponding points must first be screened out from $R_{de}(\tau)$. The points are selected as follows:
The maximum delay is $\Delta t = dx / vs$, where $dx$ is the distance between the two microphones and $vs$ is the speed of sound, i.e. 340 m/s.
The number of points is $Nd = Fs \times \Delta t$, where $Fs$ is the sampling rate.
The values screened out from $R_{de}(\tau)$ are denoted $C_{de}(\tau)$, and the maximum peak point $C_{MAX}$ is then found in $C_{de}(\tau)$.
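The point selection can be sketched as below. The text does not say how a fractional number of points is rounded or whether negative lags are included; rounding up and a symmetric search over lags from -Nd to +Nd are assumptions of this sketch, as are the function names. The correlation is taken as an inverse-DFT style list in which negative lags wrap around to the end.

```python
import math

SPEED_OF_SOUND = 340.0  # m/s, the value given in the text


def max_lag_samples(mic_distance_m, sample_rate_hz):
    """Nd = Fs * dt with dt = dx / vs, rounded up to a whole sample (assumed)."""
    dt = mic_distance_m / SPEED_OF_SOUND
    return math.ceil(sample_rate_hz * dt)


def windowed_peak(r, nd):
    """Find the maximum of r over the physically possible lags -nd..nd.

    r is a circular correlation list: r[tau] for tau >= 0, r[len(r) + tau]
    for tau < 0. Returns (c_max, tau).
    """
    n = len(r)
    candidates = [(r[tau % n], tau) for tau in range(-nd, nd + 1)]
    return max(candidates)
```

As a worked example, a spacing of 0.1 m at a 16 kHz sampling rate gives Δt ≈ 0.294 ms and Fs × Δt ≈ 4.7, so only about 5 lags on either side of zero can correspond to a real source. Peaks outside that range are ignored.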
If the two microphones are close together and the value of Nd is therefore small, the data can be upsampled, with the upsampling realized by interpolation. For example, N-times upsampling is performed by inserting N-1 zero-valued points at equal spacing between every two adjacent signal points.
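The zero-insertion step can be sketched as follows; the function name is introduced here for illustration. Zero insertion is normally followed by a low-pass interpolation filter to remove spectral images. The text does not describe that filter, so it is left out of the sketch.

```python
def upsample_zero_insert(signal, factor):
    """Upsample by `factor`: follow every sample with factor - 1 zeros."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    out = []
    for sample in signal:
        out.append(sample)
        out.extend([0.0] * (factor - 1))
    return out
```

For instance, 3-times upsampling of three samples produces nine samples, with two zeros after each original value. The original sampling rate Fs is multiplied by the same factor, so Nd grows accordingly and the lag search has more points to work with.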
In a preferred embodiment, the first distance threshold is set according to the actual array arrangement of the microphones and the distance between the two microphones.
Specifically, the maximum peak $C_{MAX}$ obtained in the near-end and double-talk states is significantly larger than the $C_{MAX}$ obtained in the far-end state. An appropriate threshold thr can therefore be determined from the actual array arrangement and the distance between the two microphones, so that the near-end and double-talk states can be distinguished from the far-end state.
Commonly used microphone array arrangements at present are linear arrays and circular arrays. In a linear array the microphones lie on the same straight line; in a circular array a plurality of microphones are arranged on a circle. A circular array with an additional microphone at the center is also common; for example, a "6 + 1" circular array places 6 microphones around a circle and adds one microphone at the center of the circle.
Algorithm simulation is therefore carried out for the actual array arrangement and spacing to obtain the corresponding maximum peak values $C_{MAX}$. Because these values differ with the arrangement and spacing, a suitable value is selected from them as the first distance threshold.
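The text does not state how the threshold is derived from the simulated peak values. One simple possibility, assumed here purely for illustration, is to take the midpoint between the largest peak seen in the far-end state and the smallest peak seen in the near-end or double-talk state. The function name `pick_threshold` and its inputs are hypothetical.

```python
def pick_threshold(far_end_peaks, near_or_double_peaks):
    """Midpoint threshold between two sets of simulated maximum peak values.

    far_end_peaks: peak values simulated for the far-end state.
    near_or_double_peaks: peak values simulated for the near-end
    and double-talk states.
    Raises ValueError when the two sets overlap, because no single
    threshold can then separate them.
    """
    highest_far = max(far_end_peaks)
    lowest_near = min(near_or_double_peaks)
    if highest_far >= lowest_near:
        raise ValueError("peak values overlap; no separating threshold")
    return (highest_far + lowest_near) / 2.0
```

The simulation would need to be repeated for each array arrangement and spacing, since the peak values depend on both.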
The embodiment provides a double-talk state detection method based on a microphone array, which comprises the following steps: acquiring voice analog signals corresponding to different channels through a microphone, and converting the voice analog signals into voice digital signals to obtain first channel signals and second channel signals; performing linear echo cancellation processing on the first channel signal and the second channel signal by adopting an NLMS algorithm; performing GCC operation on the first channel signal and the second channel signal, calculating a time delay value through the generalized cross-correlation function of the two channel signals, and then performing peak value detection to obtain a maximum peak value; and comparing the maximum peak value with a preset first distance threshold value to judge the double-talk detection state of the current microphone.
Compared with the prior art, the microphone-array-based double-talk state detection method of this embodiment finds, after linear echo cancellation, the maximum peak over the relative time delay in the generalized cross-correlation function of the two signals, and combines it with whether the loudspeaker is producing sound, so that the double-talk state can be detected effectively. This solves the problem that prior-art DTD algorithms adapt poorly because acoustic environments differ widely. For hardware terminals in which the positions of the microphone and the loudspeaker are fixed relative to each other, the double-talk state can be judged effectively, which improves the accuracy and adaptability of the judgment.
Second embodiment of the invention:
Please refer to Figs. 4 and 5.
As shown in fig. 4, the present embodiment provides a double talk state detection apparatus based on a microphone array, including:
the signal acquisition module 100 is configured to acquire voice analog signals corresponding to different channels through a microphone, and convert the voice analog signals into voice digital signals to obtain a first channel signal and a second channel signal;
an echo cancellation module 200, configured to perform linear echo cancellation processing on the first channel signal and the second channel signal by using an NLMS algorithm;
the peak detection module 300 is configured to perform GCC operation on the first channel signal and the second channel signal, calculate a delay value through a generalized cross-correlation function of the two channel signals, and then perform peak detection to obtain a maximum peak value;
and the state judgment module 400 is configured to judge the double-talk detection state of the current microphone by comparing the maximum peak value with a preset first distance threshold.
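The linear echo cancellation performed by the echo cancellation module can be sketched with a textbook NLMS adaptive filter. This is a generic single-channel formulation, not the filter update strategy of this embodiment; the function name, the tap count, the step size `mu` and the regularization constant `eps` are illustrative assumptions. In a two-microphone setup it would be called once per channel with the same loudspeaker reference.

```python
def nlms_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Run NLMS echo cancellation on one microphone channel.

    far_end: loudspeaker reference samples.
    mic: microphone samples containing the echo of far_end.
    Returns (err, w): the error signal after subtracting the
    estimated echo, and the final filter weights.
    """
    w = [0.0] * taps
    buf = [0.0] * taps  # most recent reference samples, newest first
    err = []
    for x_n, d_n in zip(far_end, mic):
        buf = [x_n] + buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, buf))  # echo estimate
        e = d_n - y
        norm = sum(xi * xi for xi in buf) + eps  # reference power
        # Normalized update: step scaled by the reference power.
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, buf)]
        err.append(e)
    return err, w
```

The error signals of the two channels are what the subsequent GCC operation would work on.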
In a preferred embodiment, the state determining module 400 specifically includes:
judging whether the maximum peak value is larger than a preset first distance threshold value or not, and if not, judging that the current microphone is in a far-end speaking state; if yes, judging that the current microphone is in a near-end speaking state or a double-end speaking state, and carrying out next judgment;
judging whether a loudspeaker plays a signal or not; if yes, judging that the current microphone is in a double-talk state; if not, the current microphone is judged to be in the near-end speaking state.
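The two-stage judgment above maps directly onto a small decision function. The sketch below assumes that loudspeaker activity is already available as a boolean; how that flag is obtained is not specified in the text, and the names `TalkState` and `classify_state` are hypothetical.

```python
from enum import Enum


class TalkState(Enum):
    FAR_END = "far-end"
    NEAR_END = "near-end"
    DOUBLE_TALK = "double-talk"


def classify_state(c_max, threshold, loudspeaker_active):
    """Classify the talk state from the maximum peak and loudspeaker activity."""
    # Stage 1: a peak not larger than the threshold means far-end speech.
    if c_max <= threshold:
        return TalkState.FAR_END
    # Stage 2: a large peak with the loudspeaker playing means double talk;
    # otherwise only the near end is speaking.
    return TalkState.DOUBLE_TALK if loudspeaker_active else TalkState.NEAR_END
```

A peak exactly equal to the threshold is treated as far-end here, because the text requires the peak to be larger than the threshold before the second stage is entered.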
In a preferred embodiment, the microphone array based doubletalk state detection apparatus further comprises:
a nonlinear residual suppression module 500, configured to perform corresponding nonlinear residual suppression processing according to the double-talk detection state of the current microphone.
In a preferred embodiment, the linear echo cancellation processing comprises linear echo part processing, non-residual part processing and a filter update strategy.
In a preferred embodiment, the peak detection module 300 specifically includes:
respectively carrying out windowing framing processing and short-time Fourier transform on the time domain data of the first channel signal and the second channel signal;
after the frequency domain is converted, calculating to obtain a cross-correlation function of the first channel signal and the second channel signal;
performing Fourier transform after highlighting the peak value of the cross-correlation function by adopting a weighting function to obtain a generalized cross-correlation function;
and selecting a maximum value from the generalized cross-correlation function, and marking the maximum value as a maximum peak value.
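The steps above can be sketched for a single frame as follows. The text does not name the weighting function; the PHAT weighting used here is a common choice and is an assumption, as are the function names and the naive DFT, which keeps the example free of external libraries. The sketch returns to the lag domain with an inverse transform, which is the conventional formulation of the generalized cross-correlation.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform, adequate for short frames."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / n if inverse else 1.0
    return [
        scale * sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n)
                    for m in range(n))
        for k in range(n)
    ]


def hann(n):
    """Hann window coefficients for a frame of n samples."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def gcc_phat(frame1, frame2, window=None):
    """Return {lag: value} of the PHAT-weighted GCC for one frame pair.

    A positive lag means frame2 is a delayed copy of frame1.
    """
    n = len(frame1)
    if window is None:
        window = [1.0] * n  # rectangular window
    spec1 = dft([a * w for a, w in zip(frame1, window)])
    spec2 = dft([b * w for b, w in zip(frame2, window)])
    cross = [s2 * s1.conjugate() for s1, s2 in zip(spec1, spec2)]
    # PHAT: keep only the phase so the correlation peak is sharpened.
    weighted = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    corr = [v.real for v in dft(weighted, inverse=True)]
    # Index k holds lag k for k < n/2 and lag k - n otherwise (circular).
    return {(k if k < n // 2 else k - n): corr[k] for k in range(n)}
```

In use, a frame pair would be passed with `window=hann(len(frame1))`, and the maximum peak would be taken as `max(result.values())`, optionally restricted to the physically possible lags.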
In a preferred embodiment, the peak detection module 300 further includes:
calculating the corresponding maximum time delay value according to the distance between the two microphones,
calculating the corresponding number of sampling points from the maximum time delay value and the sampling rate, screening out the corresponding points from the generalized cross-correlation function according to that number, and recording the points as a first sampling set;
and selecting a maximum value from the first sampling set, and recording the maximum value as a maximum peak value.
The embodiment provides a double talk state detection device based on a microphone array, which comprises: the signal acquisition module 100 is configured to acquire voice analog signals corresponding to different channels through a microphone, and convert the voice analog signals into voice digital signals to obtain a first channel signal and a second channel signal; an echo cancellation module 200, configured to perform linear echo cancellation processing on the first channel signal and the second channel signal by using an NLMS algorithm; the peak detection module 300 is configured to perform GCC operation on the first channel signal and the second channel signal, calculate a delay value through a generalized cross-correlation function of the two channel signals, and then perform peak detection to obtain a maximum peak value; and the state judgment module 400 is configured to judge the double-talk detection state of the current microphone by comparing the maximum peak value with a preset first distance threshold.
The microphone-array-based double-talk state detection device of this embodiment finds, after linear echo cancellation, the maximum peak over the relative time delay in the generalized cross-correlation function of the two signals, and combines it with whether the loudspeaker is producing sound, so that the double-talk state can be detected effectively. This solves the problem that prior-art DTD algorithms adapt poorly because acoustic environments differ widely. For hardware terminals in which the positions of the microphone and the loudspeaker are fixed relative to each other, the double-talk state can be judged effectively, which improves the accuracy and adaptability of the judgment.
An embodiment of the present invention provides a terminal device for a microphone array based doubletalk state detection, comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the microphone array based doubletalk state detection method as described above when executing the computer program.
An embodiment of the invention provides a computer-readable storage medium comprising a stored computer program, wherein the computer program, when executed, controls a device on which the computer-readable storage medium is located to perform the method for detecting a double talk state based on a microphone array as described above.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the modules may be a logical division, and in actual implementation, there may be another division, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The foregoing is directed to the preferred embodiment of the present invention, and it is understood that various changes and modifications may be made by one skilled in the art without departing from the spirit of the invention, and it is intended that such changes and modifications be considered as within the scope of the invention.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

Claims (10)

1. A double-talk state detection method based on a microphone array is characterized by at least comprising the following steps:
acquiring voice analog signals corresponding to different channels through a microphone, and converting the voice analog signals into voice digital signals to obtain first channel signals and second channel signals;
performing linear echo cancellation processing on the first channel signal and the second channel signal by adopting an NLMS algorithm;
performing GCC operation on the first channel signal and the second channel signal, calculating a time delay value through the generalized cross-correlation function of the two channel signals, and then performing peak value detection to obtain a maximum peak value;
and comparing the maximum peak value with a preset first distance threshold value to judge the double-talk detection state of the current microphone.
2. The method as claimed in claim 1, wherein the determining the double talk detection status of the current microphone by comparing the maximum peak value with a preset first distance threshold is specifically:
judging whether the maximum peak value is larger than a preset first distance threshold value or not, and if not, judging that the current microphone is in a far-end speaking state; if yes, judging that the current microphone is in a near-end speaking state or a double-end speaking state, and carrying out next judgment;
judging whether a loudspeaker plays a signal or not; if yes, judging that the current microphone is in a double-talk state; if not, the current microphone is judged to be in the near-end speaking state.
3. The microphone array based doubletalk state detection method as claimed in claims 1 and 2, further comprising:
and carrying out corresponding nonlinear residual error suppression treatment according to the double-talk detection state of the current microphone.
4. The microphone array based doubletalk state detection method of claim 1, wherein the linear echo cancellation process comprises a linear echo part process, a non-residual part process and a filter update strategy.
5. The method for detecting the double-talk state based on the microphone array as claimed in claim 1, wherein the GCC operation is performed on the first channel signal and the second channel signal, and a peak detection is performed after a time delay value is calculated through a generalized cross-correlation function of the two channel signals, so as to obtain a maximum peak value, specifically:
respectively carrying out windowing framing processing and short-time Fourier transform on the time domain data of the first channel signal and the second channel signal;
after the frequency domain is converted, calculating to obtain a cross-correlation function of the first channel signal and the second channel signal;
performing Fourier transform after highlighting the peak value of the cross-correlation function by adopting a weighting function to obtain a generalized cross-correlation function;
and selecting a maximum value from the generalized cross-correlation function, and marking the maximum value as a maximum peak value.
6. The microphone array based doubletalk state detection method of claim 1, wherein the selecting a maximum peak in the generalized cross-correlation function further comprises:
calculating the corresponding maximum time delay value according to the distance between the two microphones,
calculating a corresponding sampling rate according to the maximum time delay value, screening out corresponding points from the generalized cross-correlation function according to the sampling rate, and recording the points as a first sampling set;
and selecting a maximum value from the first sampling set, and recording the maximum value as a maximum peak value.
7. The method as claimed in claim 1, wherein the first distance threshold is set according to actual microphone formation and the distance between two microphones.
8. A double-talk state detection apparatus based on a microphone array, comprising:
the signal acquisition module is used for acquiring voice analog signals corresponding to different channels through a microphone and converting the voice analog signals into voice digital signals to obtain a first channel signal and a second channel signal;
the echo cancellation module is used for performing linear echo cancellation processing on the first channel signal and the second channel signal by adopting an NLMS algorithm;
the peak detection module is used for carrying out GCC operation on the first channel signal and the second channel signal, calculating a time delay value through the generalized cross-correlation function of the two channel signals and then carrying out peak detection to obtain a maximum peak value;
and the state judgment module is used for comparing the maximum peak value with a preset first distance threshold value to judge the double-talk detection state of the current microphone.
9. A terminal device for microphone array based double talk state detection, comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor when executing the computer program implementing a microphone array based double talk state detection method as claimed in any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program, wherein the computer program, when running, controls a device on which the computer-readable storage medium is located to perform the method for detecting double talk state based on a microphone array according to any of claims 1 to 7.
CN202010600751.2A 2020-06-28 2020-06-28 Microphone array-based double-end speaking state detection method and device Active CN111883153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010600751.2A CN111883153B (en) 2020-06-28 2020-06-28 Microphone array-based double-end speaking state detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010600751.2A CN111883153B (en) 2020-06-28 2020-06-28 Microphone array-based double-end speaking state detection method and device

Publications (2)

Publication Number Publication Date
CN111883153A (en) 2020-11-03
CN111883153B CN111883153B (en) 2024-02-23

Family

ID=73158114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010600751.2A Active CN111883153B (en) 2020-06-28 2020-06-28 Microphone array-based double-end speaking state detection method and device

Country Status (1)

Country Link
CN (1) CN111883153B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060245583A1 (en) * 2003-07-17 2006-11-02 Matsushita Electric Industrial Co., Ltd. Speech communication apparatus
US20140146963A1 (en) * 2012-11-29 2014-05-29 Texas Instruments Incorporated Detecting Double Talk in Acoustic Echo Cancellation Using Zero-Crossing Rate
CN109348072A (en) * 2018-08-30 2019-02-15 湖北工业大学 A kind of double talk detection method applied to acoustic echo cancellation system
CN109862200A (en) * 2019-02-22 2019-06-07 北京达佳互联信息技术有限公司 Method of speech processing, device, electronic equipment and storage medium
CN110838300A (en) * 2019-11-18 2020-02-25 紫光展锐(重庆)科技有限公司 Echo cancellation processing method and processing system
CN111083297A (en) * 2019-11-14 2020-04-28 维沃移动通信(杭州)有限公司 Echo cancellation method and electronic equipment
CN111145771A (en) * 2020-03-03 2020-05-12 腾讯科技(深圳)有限公司 Voice signal processing method, processing device, terminal and storage medium thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GANSLER, T: ""The fast normalized cross-correlation double-talk detector"", 《SIGNAL PROCESSING》, vol. 86, no. 6, XP024997666, DOI: 10.1016/j.sigpro.2005.07.035 *
张正文: ""基于信号包络和互相关的双端通话检测算法研究"", 《现代电子技术》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409808A (en) * 2021-06-18 2021-09-17 上海盈方微电子有限公司 Echo cancellation time delay estimation method and echo cancellation method
CN113409808B (en) * 2021-06-18 2024-05-03 上海盈方微电子有限公司 Echo cancellation time delay estimation method and echo cancellation method
CN113949776A (en) * 2021-10-19 2022-01-18 随锐科技集团股份有限公司 Double-end talk detection method and device based on double-step fast echo cancellation
CN113949776B (en) * 2021-10-19 2024-04-16 随锐科技集团股份有限公司 Double-end speaking detection method and device based on double-step rapid echo cancellation
CN114283844A (en) * 2021-12-24 2022-04-05 苏州蛙声科技有限公司 Double-talk detection method and device for audio and video conference

Also Published As

Publication number Publication date
CN111883153B (en) 2024-02-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant