CN110634496B - Double-talk detection method and device, computer equipment and storage medium


Info

Publication number
CN110634496B
Authority
CN
China
Prior art keywords
power
audio signal
determining
double
threshold value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911008388.9A
Other languages
Chinese (zh)
Other versions
CN110634496A (en
Inventor
王亮亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd filed Critical Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN201911008388.9A priority Critical patent/CN110634496B/en
Priority to PCT/CN2019/127730 priority patent/WO2021077599A1/en
Publication of CN110634496A publication Critical patent/CN110634496A/en
Application granted granted Critical
Publication of CN110634496B publication Critical patent/CN110634496B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165 Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
  • Telephone Function (AREA)

Abstract

The embodiment of the invention discloses a double-talk detection method and device, computer equipment and a storage medium. The method comprises the following steps: receiving an audio signal from a microphone when performing voice communication, the audio signal having an echo signal; determining a power of the echo signal; determining, according to the power of the echo signal, a threshold value currently used for detecting a double-talk state; determining a power of the audio signal; and if the power of the audio signal is greater than the threshold value, determining that the double-talk state exists in the voice communication. The threshold value for double-talk detection is generated from the statistical characteristics of the echo signal and therefore adapts dynamically to the sound emitted by the opposite-end user. When different levels of interference or noise exist in the environment, the threshold value still adapts to the environment and remains accurate, so that double-talk detection using the threshold value is kept at a low false alarm probability.

Description

Double-talk detection method and device, computer equipment and storage medium
Technical Field
The embodiment of the invention relates to the technology of audio processing, in particular to a double-talk detection method, a double-talk detection device, computer equipment and a storage medium.
Background
In an audio device having both a loudspeaker and a microphone, an audio signal emitted by the loudspeaker reaches the microphone via multiple reflections in space to form an echo signal.
For voice communication such as video conferencing, echo signals can seriously impair communication quality and reduce the speech recognition rate, especially when double talk is present. Echo cancellation therefore generally includes a double-talk detection function to preserve the performance of the filter under double-talk conditions.
At present, double-talk detection based on correlation or energy detection uses a static threshold value as the basis for judgment. When different levels of interference or noise exist in the environment, the static threshold value causes a high false alarm probability in double-talk detection.
Disclosure of Invention
The embodiment of the invention provides a double-talk detection method, a double-talk detection device, computer equipment and a storage medium, which aim to solve the problem of high false alarm probability caused by using a static threshold value to carry out double-talk detection.
In a first aspect, an embodiment of the present invention provides a double-talk detection method, including:
receiving an audio signal from a microphone when performing voice communication, the audio signal having an echo signal;
determining a power of the echo signal;
determining a threshold value of a current detection double-talk state according to the power of the echo signal;
determining a power of the audio signal;
and if the power of the audio signal is greater than the threshold value, determining that the voice communication has the double-talk state.
Optionally, the determining the power of the echo signal includes:
determining a reference audio signal;
determining an average power of the reference audio signal;
and performing attenuation gain on the average power of the reference audio signal as the power of the echo signal.
Optionally, the determining the reference audio signal comprises:
and collecting an audio signal to be played from a loudspeaker as a reference audio signal.
Optionally, the performing attenuation gain on the average power of the reference audio signal as the power of the echo signal includes:
determining an echo path between the microphone and the speaker;
and performing attenuation gain on the average power of the reference audio signal according to the echo path to serve as the power of the echo signal.
Optionally, the determining a threshold value of a current detection double-talk state according to the power of the echo signal includes:
determining a target value of the false alarm probability of detecting the double-talk state;
and under the limitation of the target value, determining a threshold value of the current detection double-talk state based on the power of the echo signal.
Optionally, the determining a threshold value of the current detection double-talk state based on the power of the echo signal under the limitation of the target value includes:
decomposing the power of the echo signal into a variance and a mean;
and determining a threshold value of the current detection double-talk state based on the variance and the average value so as to enable the false alarm probability of the detection double-talk state based on the threshold value to be lower than the target value.
Optionally, the method further comprises:
and carrying out echo cancellation on the audio signal according to the double-talk state.
In a second aspect, an embodiment of the present invention further provides a double-talk detection apparatus, including:
the audio signal receiving module is used for receiving an audio signal from a microphone when voice communication is carried out, wherein the audio signal has an echo signal;
a first power determination module for determining a power of the echo signal;
a threshold value determining module, configured to determine a threshold value of a current detection double-talk state according to the power of the echo signal;
a second power determination module for determining a power of the audio signal;
and the double-talk state determining module is used for determining that the voice communication exists in the double-talk state if the power of the audio signal is greater than the threshold value.
Optionally, the first power determining module includes:
a reference audio signal determination submodule for determining a reference audio signal;
a reference average power determination sub-module for determining an average power of the reference audio signal;
and the attenuation gain submodule is used for carrying out attenuation gain on the average power of the reference audio signal to be used as the power of the echo signal.
Optionally, the reference audio signal determination sub-module comprises:
and the audio signal acquisition unit is used for acquiring the audio signal to be played from the loudspeaker as a reference audio signal.
Optionally, the attenuation gain sub-module comprises:
an echo path determination unit for determining an echo path between the microphone and the speaker;
and the echo path attenuation unit is used for carrying out attenuation gain on the average power of the reference audio signal according to the echo path, and the attenuation gain is used as the power of the echo signal.
Optionally, the threshold value determining module includes:
a false alarm probability determination submodule for determining a target value of the false alarm probability for detecting the double-talk state;
and the threshold value calculation sub-module is used for determining the threshold value of the current detection double-talk state based on the power of the echo signal under the limit of the target value.
Optionally, the threshold value calculation sub-module includes:
a power decomposition unit for decomposing the power of the echo signal into a variance and an average;
and the limit calculation unit is used for determining a threshold value of the current detection double-talk state based on the variance and the average value so as to enable the false alarm probability of the detection double-talk state based on the threshold value to be lower than the target value.
Optionally, the apparatus further comprises:
an echo cancellation module, configured to carry out echo cancellation on the audio signal according to the double-talk state.
In a third aspect, an embodiment of the present invention further provides a computer device, where the computer device includes:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the double-talk detection method according to any one of the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the double talk detection method according to any one of the first aspect.
In the embodiments of the present invention, when voice communication is performed, an audio signal having an echo signal is received from a microphone, the power of the echo signal is determined, and the threshold value currently used for detecting the double-talk state is determined according to that power. The power of the audio signal is then determined. If the power of the audio signal is greater than the threshold value, the audio signal contains the voice signal of the local-end user in addition to the voice signal of the opposite-end user, and it can be determined that the double-talk state exists in the voice communication. Because the sound emitted by the opposite-end user changes constantly, the threshold value for double-talk detection is generated from the statistical characteristics of the echo signal and adapts dynamically to that sound. When different levels of interference or noise exist in the environment, the threshold value still adapts to the environment and remains accurate, so that double-talk detection using the threshold value maintains a low false alarm probability.
Drawings
Fig. 1 is a flowchart of a double-talk detection method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a double-talk detection method according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a double-talk detection apparatus according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a double-talk detection method according to a first embodiment of the present invention. This embodiment is applicable to situations where a threshold value is set dynamically for double-talk detection. The method can be executed by a double-talk detection apparatus, which can be implemented in software and/or hardware and configured in a computer device such as a personal computer, a mobile terminal, or a conference machine. The computer device is configured with a microphone and a speaker. The microphone is a transducer that converts a sound signal into an electrical signal, and the speaker, also called a loudspeaker, is a transducer that converts an electrical signal into a sound signal.
Further, the microphone and the speaker may be independent components, or may be integrated into the same component, such as an earphone configured with a microphone; the microphone and the speaker may be directly disposed in the computer device, or may be connected to the computer device through a wire or wirelessly (e.g., Wi-Fi, bluetooth, etc.), which is not limited in this embodiment of the present invention.
As shown in fig. 1, the method specifically includes the following steps:
S101, when voice communication is carried out, an audio signal is received from a microphone.
In this embodiment, at least two computer devices perform voice communication, such as a teleconference, a hands-free car call, or VoIP (Voice over Internet Protocol).
In order to distinguish different computer devices, a computer device to which the present embodiment is applied may be referred to as a local-end computer device, a user using the local-end computer device to perform voice communication may be referred to as a local-end user, another computer device performing voice communication with the local-end computer device may be referred to as an opposite-end computer device, and a user using the opposite-end computer device to perform voice communication may be referred to as an opposite-end user.
On the one hand, the local-end computer device receives the audio signal sent by the opposite-end computer device and plays it through the speaker.
On the other hand, the local-end computer device receives an audio signal through the microphone and transmits it to the opposite-end computer device.
In a specific implementation, the audio signal played by the speaker of the local-end computer device reaches the microphone of that device through multiple reflections in space, so the audio signal received by the local-end computer device has an echo signal.
When the opposite-end user makes a sound, the audio signal sent by the opposite-end computer device to the local-end computer device contains the voice signal of the opposite-end user, so the echo signal received after the local-end computer device plays that audio signal contains the voice signal of the opposite-end user.
When the opposite-end user does not make a sound, the audio signal sent by the opposite-end computer device to the local-end computer device does not contain the voice signal of the opposite-end user, so the echo signal received after the local-end computer device plays that audio signal does not contain the voice signal of the opposite-end user.
Of course, during voice communication the local-end user may also make a sound and there may be environmental noise, so besides the echo signal, the audio signal received by the local-end computer device may include the voice signal of the local-end user and may also include a noise signal.
S102, determining the power of the echo signal.
In this embodiment, the local computer device may determine the power of the echo signal contained therein for the currently received audio signal.
It should be noted that, in the process of voice communication, the local computer device continuously receives the audio signal and continuously determines the power of the echo signal in the current audio signal.
S103, determining a threshold value of the current detection double-talk state according to the power of the echo signal.
In this embodiment, the local computer device calculates a threshold value suitable for performing double-talk detection on the current voice communication by using the real-time characteristic of the echo signal.
Double-talk detection is a technique for detecting the double-talk state.
The double-talk state means that the local-end user and the opposite-end user talk at the same time.
Further, in the voice communication process, the transmission of the audio signal has a certain delay, but the delay is short. Therefore, if the audio signal received by the local-end computer device includes both the voice signal of the opposite-end user and the voice signal of the local-end user, the two users can be considered to speak at the same time, and it is determined that the double-talk state exists.
S104, determining the power of the audio signal.
In this embodiment, the local computer device may determine the power of the currently received audio signal.
It should be noted that, in the process of voice communication, the local computer device continuously receives the audio signal and continuously determines the current power of the audio signal.
S105, if the power of the audio signal is greater than the threshold value, determining that the double-talk state exists in the voice communication.
The power of the audio signal is compared with the threshold value calculated from the echo signal, where the echo signal may contain the voice signal of the opposite-end user. If the power of the audio signal is greater than the threshold value, the audio signal contains the voice signal of the local-end user in addition to the voice signal of the opposite-end user, so it can be determined that the double-talk state exists in the voice communication.
Conversely, if the power of the audio signal is less than or equal to the threshold value, the audio signal contains no voice signal of the local-end user beyond the voice signal of the opposite-end user, and it can be determined that the double-talk state does not exist in the voice communication.
In this embodiment of the present invention, when voice communication is performed, an audio signal having an echo signal is received from a microphone, the power of the echo signal is determined, and the threshold value currently used for detecting the double-talk state is determined according to that power. The power of the audio signal is then determined. If the power of the audio signal is greater than the threshold value, the audio signal contains the voice signal of the local-end user in addition to the voice signal of the opposite-end user, and it can be determined that the double-talk state exists in the voice communication. Because the sound emitted by the opposite-end user changes constantly, the threshold value for double-talk detection is generated from the statistical characteristics of the echo signal and adapts dynamically to that sound. When different levels of interference or noise exist in the environment, the threshold value still adapts to the environment and remains accurate, so that double-talk detection using the threshold value maintains a low false alarm probability.
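As an illustrative sketch only, the per-frame decision of steps S103 to S105 can be written in Python as below. The function name `is_double_talk` and the `margin` parameter are hypothetical; in particular, modeling the threshold as a fixed multiple of the echo power is an assumption made here to keep the example short, not the threshold rule of the embodiment.

```python
def is_double_talk(audio_power: float, echo_power: float, margin: float = 2.0) -> bool:
    """Decide the double-talk state for one frame (S103 to S105).

    `margin` is a hypothetical scale factor: the threshold is modeled as
    margin * echo_power, so it rises and falls with the echo power.
    """
    threshold = margin * echo_power   # S103: threshold derived from the echo power
    return audio_power > threshold    # S105: strictly greater means double talk


# Echo only: the microphone power stays close to the echo power.
print(is_double_talk(audio_power=1.1, echo_power=1.0))   # False
# Echo plus local speech: the microphone power clearly exceeds the threshold.
print(is_double_talk(audio_power=3.5, echo_power=1.0))   # True
```

Because the threshold is recomputed from the current echo power for every frame, a louder opposite-end user raises the threshold instead of triggering a false alarm.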
Example two
Fig. 2 is a flowchart of a method for detecting a double talk according to a second embodiment of the present invention, and the present embodiment further details the processing operations for calculating the average power and the threshold value based on the foregoing embodiments. The method specifically comprises the following steps:
S201, when voice communication is carried out, an audio signal is received from a microphone.
Wherein the audio signal has an echo signal.
S202, determining a reference audio signal.
The reference audio signal is an audio signal that can be used to estimate the power of the echo signal.
In one case, the audio signal to be played is captured by internal circuitry before it is played by the speaker, and is set as the reference audio signal.
For the local-end computer device, after the speaker plays the audio signal, the emitted audio signal reaches the microphone through multiple reflections in space. Although the audio signal is attenuated by the reflections, its overall content is unchanged. Because sound propagates quickly and the space (such as a room or a vehicle interior) is small, the time the audio signal takes to travel from the speaker to the microphone is short and can be ignored. The audio signal received by the microphone is therefore similar to the audio signal played by the speaker, which ensures the accuracy of the subsequent estimate of the power of the echo signal.
S203, determining the average power of the reference audio signal.
In a specific implementation, assuming that the n-th frame of the reference audio signal is $S(n)$, the average power of the reference audio signal may be represented as:
$$\delta_s(n) = \lambda\,\delta_s(n-1) + (1-\lambda)\,|S(n)|^2$$
where $\delta_s$ denotes the average power of the reference audio signal and $\lambda$ denotes a forgetting factor, which is a preset constant.
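The recursive average above can be sketched in Python as follows. This is a minimal illustration: the function name `update_average_power`, the default forgetting factor of 0.9, and the reading of $|S(n)|^2$ as the sum of squared samples of the frame are all assumptions made for the example.

```python
from typing import Sequence


def update_average_power(prev_power: float, frame: Sequence[float],
                         forgetting: float = 0.9) -> float:
    """One step of the recursion: lambda * previous + (1 - lambda) * |S(n)|^2."""
    # Assumption: |S(n)|^2 is taken as the sum of squared samples in the frame.
    frame_energy = sum(sample * sample for sample in frame)
    return forgetting * prev_power + (1.0 - forgetting) * frame_energy


# Run the recursion over a few short, made-up frames.
power = 0.0
for frame in ([0.1, -0.2, 0.1], [0.3, 0.2, -0.1], [0.0, 0.1, 0.0]):
    power = update_average_power(power, frame)
print(power)
```

A forgetting factor close to 1 gives a smooth, slowly varying estimate, while a smaller value tracks changes in the reference signal more quickly.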
S204, performing attenuation gain on the average power of the reference audio signal to serve as the power of the echo signal.
After the average power $\delta_s$ of the reference audio signal is obtained, the power of the echo signal can be estimated as:
$$\delta_x(n) = G\,\delta_s(n)$$
where $\delta_x$ denotes the power of the echo signal and $G$ denotes the attenuation gain.
Further, the audio signal played by the speaker reaches the microphone after being spatially attenuated to form an echo signal. Generally, when the distance between the speaker and the microphone is fixed, there is a relatively fixed attenuation gain between the echo signal and the reference audio signal (the audio signal played by the speaker).
Thus, an echo path between the microphone and the speaker, e.g. a Loudspeaker-Room-Microphone (LRM) path, can be determined.
An attenuation gain is then applied to the average power of the reference audio signal according to the echo path, and the result is used as the power of the echo signal.
In general, each time the distance between the microphone and the speaker doubles, the gain is attenuated by about 6 dB.
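A rough Python sketch of this step follows. It assumes that the 6 dB figure applies per doubling of distance relative to a reference distance at which the gain is known; the names `attenuation_gain`, `echo_power`, `ref_distance` and `ref_gain` are hypothetical.

```python
import math


def attenuation_gain(distance: float, ref_distance: float = 1.0,
                     ref_gain: float = 1.0) -> float:
    """Linear power gain G, losing 6 dB per doubling of distance (assumed model)."""
    if distance <= 0.0 or ref_distance <= 0.0:
        raise ValueError("distances must be positive")
    doublings = math.log2(distance / ref_distance)
    attenuation_db = 6.0 * doublings
    # A level change in dB maps to a power ratio of 10^(dB / 10).
    return ref_gain * 10.0 ** (-attenuation_db / 10.0)


def echo_power(reference_power: float, gain: float) -> float:
    """Echo power estimate: the attenuation gain times the reference average power."""
    return gain * reference_power


gain = attenuation_gain(distance=2.0)   # one doubling, about 0.25 in linear power
print(gain, echo_power(reference_power=0.8, gain=gain))
```

When the speaker and microphone are fixed in one housing, the gain can be computed once and reused for every frame.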
In the embodiment of the invention, an attenuation gain is applied to the average power of the reference audio signal to obtain the power of the echo signal. Because the echo signal is estimated from the real reference audio signal, the complexity of the computation can be reduced while the estimate of the echo signal remains accurate.
Of course, the above way of calculating the power of the echo signal is only an example. When this embodiment is implemented, other ways of calculating the power of the echo signal may be set according to actual situations or adopted by a person skilled in the art according to actual needs, which is not limited in this embodiment.
For example, the spatial impulse response may be expressed as:
Figure BDA0002243446270000111
where
Figure BDA0002243446270000112
is the impulse response, $y$ is the audio signal received by the microphone, and $x$ is the reference audio signal.
At the present moment, the echo signal can be expressed as:
Figure BDA0002243446270000113
where $y_e$ is the echo signal.
The average power of the echo signal at time $n$ can be expressed as:
$$P_e(n) = \lambda\,P_e(n-1) + (1-\lambda)\,|y_e(n)|^2$$
where $P_e$ denotes the average power of the echo signal and $\lambda$ denotes the forgetting factor.
S205, determining a target value of the false alarm probability of the double-talk state.
S206, under the limit of the target value, determining the threshold value of the current detection double-talk state based on the power of the echo signal.
In a specific implementation, the quality of double-talk detection can be measured in the form of probabilities.
The false alarm probability is the probability that the double-talk state is judged to exist when it does not actually exist.
In the present embodiment, a target value may be set in advance.
When voice communication is carried out, the false alarm probability is fixed at the target value. Under this limit, the power of the echo signal is used to calculate the threshold value currently used for detecting the double-talk state, so that other indexes (such as the detection probability and the undetected probability) meet expected conditions (such as the maximum detection probability and the minimum undetected probability).
The detection probability is the probability that the system judges the double-talk state to exist when it actually exists.
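To make these definitions concrete, the following Python sketch estimates the false alarm probability and the detection probability empirically from per-frame decisions. It assumes that ground-truth labels are available for evaluation; the names `detection_rates`, `decisions` and `truths` are hypothetical.

```python
from typing import Sequence, Tuple


def detection_rates(decisions: Sequence[bool],
                    truths: Sequence[bool]) -> Tuple[float, float]:
    """Return (false alarm probability, detection probability).

    decisions[i] is True if the detector reported double talk in frame i;
    truths[i] is True if double talk was actually present in frame i.
    """
    if len(decisions) != len(truths):
        raise ValueError("decisions and truths must have the same length")
    false_alarms = sum(1 for d, t in zip(decisions, truths) if d and not t)
    hits = sum(1 for d, t in zip(decisions, truths) if d and t)
    negatives = sum(1 for t in truths if not t)
    positives = len(truths) - negatives
    p_false_alarm = false_alarms / negatives if negatives else 0.0
    p_detection = hits / positives if positives else 0.0
    return p_false_alarm, p_detection


p_fa, p_d = detection_rates(
    decisions=[True, False, True, False],
    truths=[True, True, False, False],
)
print(p_fa, p_d)   # 0.5 0.5
```

The undetected probability is the complement of the detection probability, that is, `1 - p_d`.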
In the embodiment of the invention, the target value of the false alarm probability of the double-talk state is determined, and the threshold value currently used for detecting the double-talk state is determined based on the average power under the limitation of the target value. Because the threshold value is calculated with the false alarm probability limited, double-talk detection subsequently performed with the threshold value meets the expected conditions, which ensures the quality of double-talk detection.
Of course, besides limiting the false alarm probability, this embodiment may also determine the threshold value currently used for detecting the double-talk state based on the average power under the limitation of other indexes (such as the detection probability and the undetected probability), so that the false alarm probability meets an expected condition (such as the minimum false alarm probability), which is not limited in this embodiment.
In one embodiment of the present invention, S206 includes:
S2061, decomposing the power of the echo signal into a variance and an average value.
S2062, determining the threshold value of the current detection double-talk state based on the variance and the average value, so that the false alarm probability of detecting the double-talk state based on the threshold value is lower than the target value.
The threshold value should reflect the power value of the current echo signal, and the observed value of any signal can be decomposed into a representation mode of mean value and variance. When the power of the echo signal cannot be accurately obtained, the power of the current echo signal can be characterized by the mean value and the standard deviation of the power of the echo signal contained in the observation signal.
At the same time, a confidence level is set for the threshold value.
When the false alarm probability is set to 0, no echo signal may be detected as the double-talk state, and the threshold value should be very large.
When the false alarm probability is set to 1, any echo signal is detected as the double-talk state, and the threshold value should be very small.
Therefore, in this embodiment, the false alarm probability can be set to a smaller target value, so that the false alarm probability for detecting the dual-talk state based on the threshold value is lower than the target value.
In the embodiment of the invention, the power of the echo signal is decomposed into the variance and the average value, the threshold value of the current detection double-talk state is determined based on the variance and the average value, so that the false alarm probability of the double-talk state detected based on the threshold value is lower than the target value, the threshold value of the double-talk detection is calculated by combining the first-order and second-order statistical characteristics of the echo signal, the environment condition can be reflected really, and the accuracy of the threshold value adapting to the current environment is improved.
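One way to turn a mean, a variance and a target false alarm probability into a threshold is sketched below in Python. It assumes that, in the absence of double talk, the observed power is approximately normally distributed with the given mean and variance; the function name `threshold_for_false_alarm` is hypothetical, and the formula is a standard Gaussian quantile rule rather than the exact expression of the embodiment.

```python
import math
from statistics import NormalDist


def threshold_for_false_alarm(mean: float, variance: float, target_pf: float) -> float:
    """Threshold gamma with P(power > gamma | echo only) = target_pf.

    Assumes the echo-only power is roughly Gaussian with the given mean and variance.
    """
    if not 0.0 < target_pf < 1.0:
        raise ValueError("target_pf must lie strictly between 0 and 1")
    if variance < 0.0:
        raise ValueError("variance must be non-negative")
    # Quantile of the standard normal at 1 - target_pf (inverse Q-function).
    z = NormalDist().inv_cdf(1.0 - target_pf)
    return mean + z * math.sqrt(variance)


gamma = threshold_for_false_alarm(mean=1.0, variance=0.04, target_pf=0.01)
print(gamma)
```

Since the mean and variance are re-estimated as the echo signal changes, the threshold follows the current conditions while the target false alarm probability stays fixed.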
In a specific implementation, in order to detect whether the voice communication has the double-talk state, each computer device performs N energy samplings in a detection time slot, and the energy value of the j-th frame echo signal is as follows:
Figure BDA0002243446270000131
where h_j(i) is the i-th sample value of the j-th frame echo signal, and n_j(i) is the noise signal energy of the j-th frame echo signal at the i-th sampling. To simplify the model, n_j(i) is regarded as additive white noise with mean 0 and variance
Figure BDA0002243446270000132
n_j(i) and h_j s(i) are independent of each other.
H_0 and H_1 denote the two hypotheses: the absence and the presence of the double-talk state, respectively.
From formula (1), under H_0, Y_j follows a central chi-squared distribution of length N (the echo signal length); under H_1, Y_j follows a non-central chi-squared distribution of length N (the echo signal length), as follows:
Figure BDA0002243446270000133
where λ_j = N μ_j, and μ_j is the signal-to-noise ratio of the received signal,
Figure BDA0002243446270000134
According to the central limit theorem, when N is sufficiently large, Y_j approximately follows a normal distribution, defined as follows:
Figure BDA0002243446270000141
The values of all echo signal frames are fused to obtain a decision value Z_c, defined as follows:
Figure BDA0002243446270000142
where w_j is the weighting coefficient of the j-th frame echo signal,
Figure BDA0002243446270000143
Because each Y_j (j = 1, 2, …, J) is a random variable that follows a normal distribution, Z_c is also a random variable that follows a normal distribution, defined as follows:
Figure BDA0002243446270000144
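The fusion step can be sketched as a weighted sum. The sketch assumes Z_c is the linear combination of the per-frame values Y_j with weights w_j, and that the frames are independent, so the mean and variance of Z_c follow from the usual rules for linear combinations of normal variables. The exact constraint on w_j is given in a figure that is not reproduced here, so no normalization is enforced. The names `fuse_frames` and `fused_distribution` are hypothetical.

```python
def fuse_frames(energies, weights):
    """Weighted fusion Z_c = sum_j w_j * Y_j over J frames."""
    if len(energies) != len(weights):
        raise ValueError("one weight is required per frame")
    return sum(w * y for w, y in zip(weights, energies))


def fused_distribution(means, variances, weights):
    """Mean and variance of Z_c when Y_j ~ N(means[j], variances[j]).

    Assumes the Y_j are mutually independent (an assumption of this
    sketch), so the variances add with squared weights.
    """
    if not (len(means) == len(variances) == len(weights)):
        raise ValueError("inputs must have the same length")
    mean = sum(w * m for w, m in zip(weights, means))
    variance = sum(w * w * v for w, v in zip(weights, variances))
    return mean, variance
```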
Comparing the decision value Z_c with the threshold value γ_c yields the decision result, defined as follows:
Figure BDA0002243446270000145
From formulas (4) to (6), the false alarm probability P_f and the missed detection probability P_m can be obtained, defined as follows:
Figure BDA0002243446270000146
Figure BDA0002243446270000147
where
Figure BDA0002243446270000148
The false alarm probability P_f determines the maximum frequency utilization. When the false alarm probability P_f and the missed detection probability P_m are unknown, the Neyman-Pearson decision criterion can be introduced, so that the detection performance problem of the system is equivalent to minimizing the missed detection probability P_m while keeping the false alarm probability P_f small, or minimizing the false alarm probability P_f while keeping the missed detection probability P_m small.
In this embodiment, a small false alarm probability P_f can be preset. With P_f known, the threshold value γ_c is obtained from formula (7) as follows:
Figure BDA0002243446270000151
where
Figure BDA0002243446270000152
That is, formula (9) can be converted to:
Figure BDA0002243446270000153
where
Figure BDA0002243446270000154
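For orientation, the general shape of a Neyman-Pearson threshold under the normal approximation can be written out. This is the textbook Gaussian form, not a transcription of formulas (7) to (10), which appear only as figures. The symbols m_0, s_0, m_1 and s_1 (the mean and standard deviation of Z_c under H_0 and H_1) are introduced here for illustration and are assumptions.

```latex
\begin{aligned}
Z_c \mid H_0 &\sim \mathcal{N}(m_0,\, s_0^2), \qquad
Z_c \mid H_1 \sim \mathcal{N}(m_1,\, s_1^2),\\
P_f &= \Pr(Z_c > \gamma_c \mid H_0) = Q\!\left(\frac{\gamma_c - m_0}{s_0}\right),\\
P_m &= \Pr(Z_c \le \gamma_c \mid H_1) = 1 - Q\!\left(\frac{\gamma_c - m_1}{s_1}\right),\\
Q(x) &= \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2}\,dt,\\
\gamma_c &= m_0 + s_0\, Q^{-1}(P_f).
\end{aligned}
```

Fixing P_f therefore fixes γ_c through the mean and standard deviation of the decision value alone, which is why a threshold can be set without knowing P_m.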
Of course, the above manner of calculating the threshold value is only an example. When implementing this embodiment, other manners of calculating the threshold value may be set according to actual situations or needs. For example, the average power may be taken as a regularization term and the threshold value for currently detecting the double-talk state determined based on that term, or an objective function may be constructed with the average power and optimized to calculate the threshold value, in either case so that the false alarm probability of detecting the double-talk state based on the threshold value is lower than the target value. This embodiment is not limited in this respect.
S207, determining the power of the audio signal.
S208, if the power of the audio signal is greater than the threshold value, determining that the voice communication has the double-talk state.
In a specific implementation, Y(n) = (y_n, y_{n-1}, …, y_{n-N+1})^T may be used to denote the signal vector of the currently received audio signal; the vector contains the observations from the current time instant back over the past N-1 time instants.
Y^H(n)Y(n) represents the power of the currently received audio signal, and the threshold value γ_c(n) represents the power of the echo signal in the currently received signal vector.
Then Y^H(n)Y(n) is compared with γ_c(n): when Y^H(n)Y(n) > γ_c(n), the computer device is judged to be in the double-talk state; otherwise, the double-talk state is judged not to exist.
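The comparison in S207 and S208 can be sketched in a few lines. The power Y^H(n)Y(n) is computed as the sum of squared magnitudes of the N most recent samples, which also covers complex-valued samples. The function names are hypothetical, and the threshold is assumed to have been computed elsewhere.

```python
def frame_power(samples):
    """Y^H(n) Y(n): sum of squared magnitudes of the signal vector."""
    return sum(abs(y) ** 2 for y in samples)


def is_double_talk(samples, threshold):
    """Declare double-talk when the frame power exceeds the threshold."""
    return frame_power(samples) > threshold
```

For example, `is_double_talk([0.5, -0.5, 0.5, -0.5], 0.9)` compares a power of 1.0 with 0.9 and reports double-talk.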
S209, performing echo cancellation on the audio signal according to the double-talk state.
If the voice communication has a double-talk state, i.e. multiple parties talk simultaneously during the voice communication, the audio signal collected from the microphone of the computer device includes the voice signal (i.e. echo signal) of the opposite end and the voice signal of the local end, which are mixed together.
When echo cancellation is performed in a dual-talk state, on one hand, the voice signal of the local terminal is protected from being damaged, and on the other hand, the voice signal (i.e., echo signal) of the opposite terminal is removed as much as possible.
Generally, in the case that the voice signal (i.e. echo signal) of the opposite end is higher than the voice signal of the local end by about 6dB to 8dB, if the voice signal (i.e. echo signal) of the opposite end is to be removed, the voice signal of the local end is damaged more or less.
In addition, if the voice signal (i.e., echo signal) of the opposite end is higher than the voice signal of the home end by more than 18dB, for example, the speaker is closer to the microphone, and the voice signal (i.e., echo signal) of the opposite end masks the voice signal of the home end, the echo cancellation effect is not obvious. In this case, it is possible to eliminate the voice signal (i.e., echo signal) of the opposite end together with the voice signal of the home end, and then appropriately fill the comfort noise.
Further, echo cancellation includes the following two steps:
1. linear adaptive filtering
The linear adaptive filtering solves fe = f(fs), establishes a speech model of the voice signal (i.e., echo signal) of the opposite end, and performs a first round of echo cancellation,
where fs is the far-end signal and fe is the far-end echo.
The echo path can be regarded as an 'environment filter': after passing through the 'environment filter', the far-end speech signal becomes the far-end echo signal. Echo cancellation constructs an 'algorithm filter' to model it.
Based on the voice model of the voice signal (i.e., echo signal) of the opposite end, the coefficients of the 'algorithm filter' are continuously adjusted so that its estimate approaches the real echo signal; the closer the estimate is to the real echo signal, the better the echo cancellation effect.
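A linear adaptive filter of this kind can be sketched with the normalized least-mean-squares (NLMS) update. The source does not name the adaptation algorithm, so NLMS is an assumption chosen for illustration, as are the function name `nlms_echo_cancel` and the parameters `taps`, `step` and `eps`.

```python
def nlms_echo_cancel(far_end, mic, taps=8, step=0.5, eps=1e-8):
    """First-round echo cancellation with an NLMS adaptive filter.

    far_end: reference samples fs played by the loudspeaker.
    mic:     microphone samples containing the echo fe = f(fs).
    Returns the residual signal and the learned filter coefficients.
    """
    weights = [0.0] * taps   # coefficients of the 'algorithm filter'
    history = [0.0] * taps   # most recent far-end sample first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate             # microphone minus echo estimate
        norm = sum(h * h for h in history) + eps
        # Move the coefficients toward the echo path, scaled by input power.
        weights = [w + step * error * h / norm
                   for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights
```

With a noise-free microphone signal and no near-end speech, the learned coefficients approach the impulse response of the simulated echo path and the residual decays toward zero.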
2. Non-linear processing
The nonlinear processing is divided into two steps: residual echo processing and non-linear clipping processing.
The residual echo processing performs a second round of echo cancellation to process the residual echo.
The strategy of residual echo cancellation is to further cancel the residual echo by utilizing the correlation between the residual echo processed by the adaptive filter and the far-end reference audio signal. The greater the correlation, the more the residual echo is, the greater the degree of further cancellation of the residual echo is required; conversely, the smaller the correlation, the less the residual echo, and the smaller the degree to which the residual echo needs to be further cancelled. Therefore, firstly, a correlation matrix of the residual echo and the reference audio signal is calculated to obtain an attenuation factor reflecting the elimination degree; the residual echo is then multiplied by an attenuation factor to further cancel the residual echo.
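The residual echo strategy can be sketched as follows. The source speaks of a correlation matrix; this sketch simplifies it to a single normalized cross-correlation coefficient per frame and maps it to an attenuation factor of 1 - |correlation|, so that a higher correlation gives stronger suppression. Both the simplification and the mapping are assumptions, and the function names are hypothetical.

```python
import math


def attenuation_factor(residual, reference):
    """Map the residual/reference correlation to a gain in [0, 1]."""
    cross = sum(r * x for r, x in zip(residual, reference))
    energy = math.sqrt(sum(r * r for r in residual)
                       * sum(x * x for x in reference))
    if energy == 0.0:
        return 1.0  # nothing to correlate against, leave the signal alone
    correlation = min(abs(cross) / energy, 1.0)
    return 1.0 - correlation


def suppress_residual(residual, reference):
    """Multiply the residual echo by the attenuation factor."""
    gain = attenuation_factor(residual, reference)
    return [gain * r for r in residual]
```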
The nonlinear clipping process is a process of clipping a speech signal whose attenuation amount reaches a threshold value.
If the voice is in the double-talk state, the linear adaptive filter cancels the echo signal on the premise of not damaging the near-end voice quality as much as possible, the suppression amount of the echo signal is not too large, and therefore the attenuation factor is relatively large.
It should be noted that echo suppression keeps running. When the computer device at the opposite end sends no voice signal but a mute packet, there is no echo to suppress, so echo suppression is effectively idle, and the voice signal of the local end passes through echo suppression to the computer device at the opposite end unaffected by the result of the double-talk detection.
In this embodiment, if it is detected that the voice communication has the double-talk state, the voice signal is subjected to echo cancellation in a targeted manner, so as to accurately detect the human voice, improve the recovery quality of the echo cancellation on the human voice signal, and thus improve the performance of the echo cancellation.
Example three
Fig. 3 is a schematic structural diagram of a dual-talk detection apparatus provided in the third embodiment of the present invention, where the apparatus may specifically include the following modules:
an audio signal receiving module 301, configured to receive an audio signal from a microphone when performing voice communication, where the audio signal has an echo signal;
a first power determination module 302 for determining the power of the echo signal;
a threshold value determining module 303, configured to determine a threshold value of a currently detected dual-talk state according to the power of the echo signal;
a second power determination module 304 for determining the power of the audio signal;
a dual-talk state determining module 305, configured to determine that the voice communication exists in the dual-talk state if the power of the audio signal is greater than the threshold value.
In one embodiment of the present invention, the first power determining module 302 comprises:
a reference audio signal determination submodule for determining a reference audio signal;
a reference average power determination sub-module for determining an average power of the reference audio signal;
an attenuation gain submodule for applying an attenuation gain to the average power of the reference audio signal, the result being used as the power of the echo signal.
In one embodiment of the invention, the reference audio signal determination submodule comprises:
an audio signal acquisition unit for acquiring the audio signal to be played from the loudspeaker as the reference audio signal.
In one embodiment of the invention, the attenuation gain sub-module comprises:
an echo path determination unit for determining an echo path between the microphone and the speaker;
an echo path attenuation unit for applying an attenuation gain to the average power of the reference audio signal according to the echo path, the result being used as the power of the echo signal.
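The first power determination described above can be sketched as the mean square of the reference frame scaled by an echo path gain. Expressing the echo path as a single gain in decibels is an assumption of this sketch, since the source does not state how the path attenuation is represented; the function names are hypothetical.

```python
def average_power(samples):
    """Mean square value of a frame of samples."""
    if not samples:
        raise ValueError("frame must contain at least one sample")
    return sum(x * x for x in samples) / len(samples)


def echo_power_estimate(reference, path_gain_db):
    """Attenuate the reference power by the echo path gain (in dB).

    A negative path_gain_db models loss between loudspeaker and
    microphone; the dB representation is assumed, not from the source.
    """
    linear_gain = 10.0 ** (path_gain_db / 10.0)  # dB to power ratio
    return average_power(reference) * linear_gain
```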
In an embodiment of the present invention, the threshold value determining module 303 includes:
a false alarm probability determination submodule for determining a target value of the false alarm probability for detecting the double-talk state;
a threshold value calculation sub-module for determining the threshold value for currently detecting the double-talk state based on the power of the echo signal under the limit of the target value.
In an embodiment of the present invention, the threshold value calculation sub-module includes:
a power decomposition unit for decomposing the power of the echo signal into a variance and an average;
a limit calculation unit for determining the threshold value for currently detecting the double-talk state based on the variance and the average value, so that the false alarm probability of detecting the double-talk state based on the threshold value is lower than the target value.
In one embodiment of the present invention, further comprising:
an echo cancellation module for performing echo cancellation on the audio signal according to the double-talk state.
The double-talk detection device provided by the embodiment of the invention can execute the double-talk detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
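The module structure can be sketched as one class whose methods mirror modules 301 to 305. This is an illustrative arrangement under stated assumptions: the echo path is represented by a known linear power gain, the echo power is treated as approximately normal over a sliding window of frames, and both powers are mean squares so they are directly comparable. The class name, method names and parameters are hypothetical.

```python
from collections import deque
from statistics import NormalDist, fmean, pstdev


class DoubleTalkDetector:
    def __init__(self, path_gain, pf_target=0.01, history=50):
        self.path_gain = path_gain  # linear power gain of the echo path
        self.z = NormalDist().inv_cdf(1.0 - pf_target)
        self.echo_powers = deque(maxlen=history)

    def first_power(self, reference):
        """Module 302: attenuated average power of the reference frame."""
        power = self.path_gain * sum(x * x for x in reference) / len(reference)
        self.echo_powers.append(power)
        return power

    def threshold(self):
        """Module 303: mean plus z times the standard deviation."""
        return fmean(self.echo_powers) + self.z * pstdev(self.echo_powers)

    def second_power(self, mic):
        """Module 304: average power of the microphone frame."""
        return sum(y * y for y in mic) / len(mic)

    def process(self, mic, reference):
        """Modules 301 and 305: take one frame pair and decide."""
        self.first_power(reference)
        return self.second_power(mic) > self.threshold()
```

Calling `process` once per frame pair keeps the threshold tracking the recent echo level, while the decision itself remains a single comparison.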
Example four
Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. As shown in fig. 4, the computer apparatus includes a processor 400, a memory 401, a communication module 402, an input device 403, and an output device 404; the number of processors 400 in the computer device may be one or more, and one processor 400 is taken as an example in fig. 4; the processor 400, the memory 401, the communication module 402, the input device 403 and the output device 404 in the computer apparatus may be connected by a bus or other means, and fig. 4 illustrates an example of connection by a bus.
The memory 401 is used as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as the modules corresponding to the dual talk detection method in the present embodiment (for example, the audio signal receiving module 301, the first power determining module 302, the threshold value determining module 303, the second power determining module 304, and the dual talk state determining module 305 in the dual talk detection apparatus shown in fig. 3). The processor 400 executes various functional applications and data processing of the computer device by executing software programs, instructions and modules stored in the memory 401, that is, the above-mentioned double talk detection method is realized.
The memory 401 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 401 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 401 may further include memory located remotely from processor 400, which may be connected to a computer device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The communication module 402 is used to establish a connection with the display screen and to exchange data with it.
The input device 403 may include a microphone or the like, and may be used to receive input audio signals, numeric or character information, and to generate key signal inputs related to user settings and function control of the computer apparatus.
The output device 404 may include a speaker or the like, which may be used to output audio signals.
The computer device provided by the embodiment of the invention can execute the double talk detection method provided by any embodiment of the invention, and has corresponding functions and beneficial effects.
Example five
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a method for detecting a double talk, and the method includes:
receiving an audio signal from a microphone when performing voice communication, the audio signal having an echo signal;
determining a power of the echo signal;
determining a threshold value of a current detection double-talk state according to the power of the echo signal;
determining a power of the audio signal;
and if the power of the audio signal is greater than the threshold value, determining that the voice communication has the double-talk state.
Of course, the computer program of the computer-readable storage medium provided in the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the double talk detection method provided in any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the dual-talk detection apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (7)

1. A double-talk detection method, comprising:
receiving an audio signal from a microphone when performing voice communication, the audio signal having an echo signal;
determining a power of the echo signal, comprising: determining a reference audio signal; determining an average power of the reference audio signal; performing attenuation gain on the average power of the reference audio signal as the power of the echo signal;
determining a threshold value of a current detection double-talk state according to the power of the echo signal;
determining a power of the audio signal;
if the power of the audio signal is larger than the threshold value, determining that the voice communication has the double-talk state;
wherein, the determining the threshold value of the current detection dual-talk state according to the power of the echo signal comprises:
determining a target value of the false alarm probability of detecting the double-talk state;
under the limitation of the target value, determining a threshold value of the current detection dual-talk state based on the power, including: decomposing the power of the echo signal into a variance and a mean; and determining a threshold value of the current detection double-talk state based on the variance and the average value so as to enable the false alarm probability of the detection double-talk state based on the threshold value to be lower than the target value.
2. The method of claim 1, wherein the determining the reference audio signal comprises:
and collecting an audio signal to be played from a loudspeaker as a reference audio signal.
3. The method of claim 2, wherein the attenuating the average power of the reference audio signal by a gain as the power of the echo signal comprises:
determining an echo path between the microphone and the speaker;
and performing attenuation gain on the power of the reference audio signal according to the echo path to serve as the power of the echo signal.
4. The method of claim 1,2 or 3, further comprising:
and carrying out echo cancellation on the audio signal according to the double-talk state.
5. A dual talk detection device, comprising:
the audio signal receiving module is used for receiving an audio signal from a microphone when voice communication is carried out, wherein the audio signal has an echo signal;
a first power determination module for determining a power of the echo signal;
a threshold value determining module, configured to determine a threshold value of a currently detected dual-talk state according to the power of the echo signal;
a second power determination module for determining a power of the audio signal;
a double-talk state determination module, configured to determine that the voice communication has the double-talk state if the power of the audio signal is greater than the threshold value;
wherein the threshold value determining module comprises:
a false alarm probability determination submodule for determining a target value of the false alarm probability for detecting the double-talk state;
a threshold value calculation submodule, configured to determine a threshold value of a currently detected dual-talk state based on the power of the echo signal under the restriction of the target value;
wherein the first power determination module comprises:
a reference audio signal determination submodule for determining a reference audio signal;
a reference average power determination sub-module for determining an average power of the reference audio signal;
the attenuation gain submodule is used for carrying out attenuation gain on the average power of the reference audio signal to be used as the power of the echo signal;
wherein, the threshold value calculation sub-module comprises:
a power decomposition unit for decomposing the power of the echo signal into a variance and an average;
and the limit calculation unit is used for determining a threshold value of the current detection double-talk state based on the variance and the average value so as to enable the false alarm probability of the detection double-talk state based on the threshold value to be lower than the target value.
6. A computer device, characterized in that the computer device comprises:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the double-talk detection method of any one of claims 1-4.
7. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the double talk detection method according to any one of claims 1 to 4.
CN201911008388.9A 2019-10-22 2019-10-22 Double-talk detection method and device, computer equipment and storage medium Active CN110634496B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911008388.9A CN110634496B (en) 2019-10-22 2019-10-22 Double-talk detection method and device, computer equipment and storage medium
PCT/CN2019/127730 WO2021077599A1 (en) 2019-10-22 2019-12-24 Double-talk detection method and apparatus, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911008388.9A CN110634496B (en) 2019-10-22 2019-10-22 Double-talk detection method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110634496A CN110634496A (en) 2019-12-31
CN110634496B true CN110634496B (en) 2021-12-24

Family

ID=68977217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911008388.9A Active CN110634496B (en) 2019-10-22 2019-10-22 Double-talk detection method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110634496B (en)
WO (1) WO2021077599A1 (en)


Also Published As

Publication number Publication date
WO2021077599A1 (en) 2021-04-29
CN110634496A (en) 2019-12-31


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant