CN110634496B

CN110634496B - Double-talk detection method and device, computer equipment and storage medium

Info

Publication number: CN110634496B
Application number: CN201911008388.9A
Authority: CN
Inventors: 王亮亮
Original assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd
Current assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date: 2019-10-22
Filing date: 2019-10-22
Publication date: 2021-12-24
Anticipated expiration: 2039-10-22
Also published as: WO2021077599A1; CN110634496A

Abstract

The embodiment of the invention discloses a double-talk detection method and device, computer equipment and a storage medium. The method comprises the following steps: receiving an audio signal from a microphone when performing voice communication, the audio signal having an echo signal; determining a power of the echo signal; determining a threshold value of a current detection double-talk state according to the power of the echo signal; determining a power of the audio signal; and if the power of the audio signal is greater than the threshold value, determining that the voice communication has the double-talk state. By combining the statistical characteristics of the echo signal, the threshold value for double-talk detection adapts dynamically to the state of the sound emitted by the opposite-end user. When different levels of interference or noise exist in the environment, the threshold value still adapts to the environment state and remains accurate, so that double-talk detection using the threshold value maintains a low false alarm probability.

Description

Double-talk detection method and device, computer equipment and storage medium

Technical Field

The embodiment of the invention relates to the technology of audio processing, in particular to a double-talk detection method, a double-talk detection device, computer equipment and a storage medium.

Background

In an audio device having both a loudspeaker and a microphone, an audio signal emitted by the loudspeaker reaches the microphone via multiple reflections in space to form an echo signal.

For voice communication such as video conferencing, echo signals may seriously impair communication quality and reduce the speech recognition rate, especially in the presence of double-talk. Echo cancellation therefore generally includes a double-talk detection function in order to ensure the performance of the filter under double-talk conditions.

At present, double-talk detection based on correlation or energy detection uses a static threshold value as the basis for judgment. When different levels of interference or noise exist in the environment, the static threshold value causes the double-talk detection to have a higher false alarm probability.

Disclosure of Invention

The embodiment of the invention provides a double-talk detection method, a double-talk detection device, computer equipment and a storage medium, which aim to solve the problem of high false alarm probability caused by using a static threshold value to carry out double-talk detection.

In a first aspect, an embodiment of the present invention provides a double-talk detection method, including:

receiving an audio signal from a microphone when performing voice communication, the audio signal having an echo signal;

determining a power of the echo signal;

determining a threshold value of a current detection double-talk state according to the power of the echo signal;

determining a power of the audio signal;

and if the power of the audio signal is greater than the threshold value, determining that the voice communication has the double-talk state.

Optionally, the determining the power of the echo signal includes:

determining a reference audio signal;

determining an average power of the reference audio signal;

and performing attenuation gain on the average power of the reference audio signal as the power of the echo signal.

Optionally, the determining the reference audio signal comprises:

and collecting an audio signal to be played from a loudspeaker as a reference audio signal.

Optionally, the performing attenuation gain on the average power of the reference audio signal as the power of the echo signal includes:

determining an echo path between the microphone and the speaker;

and performing attenuation gain on the average power of the reference audio signal according to the echo path to serve as the power of the echo signal.

Optionally, the determining the threshold value of the current detection double-talk state according to the power of the echo signal includes:

determining a target value of the false alarm probability of detecting the double-talk state;

and under the limitation of the target value, determining a threshold value of the current detection double-talk state based on the power of the echo signal.

Optionally, the determining a threshold value of the current detection double-talk state based on the power of the echo signal under the limitation of the target value includes:

decomposing the power of the echo signal into a variance and a mean;

and determining a threshold value of the current detection double-talk state based on the variance and the average value so as to enable the false alarm probability of the detection double-talk state based on the threshold value to be lower than the target value.

Optionally, the method further comprises:

and carrying out echo cancellation on the voice signal according to the double-talk state.

In a second aspect, an embodiment of the present invention further provides a double-talk detection apparatus, including:

the audio signal receiving module is used for receiving an audio signal from a microphone when voice communication is carried out, wherein the audio signal has an echo signal;

a first power determination module for determining a power of the echo signal;

a threshold value determining module, configured to determine a threshold value of the current detection double-talk state according to the power of the echo signal;

a second power determination module for determining a power of the audio signal;

and the double-talk state determining module is used for determining that the double-talk state exists in the voice communication if the power of the audio signal is greater than the threshold value.

Optionally, the first power determining module includes:

a reference audio signal determination submodule for determining a reference audio signal;

a reference average power determination sub-module for determining an average power of the reference audio signal;

and the attenuation gain submodule is used for carrying out attenuation gain on the average power of the reference audio signal to be used as the power of the echo signal.

Optionally, the reference audio signal determination sub-module comprises:

and the audio signal acquisition unit is used for acquiring the audio signal to be played from the loudspeaker as a reference audio signal.

Optionally, the attenuation gain sub-module comprises:

an echo path determination unit for determining an echo path between the microphone and the speaker;

and the echo path attenuation unit is used for carrying out attenuation gain on the average power of the reference audio signal according to the echo path, and the attenuation gain is used as the power of the echo signal.

Optionally, the threshold value determining module includes:

a false alarm probability determination submodule for determining a target value of the false alarm probability for detecting the double-talk state;

and the threshold value calculation sub-module is used for determining the threshold value of the current detection double-talk state based on the power of the echo signal under the limit of the target value.

Optionally, the threshold value calculation sub-module includes:

a power decomposition unit for decomposing the power of the echo signal into a variance and an average;

and the limit calculation unit is used for determining a threshold value of the current detection double-talk state based on the variance and the average value so as to enable the false alarm probability of the detection double-talk state based on the threshold value to be lower than the target value.

Optionally, the apparatus further comprises:

and the echo cancellation module is used for carrying out echo cancellation on the voice signal according to the double-talk state.

In a third aspect, an embodiment of the present invention further provides a computer device, where the computer device includes:

one or more processors;

a memory for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the double-talk detection method according to any one of the first aspect.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the double talk detection method according to any one of the first aspect.

In an embodiment of the present invention, when voice communication is performed, an audio signal having an echo signal is received from a microphone, the power of the echo signal is determined, and the threshold value of the current detection double-talk state is determined according to that power. The power of the audio signal is then determined. If the power of the audio signal is greater than the threshold value, the voice signal of the local-end user is present in addition to the voice signal of the opposite-end user, and it can be determined that the double-talk state exists in the voice communication. Because the state of the sound emitted by the opposite-end user changes constantly, the threshold value for double-talk detection is generated by dynamically adapting to that state in combination with the statistical characteristics of the echo signal. When different levels of interference or noise exist in the environment, the threshold value still adapts to the environment state and remains accurate, so that double-talk detection using the threshold value maintains a low false alarm probability.

Drawings

Fig. 1 is a flowchart of a double-talk detection method according to a first embodiment of the present invention;

Fig. 2 is a flowchart of a double-talk detection method according to a second embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a double-talk detection apparatus according to a third embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 1 is a flowchart of a double-talk detection method according to a first embodiment of the present invention. This embodiment is applicable to situations where a threshold value is set dynamically for double-talk detection. The method can be executed by a double-talk detection apparatus, which can be implemented by software and/or hardware and can be configured in a computer device such as a personal computer, a mobile terminal or a conference machine. The computer device is configured with a microphone and a speaker. The microphone is a transducer that converts a sound signal into an electrical signal, and the speaker (also called a loudspeaker) is a transducer that converts an electrical signal into a sound signal.

Further, the microphone and the speaker may be independent components, or may be integrated into the same component, such as an earphone configured with a microphone; the microphone and the speaker may be directly disposed in the computer device, or may be connected to the computer device through a wire or wirelessly (e.g., Wi-Fi, bluetooth, etc.), which is not limited in this embodiment of the present invention.

As shown in fig. 1, the method specifically includes the following steps:

S101, when voice communication is carried out, an audio signal is received from a microphone.

In the present embodiment, at least two computer devices perform voice communication, which includes teleconferencing, hands-free car phone calls, VoIP (Voice over Internet Protocol), and the like.

In order to distinguish different computer devices, a computer device to which the present embodiment is applied may be referred to as a local-end computer device, a user using the local-end computer device to perform voice communication may be referred to as a local-end user, another computer device performing voice communication with the local-end computer device may be referred to as an opposite-end computer device, and a user using the opposite-end computer device to perform voice communication may be referred to as an opposite-end user.

On one hand, the computer equipment of the local terminal receives the audio signal sent by the computer equipment of the opposite terminal and plays the audio signal through the loudspeaker.

On the other hand, the computer device at the home terminal receives the audio signal through the microphone and transmits the audio signal to the computer device at the opposite terminal.

In a specific implementation, the audio signal played by the speaker of the local computer device reaches the microphone of that device through multiple reflections in space; therefore, the audio signal received by the local computer device has an echo signal.

When the opposite-end user makes a sound, the audio signal sent by the opposite-end computer device to the local computer device contains the voice signal of the opposite-end user, so the echo signal received after the local computer device plays that audio signal also contains the voice signal of the opposite-end user.

When the opposite-end user does not make a sound, the audio signal sent by the opposite-end computer device to the local computer device does not contain the voice signal of the opposite-end user, so the echo signal received after the local computer device plays that audio signal does not contain the voice signal of the opposite-end user either.

Of course, in the process of voice communication, besides the audio signal sent out by the local computer device, the local user also sends out a sound, and there may be environmental noise, so that besides the echo signal, the audio signal received by the local computer device may include the voice signal of the local user and may also include a noise signal.

S102, determining the power of the echo signal.

In this embodiment, the local computer device may determine the power of the echo signal contained therein for the currently received audio signal.

It should be noted that, in the process of voice communication, the local computer device continuously receives the audio signal and continuously determines the power of the echo signal in the current audio signal.

S103, determining a threshold value of the current detection double-talk state according to the power of the echo signal.

In this embodiment, the local computer device calculates a threshold value suitable for performing double-talk detection on the current voice communication by using the real-time characteristic of the echo signal.

Double-talk detection is a technique for detecting the double-talk state.

The double-talk state means that the local-end user and the opposite-end user talk at the same time.

Further, in the voice communication process, the transmission of the audio signal has a certain delay, but the delay is short. Therefore, if the audio signal received by the local computer device includes both the voice signal of the opposite-end user and the voice signal of the local-end user, the local-end user and the opposite-end user can be considered to be speaking at the same time, and the double-talk state is determined to exist.

S104, determining the power of the audio signal.

In this embodiment, the local computer device may determine the power of the currently received audio signal.

It should be noted that, in the process of voice communication, the local computer device continuously receives the audio signal and continuously determines the current power of the audio signal.

S105, if the power of the audio signal is greater than the threshold value, determining that the voice communication has the double-talk state.

The power of the audio signal is compared with the threshold value calculated from the echo signal, where the echo signal may contain the voice signal of the opposite-end user. If the power of the audio signal is greater than the threshold value, the voice signal of the local-end user is present in addition to the voice signal of the opposite-end user, so it can be determined that the double-talk state exists in the voice communication.

Conversely, if the power of the audio signal is less than or equal to the threshold value, no voice signal of the local-end user is present besides the voice signal of the opposite-end user, and it can be determined that the double-talk state does not exist in the voice communication.
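The flow of steps S102 to S105 can be sketched as follows. This is a minimal illustration and not the patented implementation: the function names, the use of the mean square of a frame as its power, and the placeholder rule that doubles the echo power to obtain a threshold are all assumptions made for the example.

```python
from typing import Callable, Sequence


def frame_power(frame: Sequence[float]) -> float:
    """Mean-square power of one frame (an assumed definition of 'power')."""
    if not frame:
        raise ValueError("frame must not be empty")
    return sum(x * x for x in frame) / len(frame)


def detect_double_talk(
    mic_frame: Sequence[float],
    echo_power: float,
    threshold_from_echo_power: Callable[[float], float],
) -> bool:
    """Apply S103 to S105 to one microphone frame.

    echo_power is the result of S102. threshold_from_echo_power is a
    hypothetical callable standing in for whichever rule S103 uses.
    """
    threshold = threshold_from_echo_power(echo_power)  # S103
    mic_power = frame_power(mic_frame)                 # S104
    return mic_power > threshold                       # S105


def placeholder_rule(echo_power: float) -> float:
    """Placeholder threshold rule, assumed purely for illustration."""
    return 2.0 * echo_power


quiet = detect_double_talk([0.1, -0.1, 0.1, -0.1], 0.01, placeholder_rule)
loud = detect_double_talk([1.0, -1.0, 1.0, -1.0], 0.01, placeholder_rule)
```

Passing the threshold rule in as a callable keeps the comparison in S105 independent of how the threshold is derived from the echo power.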

Example two

Fig. 2 is a flowchart of a method for detecting a double talk according to a second embodiment of the present invention, and the present embodiment further details the processing operations for calculating the average power and the threshold value based on the foregoing embodiments. The method specifically comprises the following steps:

S201, when voice communication is carried out, an audio signal is received from a microphone.

Wherein the audio signal has an echo signal.

S202, determining a reference audio signal.

The reference audio signal is an audio signal that can be used to estimate the power of the echo signal.

In one case, the audio signal to be played is captured by internal circuitry before it is played by the speaker and is set as the reference audio signal.

For the local computer device, after the speaker plays the audio signal, the emitted audio signal reaches the microphone through multiple reflections in space. Although the audio signal is attenuated by these reflections, it remains unchanged as a whole. Because the audio signal propagates quickly and the space (such as a room or a vehicle interior) is small, the time for the audio signal to travel from the speaker to the microphone is short and can be ignored. The audio signal received by the microphone is therefore similar to the audio signal played by the speaker, which ensures the accuracy of the subsequent estimation of the power of the echo signal.

S203, determining the average power of the reference audio signal.

In a specific implementation, assuming that the n-th frame of the reference audio signal is S(n), the average power of the reference audio signal may be expressed as:

δ_s(n) = λδ_s(n-1) + (1-λ)|S(n)|²

where δ_s denotes the average power of the reference audio signal, and λ denotes a forgetting factor, which is a preset constant.
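The recursion above is a first-order exponential average, which can be sketched as follows. The function names, the default forgetting factor of 0.9 and the initial power of 0 are assumptions chosen for illustration; the source only states that λ is a preset constant.

```python
from typing import Iterable, List


def update_average_power(previous: float, frame_value: complex,
                         forgetting: float = 0.9) -> float:
    """One step of delta_s(n) = lambda * delta_s(n-1) + (1 - lambda) * |S(n)|^2."""
    if not 0.0 <= forgetting <= 1.0:
        raise ValueError("forgetting factor must lie in [0, 1]")
    return forgetting * previous + (1.0 - forgetting) * abs(frame_value) ** 2


def running_average_power(frame_values: Iterable[complex],
                          forgetting: float = 0.9,
                          initial: float = 0.0) -> List[float]:
    """Average power after each frame, starting from an assumed initial value."""
    power = initial
    history = []
    for value in frame_values:
        power = update_average_power(power, value, forgetting)
        history.append(power)
    return history


# A constant input of magnitude 1 drives the average towards 1.
history = running_average_power([1.0] * 200)
```

A forgetting factor closer to 1 gives a smoother but slower estimate, while a smaller value tracks changes in the reference signal more quickly.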

S204, performing attenuation gain on the average power of the reference audio signal to serve as the power of the echo signal.

After the average power δ_s of the reference audio signal is obtained, the power of the echo signal can be estimated as:

δ_x(n) = Gδ_s(n)

where δ_x denotes the power of the echo signal and G denotes the attenuation gain.

Further, the audio signal played by the speaker reaches the microphone after being spatially attenuated to form an echo signal. Generally, when the distance between the speaker and the microphone is fixed, there is a relatively fixed attenuation gain between the echo signal and the reference audio signal (the audio signal played by the speaker).

Thus, an echo path between the microphone and the speaker, e.g. the loudspeaker-room-microphone (LRM) path, can be determined.

In general, each time the distance between the microphone and the speaker doubles, the gain is attenuated by a further 6 dB.

In the embodiment of the invention, an attenuation gain is applied to the average power of the reference audio signal to obtain the power of the echo signal. Because the echo signal is estimated from the real reference audio signal, the computation is simplified while the estimate of the echo signal remains accurate.
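The relation between distance, attenuation gain and echo power can be sketched as follows. This is a hedged illustration: the function names and the reference distance are hypothetical, and treating the 6 dB figure as a power-domain attenuation (converted with 10^(-dB/10)) is an assumption rather than something the source specifies.

```python
import math


def attenuation_db_for_distance(distance: float, ref_distance: float,
                                ref_attenuation_db: float) -> float:
    """Attenuation at `distance`, adding 6 dB for every doubling of distance.

    ref_attenuation_db is the (assumed known) attenuation at ref_distance.
    """
    if distance <= 0.0 or ref_distance <= 0.0:
        raise ValueError("distances must be positive")
    return ref_attenuation_db + 6.0 * math.log2(distance / ref_distance)


def gain_from_attenuation_db(attenuation_db: float) -> float:
    """Linear power gain G for an attenuation given in dB (assumed power-domain)."""
    return 10.0 ** (-attenuation_db / 10.0)


def echo_power(reference_power: float, attenuation_db: float) -> float:
    """delta_x(n) = G * delta_s(n) with G derived from the attenuation in dB."""
    return gain_from_attenuation_db(attenuation_db) * reference_power


# Doubling the distance from 1 to 2 (arbitrary units) adds 6 dB of attenuation.
extra_db = attenuation_db_for_distance(2.0, 1.0, 0.0)
estimated = echo_power(2.0, extra_db)
```

With a fixed speaker-to-microphone distance, G only needs to be computed once and can then be reused for every frame.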

Of course, the above way of calculating the power of the echo signal is only an example. When this embodiment is implemented, a person skilled in the art may adopt other ways of calculating the power of the echo signal according to actual needs, which is not limited in this embodiment.

For example, the spatial impulse response may be expressed as:

[formula missing in source]

where [symbol missing in source] is the impulse response, y is the audio signal received by the microphone, and x is the reference speech signal.

At the present moment, the echo signal can be expressed as:

[formula missing in source]

where y_e is the echo signal.

The average power of the echo signal at time n can be expressed as:

P_e(n) = λP_e(n-1) + (1-λ)|y_e(n)|²

where P_e denotes the average power of the echo signal and λ denotes the forgetting factor.

S205, determining a target value of the false alarm probability of the double-talk state.

S206, under the limit of the target value, determining the threshold value of the current detection double-talk state based on the power of the echo signal.

In a specific implementation, the detection of the double-talk state can be measured in the following probability form:

The false alarm probability refers to the probability that the double-talk state is judged to exist when it does not actually exist.

In the present embodiment, a target value may be set in advance.

When voice communication is carried out, the false alarm probability is initialized to the target value. Under the limit of the target value, the power of the echo signal is used to calculate the threshold value of the current detection double-talk state, so that other indexes (such as the detection probability and the missed-detection (undetected) probability) meet expected conditions (such as the maximum detection probability and the minimum missed-detection probability).

The detection probability refers to the probability that the system judges the target to be present when the target exists.

In the embodiment of the invention, the target value of the false alarm probability of detecting the double-talk state is determined, and the threshold value of the current detection double-talk state is determined based on the power of the echo signal under the limit of the target value. Because the threshold value is calculated with the false alarm probability limited, the double-talk detection subsequently performed with the threshold value can meet the expected conditions, which ensures the quality of the double-talk detection.

Of course, in addition to limiting the false alarm probability, this embodiment may also determine the threshold value of the current detection double-talk state based on the power of the echo signal under the limit of other indexes (such as the detection probability and the missed-detection probability), so that the false alarm probability meets an expected condition (such as the minimum false alarm probability). This embodiment does not limit this.

In one embodiment of the present invention, S206 includes:

S2061, decomposing the power of the echo signal into a variance and an average value.

S2062, determining the threshold value of the current detection double-talk state based on the variance and the average value, so that the false alarm probability of detecting the double-talk state based on the threshold value is lower than the target value.

The threshold value should reflect the power value of the current echo signal, and the observed value of any signal can be decomposed into a representation mode of mean value and variance. When the power of the echo signal cannot be accurately obtained, the power of the current echo signal can be characterized by the mean value and the standard deviation of the power of the echo signal contained in the observation signal.

At the same time, a confidence level is set for the threshold value.

When the false alarm probability is set to 1, any echo signal is detected as the double-talk state, which corresponds to a very small threshold value.

When the false alarm probability is set to 0, no echo signal is detected as the double-talk state, which corresponds to a very large threshold value.

Therefore, in this embodiment, the false alarm probability can be set to a small target value, so that the false alarm probability of detecting the double-talk state based on the threshold value is lower than the target value.

In the embodiment of the invention, the power of the echo signal is decomposed into a variance and an average value, and the threshold value of the current detection double-talk state is determined based on the variance and the average value, so that the false alarm probability of detecting the double-talk state based on the threshold value is lower than the target value. Because the threshold value for double-talk detection is calculated from the first-order and second-order statistical characteristics of the echo signal, it reflects the real environmental conditions, which improves the accuracy with which the threshold value adapts to the current environment.
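One way to turn a mean, a variance and a target false alarm probability into a threshold is sketched below. It assumes that, when no double-talk is present, the decision statistic is approximately Gaussian with the given mean and variance, so the threshold is the mean plus a multiple of the standard deviation. The function names are hypothetical and this is not presented as the exact formula of the embodiment.

```python
from statistics import NormalDist, fmean, pvariance
from typing import Sequence


def threshold_from_statistics(mean: float, variance: float,
                              target_false_alarm: float) -> float:
    """Threshold gamma such that P(statistic > gamma) equals the target.

    Assumes a Gaussian statistic: gamma = mean + sqrt(variance) * Q^{-1}(P_f),
    where Q^{-1}(p) is the inverse of the Gaussian tail probability.
    """
    if not 0.0 < target_false_alarm < 1.0:
        raise ValueError("target false alarm probability must lie in (0, 1)")
    if variance < 0.0:
        raise ValueError("variance must be non-negative")
    q_inverse = NormalDist().inv_cdf(1.0 - target_false_alarm)
    return mean + q_inverse * variance ** 0.5


def threshold_from_samples(echo_power_samples: Sequence[float],
                           target_false_alarm: float) -> float:
    """Estimate mean and variance from recent echo power values, then threshold."""
    mean = fmean(echo_power_samples)
    variance = pvariance(echo_power_samples, mean)
    return threshold_from_statistics(mean, variance, target_false_alarm)


strict = threshold_from_statistics(0.0, 1.0, 0.01)   # small target, high threshold
loose = threshold_from_statistics(0.0, 1.0, 0.10)    # larger target, lower threshold
```

A smaller target false alarm probability pushes the threshold further above the mean, which matches the limiting cases described above.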

In a specific implementation, in order to detect whether the voice communication has the double-talk state, each computer device performs N energy samplings in a detection time slot, and the energy value of the j-th frame echo signal is as follows:

[formula missing in source]

where h_j(i) is the i-th sample value of the j-th frame echo signal, and n_j(i) is the noise signal energy of the j-th frame echo signal at the i-th sampling. To simplify the model, n_j(i) is regarded as additive white noise with a mean of 0 and a variance of δ²; n_j(i) and h_j s(i) are independent of each other.

H₀ and H₁ respectively represent the two hypotheses, the presence and absence of the double-talk state.

From formula (1), under H₀, Y_j/δ² obeys a central chi-squared distribution with N degrees of freedom (N being the length of the echo signal); under H₁, Y_j/δ² obeys a non-central chi-squared distribution with N degrees of freedom, as follows:

[formula missing in source]

where λ_j = Nμ_j, and μ_j is the signal-to-noise ratio of the received signal.

According to the central limit theorem, when N is sufficiently large, Y_j approximately obeys a normal distribution, defined as follows:

[formula missing in source]

The values of all echo signals are fused to obtain a decision value Z_c, defined as follows:

[formula missing in source]

where w_j is the weighting coefficient of the j-th frame echo signal.

Because Y_j, j = 1, 2, …, J, are random variables that obey a normal distribution, Z_c is also a random variable that obeys a normal distribution, defined as follows:

[formula missing in source]

The decision value Z_c is compared with the threshold value γ_c to obtain a decision result, defined as follows:

[formula missing in source]

The false alarm probability P_f and the missed-detection (undetected) probability P_m can be obtained from formulas (4) to (6), defined as follows:

[formula missing in source]

where [definition missing in source]

The false alarm probability P_f determines the maximum frequency utilization. When the false alarm probability P_f and the missed-detection probability P_m are unknown, the Neyman-Pearson decision criterion can be introduced, so that the detection performance problem of the system becomes: with the false alarm probability P_f held at a small value, minimize the missed-detection probability P_m; or, with the missed-detection probability P_m held at a small value, minimize the false alarm probability P_f.

In this embodiment, a small false alarm probability P_f can be preset. When the false alarm probability P_f is known, the threshold value γ_c is obtained from formula (7) as follows:

[formula missing in source]

where [definition missing in source]

That is, formula (9) can be converted to:

[formula missing in source]

where [definition missing in source]
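For orientation, the textbook form of an energy detector with a Gaussian approximation is sketched below. It is not recovered from the source and may differ from the embodiment's own formulas. The sketch assumes real-valued noise of variance δ², frames Y_j that are independent of each other, and a decision statistic that is compared against γ_c; the symbols x_j(i), μ_0 and σ_0 are introduced only for this illustration.

```latex
% Assumed signal model (standard energy detection, not taken from the source)
Y_j = \sum_{i=1}^{N} |x_j(i)|^2, \qquad
x_j(i) =
\begin{cases}
n_j(i), & \text{signal component absent} \\
h_j s(i) + n_j(i), & \text{signal component present}
\end{cases}

% Gaussian approximation for large N, with \lambda_j = N\mu_j
Y_j \sim
\begin{cases}
\mathcal{N}\!\left(N\delta^2,\; 2N\delta^4\right), & \text{signal component absent} \\
\mathcal{N}\!\left((N+\lambda_j)\delta^2,\; 2(N+2\lambda_j)\delta^4\right), & \text{signal component present}
\end{cases}

% Weighted fusion and its statistics in the signal-absent case
Z_c = \sum_{j=1}^{J} w_j Y_j, \qquad
\mu_0 = N\delta^2 \sum_{j=1}^{J} w_j, \qquad
\sigma_0^2 = 2N\delta^4 \sum_{j=1}^{J} w_j^2

% False alarm probability and the threshold obtained by inverting it
P_f = Q\!\left(\frac{\gamma_c - \mu_0}{\sigma_0}\right), \qquad
\gamma_c = \mu_0 + \sigma_0\, Q^{-1}(P_f), \qquad
Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2}\, dt
```

Under these assumptions the threshold is the mean of the decision value plus a multiple of its standard deviation, with the multiple fixed by the preset false alarm probability.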

Of course, the above manner of calculating the threshold value is only an example. When this embodiment is implemented, a person skilled in the art may adopt other manners of calculating the threshold value according to actual needs. For example, the average power may be taken as a regularization term, and the threshold value of the current detection double-talk state determined based on the regularization term, so that the false alarm probability of detecting the double-talk state based on the threshold value is lower than the target value; or an objective function may be constructed with the average power and optimized to calculate the threshold value of the current detection double-talk state, so that the false alarm probability of detecting the double-talk state based on the threshold value is lower than the target value. This embodiment is not limited thereto.

S207, determining the power of the audio signal.

S208, if the power of the audio signal is greater than the threshold value, determining that the voice communication has the double-talk state.

In a specific implementation, y (n) ═ y may be used_n,y_n-1,…,y_n-N+1)^TA signal vector representing a currently received audio signal, the signal vector containing observations of the current to past time instants N-1.

Using Y^H(n) Y (n) represents the power of the currently received audio signal, threshold value gamma_c(n) represents the power of the audio signal in the current received signal vector.

Then, $Y^H(n)Y(n)$ is compared with $\gamma_c(n)$: when $Y^H(n)Y(n) > \gamma_c(n)$, it is determined that the computer device has the double-talk state; otherwise, it is determined that the double-talk state does not exist.
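
The comparison described above can be sketched as follows. This is a minimal sketch, assuming the signal vector is held as a plain sequence of the N most recent samples and that the threshold value has already been computed elsewhere; the function names `frame_power` and `is_double_talk` are hypothetical.

```python
from typing import Sequence


def frame_power(y: Sequence[complex]) -> float:
    """Return Y^H(n) Y(n): the sum of squared magnitudes of the N samples."""
    return float(sum(abs(v) ** 2 for v in y))


def is_double_talk(y: Sequence[complex], gamma_c: float) -> bool:
    """Declare the double-talk state when the frame power exceeds gamma_c."""
    return frame_power(y) > gamma_c
```

The comparison is strict, so a frame whose power exactly equals the threshold value is not reported as double talk.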

S209, performing echo cancellation on the audio signal according to the double-talk state.

If the voice communication has a double-talk state, i.e. multiple parties talk simultaneously during the voice communication, the audio signal collected from the microphone of the computer device includes the voice signal (i.e. echo signal) of the opposite end and the voice signal of the local end, which are mixed together.

When echo cancellation is performed in a dual-talk state, on one hand, the voice signal of the local terminal is protected from being damaged, and on the other hand, the voice signal (i.e., echo signal) of the opposite terminal is removed as much as possible.

Generally, in the case that the voice signal (i.e. echo signal) of the opposite end is higher than the voice signal of the local end by about 6dB to 8dB, if the voice signal (i.e. echo signal) of the opposite end is to be removed, the voice signal of the local end is damaged more or less.

In addition, if the voice signal (i.e., echo signal) of the opposite end is higher than the voice signal of the home end by more than 18dB, for example, the speaker is closer to the microphone, and the voice signal (i.e., echo signal) of the opposite end masks the voice signal of the home end, the echo cancellation effect is not obvious. In this case, it is possible to eliminate the voice signal (i.e., echo signal) of the opposite end together with the voice signal of the home end, and then appropriately fill the comfort noise.
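
The level-dependent handling described above can be illustrated with a small sketch. It assumes that both levels are available as positive powers, that the level difference is expressed as a power ratio in dB, and that 18 dB is used as the masking limit mentioned above; the function names and the returned strategy labels are hypothetical.

```python
import math


def echo_to_near_end_db(echo_power: float, near_end_power: float) -> float:
    """Return how far the echo level lies above the near-end level, in dB."""
    if echo_power <= 0.0 or near_end_power <= 0.0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(echo_power / near_end_power)


def cancellation_strategy(echo_power: float, near_end_power: float,
                          mask_db: float = 18.0) -> str:
    """Pick a handling strategy from the echo-to-near-end level difference."""
    ratio_db = echo_to_near_end_db(echo_power, near_end_power)
    if ratio_db > mask_db:
        # The echo masks the near-end speech: remove both, then fill comfort noise.
        return "suppress_all_and_fill_comfort_noise"
    # Otherwise remove the echo while protecting the near-end speech.
    return "cancel_echo_and_protect_near_end"
```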

Further, echo cancellation includes the following two steps:

1. linear adaptive filtering

The linear adaptive filtering solves fe = f(fs), establishes a speech model of the voice signal (i.e., echo signal) of the opposite end, and performs a first round of echo cancellation.

Here, fs is the far-end signal and fe is the far-end echo.

The echo path can be regarded as an 'environment filter': after being processed by the 'environment filter', the far-end speech signal becomes the far-end echo signal. Echo cancellation constructs an 'algorithm filter' that models this 'environment filter'.

Based on the speech model of the voice signal (i.e., echo signal) of the opposite end, the coefficients of the 'algorithm filter' are continuously adjusted so that the estimated value approaches the real echo signal; the closer the estimated value is to the real echo signal, the better the echo cancellation effect.
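
The continuous adjustment of filter coefficients can be illustrated with a normalized least-mean-squares (NLMS) update. NLMS is chosen here only as a common example of a linear adaptive filter; the source does not name a specific algorithm, and the function name `nlms_echo_canceller` and the parameters `num_taps`, `step` and `eps` are hypothetical.

```python
from typing import List, Sequence, Tuple


def nlms_echo_canceller(far_end: Sequence[float], mic: Sequence[float],
                        num_taps: int = 4, step: float = 0.5,
                        eps: float = 1e-8) -> Tuple[List[float], List[float]]:
    """Return (error signal, final filter coefficients).

    The error signal is the microphone signal minus the estimated echo,
    i.e. the output of the first round of echo cancellation.
    """
    weights = [0.0] * num_taps
    x_buf = [0.0] * num_taps  # most recent far-end samples, newest first
    errors: List[float] = []
    for x, d in zip(far_end, mic):
        x_buf = [x] + x_buf[:-1]
        echo_estimate = sum(w * xi for w, xi in zip(weights, x_buf))
        e = d - echo_estimate
        norm = sum(xi * xi for xi in x_buf) + eps
        # Move the coefficients toward the echo path, scaled by input energy.
        weights = [w + step * e * xi / norm for w, xi in zip(weights, x_buf)]
        errors.append(e)
    return errors, weights
```

In practice the update would typically be slowed or frozen while the double-talk state is detected, since near-end speech in the error signal would otherwise disturb the coefficients; that control logic is omitted here.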

2. Non-linear processing

The nonlinear processing is divided into two steps: residual echo processing and non-linear clipping processing.

The residual echo processing performs a second round of echo cancellation to process the residual echo.

The strategy of residual echo cancellation is to further cancel the residual echo by using the correlation between the residual echo left after the adaptive filter and the far-end reference audio signal. The greater the correlation, the more residual echo remains and the more strongly it needs to be further cancelled; conversely, the smaller the correlation, the less residual echo remains and the less it needs to be further cancelled. Therefore, a correlation matrix of the residual echo and the reference audio signal is first calculated to obtain an attenuation factor reflecting the degree of cancellation; the residual echo is then multiplied by the attenuation factor to further cancel the residual echo.
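
The relationship between correlation and attenuation can be sketched as follows. As a simplification, the sketch replaces the correlation matrix with a single normalized correlation coefficient over one frame and maps it to the attenuation factor as 1 - |rho|; this mapping and the function names are assumptions made for illustration, not the computation of this embodiment.

```python
import math
from typing import List, Sequence


def attenuation_factor(residual: Sequence[float], reference: Sequence[float],
                       eps: float = 1e-12) -> float:
    """Return a factor in [0, 1]; high correlation gives a small factor."""
    cross = sum(r * x for r, x in zip(residual, reference))
    energy = sum(r * r for r in residual) * sum(x * x for x in reference)
    rho = abs(cross) / (math.sqrt(energy) + eps)
    return 1.0 - min(rho, 1.0)


def suppress_residual(residual: Sequence[float],
                      reference: Sequence[float]) -> List[float]:
    """Multiply the residual echo by the attenuation factor."""
    g = attenuation_factor(residual, reference)
    return [g * r for r in residual]
```

With this mapping, a residual that is strongly correlated with the reference is attenuated almost completely, while an uncorrelated residual passes through unchanged.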

The nonlinear clipping process is a process of clipping a speech signal whose attenuation amount reaches a threshold value.

If the voice communication is in the double-talk state, the linear adaptive filter cancels the echo signal while damaging the near-end voice quality as little as possible; the amount of echo suppression is therefore kept moderate, and the attenuation factor is relatively large.

It should be noted that echo suppression keeps working throughout. When the computer device at the opposite end sends silence packets instead of a voice signal, there is no far-end voice signal, so echo suppression is equivalent to not working; the voice signal of the local end passes through echo suppression and reaches the computer device at the opposite end without being affected by the result of double-talk detection.

In this embodiment, if it is detected that the voice communication has the double-talk state, the voice signal is subjected to echo cancellation in a targeted manner, so as to accurately detect the human voice, improve the recovery quality of the echo cancellation on the human voice signal, and thus improve the performance of the echo cancellation.

Example three

Fig. 3 is a schematic structural diagram of a dual-talk detection apparatus provided in the third embodiment of the present invention, where the apparatus may specifically include the following modules:

an audio signal receiving module 301, configured to receive an audio signal from a microphone when performing voice communication, where the audio signal has an echo signal;

a first power determination module 302 for determining the power of the echo signal;

a threshold value determining module 303, configured to determine a threshold value of a currently detected dual-talk state according to the power of the echo signal;

a second power determination module 304 for determining the power of the audio signal;

a dual-talk state determining module 305, configured to determine that the voice communication exists in the dual-talk state if the power of the audio signal is greater than the threshold value.
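
The cooperation of modules 302 to 305 can be sketched as a small class. This is a minimal sketch under stated assumptions: the echo power is taken as the average power of the reference audio signal scaled by a fixed attenuation gain, the threshold rule is supplied by the caller, and the audio signal power is also measured as an average so that the two quantities are comparable; the names `DoubleTalkDetector`, `attenuation_gain` and `threshold_fn` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


def average_power(frame: Sequence[float]) -> float:
    """Return the mean squared magnitude of the samples in a frame."""
    return sum(abs(v) ** 2 for v in frame) / len(frame)


@dataclass
class DoubleTalkDetector:
    attenuation_gain: float                  # assumed echo-path power gain
    threshold_fn: Callable[[float], float]   # maps echo power to a threshold

    def detect(self, mic_frame: Sequence[float],
               reference_frame: Sequence[float]) -> bool:
        # First power determination (302): echo power from the reference signal.
        echo_power = self.attenuation_gain * average_power(reference_frame)
        # Threshold value determination (303).
        threshold = self.threshold_fn(echo_power)
        # Second power determination (304): power of the microphone signal.
        audio_power = average_power(mic_frame)
        # Double-talk state determination (305).
        return audio_power > threshold
```

Passing the threshold rule as a callable keeps the sketch independent of any particular threshold formula.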

In one embodiment of the present invention, the first power determining module 302 comprises:

a reference average power determination sub-module for determining an average power of the reference audio signal;

In one embodiment of the invention, the reference audio signal determination submodule comprises:

In one embodiment of the invention, the attenuation gain sub-module comprises:

In an embodiment of the present invention, the threshold value determining module 303 includes:

In an embodiment of the present invention, the threshold value calculation sub-module includes:

In one embodiment of the present invention, the apparatus further comprises:

an echo cancellation module, configured to perform echo cancellation on the audio signal according to the double-talk state.

The double-talk detection device provided by the embodiment of the invention can execute the double-talk detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

Example four

Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. As shown in fig. 4, the computer apparatus includes a processor 400, a memory 401, a communication module 402, an input device 403, and an output device 404; the number of processors 400 in the computer device may be one or more, and one processor 400 is taken as an example in fig. 4; the processor 400, the memory 401, the communication module 402, the input device 403 and the output device 404 in the computer apparatus may be connected by a bus or other means, and fig. 4 illustrates an example of connection by a bus.

The memory 401 is used as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as the modules corresponding to the dual talk detection method in the present embodiment (for example, the audio signal receiving module 301, the first power determining module 302, the threshold value determining module 303, the second power determining module 304, and the dual talk state determining module 305 in the dual talk detection apparatus shown in fig. 3). The processor 400 executes various functional applications and data processing of the computer device by executing software programs, instructions and modules stored in the memory 401, that is, the above-mentioned double talk detection method is realized.

The memory 401 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 401 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 401 may further include memory located remotely from processor 400, which may be connected to a computer device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The communication module 402 is configured to establish a connection with the display screen and to exchange data with the display screen.

The input device 403 may include a microphone or the like, and may be used to receive input audio signals, numeric or character information, and to generate key signal inputs related to user settings and function control of the computer apparatus.

The output device 404 may include a speaker or the like, which may be used to output audio signals.

The computer device provided by the embodiment of the invention can execute the double talk detection method provided by any embodiment of the invention, and has corresponding functions and beneficial effects.

Example five

An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a method for detecting a double talk, and the method includes:

determining a power of the echo signal;

determining a power of the audio signal;

Of course, the computer program of the computer-readable storage medium provided in the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the double talk detection method provided in any embodiment of the present invention.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the embodiment of the dual-talk detection apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A double-talk detection method, comprising:

determining a power of the echo signal, comprising: determining a reference audio signal; determining an average power of the reference audio signal; performing attenuation gain on the average power of the reference audio signal as the power of the echo signal;

determining a power of the audio signal;

if the power of the audio signal is larger than the threshold value, determining that the voice communication has the double-talk state;

wherein, the determining the threshold value of the current detection dual-talk state according to the power of the echo signal comprises:

under the limitation of the target value, determining a threshold value of the current detection dual-talk state based on the power, including: decomposing the power of the echo signal into a variance and a mean; and determining a threshold value of the current detection double-talk state based on the variance and the average value so as to enable the false alarm probability of the detection double-talk state based on the threshold value to be lower than the target value.

2. The method of claim 1, wherein the determining the reference audio signal comprises:

3. The method of claim 2, wherein the attenuating the average power of the reference audio signal by a gain as the power of the echo signal comprises:

determining an echo path between the microphone and the speaker;

and performing attenuation gain on the power of the reference audio signal according to the echo path to serve as the power of the echo signal.

4. The method of claim 1,2 or 3, further comprising:

and carrying out echo cancellation on the audio signal according to the double-talk state.

5. A dual talk detection device, comprising:

a first power determination module for determining a power of the echo signal;

a double-talk state determination module, configured to determine that the voice communication has the double-talk state if the power of the audio signal is greater than the threshold value;

wherein the threshold value determining module comprises:

a threshold value calculation submodule, configured to determine a threshold value of a currently detected dual-talk state based on the power of the echo signal under the restriction of the target value;

wherein the first power determination module comprises:

the attenuation gain submodule is used for carrying out attenuation gain on the average power of the reference audio signal to be used as the power of the echo signal;

wherein, the threshold value calculation sub-module comprises:

6. A computer device, characterized in that the computer device comprises:

one or more processors;

a memory for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the double talk detection method of any one of claims 1-4.

7. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the double talk detection method according to any one of claims 1 to 4.