CN113223547B - Double-talk detection method, device, equipment and medium - Google Patents

Double-talk detection method, device, equipment and medium

Info

Publication number
CN113223547B
Authority
CN
China
Prior art keywords
far
audio signal
amplitude value
threshold value
frequency amplitude
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110478318.0A
Other languages
Chinese (zh)
Other versions
CN113223547A (en)
Inventor
郝一亚
阮良
陈功
陈丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Netease Zhiqi Technology Co Ltd
Original Assignee
Hangzhou Netease Zhiqi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Hangzhou Netease Zhiqi Technology Co Ltd
Priority to CN202110478318.0A
Publication of CN113223547A
Application granted
Publication of CN113223547B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)

Abstract

The disclosure provides a double-talk detection method, device, equipment and medium, relating to the technical field of communication. The double-talk detection method comprises: acquiring a near-end audio signal and a far-end audio signal, and detecting whether a near-end frequency amplitude value exists in a target low-frequency band of the near-end audio signal; determining, based on the detection result of the near-end frequency amplitude value, whether a near-end voice signal exists in the near-end audio signal, so as to obtain a near-end detection result; detecting whether a far-end voice signal exists in the far-end audio signal to obtain a far-end detection result; and determining, based on the near-end detection result and the far-end detection result, whether a double-talk state exists. The method can accurately detect the near-end voice signal and thereby determine the double-talk state.

Description

Double-talk detection method, device, equipment and medium
Technical Field
The disclosure relates to the technical field of communication, and in particular relates to a double-talk detection method, a double-talk detection device, double-talk detection equipment and double-talk detection media.
Background
This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Echo cancellation (Acoustic Echo Cancellation, AEC) is a signal processing technique that functions to suppress and cancel echo signals in a communication system, ensure that users are not interfered by echo signals, and improve call quality.
The AEC method in the related art generally uses an adaptive linear filter to estimate an echo signal and then cancels the echo signal in the communication system according to the estimated echo signal. To improve the effect of the adaptive linear filter in the AEC method, a double-talk detection (DTD) module is usually added to cooperate with the adaptive linear filter.
The double-talk detection module is configured to detect the talk state of the two communication parties, for example, the double-talk state in which both parties talk simultaneously. In the related art, one end of the communication determines whether a double-talk state exists by detecting the local voice signal (i.e., the near-end voice signal) and the voice signal of the other end (i.e., the far-end voice signal). Detecting the far-end voice signal is generally easy, but because the audio signal collected by the local microphone includes not only the near-end voice signal but also an echo of the far-end voice signal, the near-end voice signal is difficult to detect.
Disclosure of Invention
The embodiment of the disclosure provides a double-talk detection method, a double-talk detection device, double-talk detection equipment and a double-talk detection medium, which are used for accurately detecting a near-end voice signal so as to determine a double-talk state.
In a first aspect, an embodiment of the present disclosure provides a double-talk detection method, including:
acquiring a near-end audio signal and a far-end audio signal;
detecting whether a near-end frequency amplitude value exists in a target low-frequency band of the near-end audio signal;
determining whether a near-end voice signal exists in the near-end audio signal based on the detection result of the near-end frequency amplitude value, so as to obtain a near-end detection result;
detecting whether a far-end voice signal exists in the far-end audio signal to obtain a far-end detection result;
and determining whether a double-talk state exists based on the near-end detection result and the far-end detection result.
In one possible implementation manner, the determining whether a near-end speech signal exists in the near-end audio signal based on the detection result of the near-end frequency amplitude value includes:
if the near-end frequency amplitude value exists, detecting a far-end frequency amplitude value corresponding to a target frequency point in the target low-frequency band of the far-end audio signal, where the target frequency point is the frequency point corresponding to the near-end frequency amplitude value;
and comparing the energy corresponding to the near-end frequency amplitude value with the energy corresponding to the far-end frequency amplitude value, and determining whether the near-end voice signal exists in the near-end audio signal according to a comparison result.
In one possible implementation manner, the comparing the energy corresponding to the near-end frequency amplitude value with the energy corresponding to the far-end frequency amplitude value, and determining whether the near-end voice signal exists in the near-end audio signal according to the comparison result includes:
comparing the difference between the energy corresponding to the near-end frequency amplitude value and the energy corresponding to the far-end frequency amplitude value with a first threshold value;
if the difference is not greater than the first threshold value, determining that the near-end voice signal does not exist in the near-end audio signal;
if the difference is greater than the first threshold value, determining that the near-end voice signal exists in the near-end audio signal; or
comparing the ratio of the energy corresponding to the near-end frequency amplitude value to the energy corresponding to the far-end frequency amplitude value with a second threshold value;
if the ratio is not greater than the second threshold value, determining that the near-end voice signal does not exist in the near-end audio signal;
and if the ratio is greater than the second threshold value, determining that the near-end voice signal exists in the near-end audio signal.
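The two comparison variants above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the `mode` parameter, the near-minus-far direction of the difference, and the handling of zero far-end energy are all assumptions, and the text does not say whether energies are linear or in dB.

```python
def near_end_speech_present(near_energy, far_energy, threshold, mode="difference"):
    """Decide whether near-end speech is present at the target frequency point.

    near_energy, far_energy: energies at the same frequency point.
    mode="difference": compare (near - far) with the first threshold value.
    mode="ratio":      compare (near / far) with the second threshold value.
    """
    if mode == "difference":
        # Assumption: the difference is taken as near-end minus far-end energy.
        return (near_energy - far_energy) > threshold
    if mode == "ratio":
        if far_energy <= 0.0:
            # Assumption: with no far-end energy there is no echo at this point,
            # so any near-end energy is attributed to near-end speech.
            return near_energy > 0.0
        return (near_energy / far_energy) > threshold
    raise ValueError("mode must be 'difference' or 'ratio'")
```

Both variants use a strict "greater than" test, so a value exactly equal to the threshold counts as "no near-end voice signal", matching the "not greater than" branch in the text.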
In one possible embodiment, the method further comprises:
determining whether an update condition of the first threshold value or the second threshold value is met based on one or more of the energy corresponding to the near-end frequency amplitude value, the energy corresponding to the far-end frequency amplitude value, and the energy mean square error of the far-end audio signal, where the energy mean square error of the far-end audio signal is the mean square error of the energies, at the target frequency point, of the far-end audio signals acquired within a set time period;
and if the update condition of the first threshold value or the second threshold value is met, updating the first threshold value or the second threshold value.
In one possible implementation, the update condition of the first threshold value or the second threshold value includes one or more of the following:
$E[A_k(m_0)] > E[M_k(m_0)]$
$\mathrm{STD}\{E[A_{k-n}(m_0)],\, E[A_{k-n+1}(m_0)],\, \ldots,\, E[A_k(m_0)]\} < E_{std}$
$E[A_k(m_0)] > \beta\, E[\mathrm{noise\ floor}_{\mathrm{FarEnd}}]$
where $m_0$ is the target frequency point; $k-n, k-n+1, \ldots, k$ are the frame numbers of the far-end audio signals, $k$ is the current frame number, $n$ is an integer greater than or equal to 0, and $k$ is greater than $n$; $A_k(m_0)$ is the far-end frequency amplitude value; $M_k(m_0)$ is the near-end frequency amplitude value; $E[A_k(m_0)]$ is the energy corresponding to the far-end frequency amplitude value; $E[M_k(m_0)]$ is the energy corresponding to the near-end frequency amplitude value; $\mathrm{STD}\{\cdot\}$ denotes the mean square error; $E_{std}$ is the energy mean square error threshold of the far-end audio signal; $\beta$ is the energy threshold coefficient of the far-end audio signal; and $E[\mathrm{noise\ floor}_{\mathrm{FarEnd}}]$ is the average energy of the background noise of the far-end audio signal.
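As an illustration of the three update conditions, the sketch below evaluates each one separately and leaves the combination to the caller, since the text only says "one or more of the following". The function and key names are hypothetical, and reading STD{·} as the population standard deviation (`statistics.pstdev`) is an assumption.

```python
import statistics


def threshold_update_conditions(far_energies, near_energy, e_std, beta, far_noise_floor):
    """Evaluate the three update conditions at the target frequency point.

    far_energies:    E[A_{k-n}(m0)], ..., E[A_k(m0)]; the last item is the current frame.
    near_energy:     E[M_k(m0)] for the current frame.
    e_std:           energy mean square error threshold of the far-end signal.
    beta:            energy threshold coefficient of the far-end signal.
    far_noise_floor: average background-noise energy of the far-end signal.
    """
    current_far = far_energies[-1]
    return {
        # Far-end energy exceeds near-end energy at the target frequency point.
        "far_exceeds_near": current_far > near_energy,
        # Far-end energy is stable over the recent frames
        # (assumption: STD is the population standard deviation).
        "far_energy_stable": statistics.pstdev(far_energies) < e_std,
        # Far-end energy is well above the far-end noise floor.
        "far_above_noise_floor": current_far > beta * far_noise_floor,
    }
```

A caller that requires all three conditions could use `all(conditions.values())`; one that accepts any single condition could use `any(...)`.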
In a possible implementation manner, the updating the first threshold value or the second threshold value includes:
updating the first threshold value by the following formula:
where [threshold symbol not reproduced] is the first threshold value; $\gamma$ is a set coefficient; $E[A_k(m_0)]$ is the energy corresponding to the far-end frequency amplitude value; $E[M_k(m_0)]$ is the energy corresponding to the near-end frequency amplitude value; or
updating the second threshold value by the following formula:
where [threshold symbol not reproduced] is the second threshold value.
In one possible implementation manner, the determining whether a near-end speech signal exists in the near-end audio signal based on the detection result of the near-end frequency amplitude value further includes:
and if the near-end frequency amplitude value does not exist in the target low-frequency band of the near-end audio signal, determining that the near-end voice signal does not exist in the near-end audio signal.
In one possible implementation manner, the determining whether to be in a two-talk state based on the near-end detection result and the far-end detection result includes:
if the far-end voice signal exists in the far-end audio signal and the near-end voice signal exists in the near-end audio signal, determining that the current state is a double-talk state; or
if the far-end voice signal exists in the far-end audio signal and the near-end voice signal does not exist in the near-end audio signal, determining that the current state is a far-end single-talk state; or
if the far-end voice signal does not exist in the far-end audio signal and the near-end voice signal exists in the near-end audio signal, determining that the current state is a near-end single-talk state.
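The three cases above amount to a small decision table over the two detection results. A minimal sketch follows; the enum and function names are hypothetical, and the fourth case (no voice at either end) is not named in the text, so its `SILENCE` label is an assumption.

```python
from enum import Enum


class TalkState(Enum):
    DOUBLE_TALK = "double-talk"
    FAR_END_SINGLE_TALK = "far-end single-talk"
    NEAR_END_SINGLE_TALK = "near-end single-talk"
    SILENCE = "silence"  # assumption: the text does not name this case


def classify_talk_state(far_present, near_present):
    """Map the far-end and near-end detection results to a talk state."""
    if far_present and near_present:
        return TalkState.DOUBLE_TALK
    if far_present:
        return TalkState.FAR_END_SINGLE_TALK
    if near_present:
        return TalkState.NEAR_END_SINGLE_TALK
    return TalkState.SILENCE
```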
In a second aspect, embodiments of the present disclosure further provide a double-talk detection apparatus, including:
an audio acquisition module, configured to acquire a near-end audio signal and a far-end audio signal;
a low-frequency band detection module, configured to detect whether a near-end frequency amplitude value exists in a target low-frequency band of the near-end audio signal;
a near-end voice detection module, configured to determine whether a near-end voice signal exists in the near-end audio signal based on the detection result of the near-end frequency amplitude value, so as to obtain a near-end detection result;
a far-end voice detection module, configured to detect whether a far-end voice signal exists in the far-end audio signal to obtain a far-end detection result;
and a double-talk detection module, configured to determine whether a double-talk state exists based on the near-end detection result and the far-end detection result.
In one possible implementation, the near-end voice detection module further includes:
a detection sub-module, configured to detect, if the near-end frequency amplitude value exists, a far-end frequency amplitude value corresponding to a target frequency point in the target low-frequency band of the far-end audio signal, where the target frequency point is the frequency point corresponding to the near-end frequency amplitude value;
and a comparison sub-module, configured to compare the energy corresponding to the near-end frequency amplitude value with the energy corresponding to the far-end frequency amplitude value and to determine whether the near-end voice signal exists in the near-end audio signal according to the comparison result.
In a possible implementation manner, the comparing sub-module is further configured to:
comparing the difference between the energy corresponding to the near-end frequency amplitude value and the energy corresponding to the far-end frequency amplitude value with a first threshold value;
if the difference is not greater than the first threshold value, determining that the near-end voice signal does not exist in the near-end audio signal;
if the difference is greater than the first threshold value, determining that the near-end voice signal exists in the near-end audio signal; or
comparing the ratio of the energy corresponding to the near-end frequency amplitude value to the energy corresponding to the far-end frequency amplitude value with a second threshold value;
if the ratio is not greater than the second threshold value, determining that the near-end voice signal does not exist in the near-end audio signal;
and if the ratio is greater than the second threshold value, determining that the near-end voice signal exists in the near-end audio signal.
In a possible implementation manner, the apparatus further includes a threshold updating module, configured to:
determining whether an update condition of the first threshold value or the second threshold value is met based on one or more of the energy corresponding to the near-end frequency amplitude value, the energy corresponding to the far-end frequency amplitude value, and the energy mean square error of the far-end audio signal, where the energy mean square error of the far-end audio signal is the mean square error of the energies, at the target frequency point, of the far-end audio signals acquired within a set time period;
and if the update condition of the first threshold value or the second threshold value is met, updating the first threshold value or the second threshold value.
In one possible implementation, the update condition of the first threshold value or the second threshold value includes one or more of the following:
$E[A_k(m_0)] > E[M_k(m_0)]$
$\mathrm{STD}\{E[A_{k-n}(m_0)],\, E[A_{k-n+1}(m_0)],\, \ldots,\, E[A_k(m_0)]\} < E_{std}$
$E[A_k(m_0)] > \beta\, E[\mathrm{noise\ floor}_{\mathrm{FarEnd}}]$
where $m_0$ is the target frequency point; $k-n, k-n+1, \ldots, k$ are the frame numbers of the far-end audio signals, $k$ is the current frame number, $n$ is an integer greater than or equal to 0, and $k$ is greater than $n$; $A_k(m_0)$ is the far-end frequency amplitude value; $M_k(m_0)$ is the near-end frequency amplitude value; $E[A_k(m_0)]$ is the energy corresponding to the far-end frequency amplitude value; $E[M_k(m_0)]$ is the energy corresponding to the near-end frequency amplitude value; $\mathrm{STD}\{\cdot\}$ denotes the mean square error; $E_{std}$ is the energy mean square error threshold of the far-end audio signal; $\beta$ is the energy threshold coefficient of the far-end audio signal; and $E[\mathrm{noise\ floor}_{\mathrm{FarEnd}}]$ is the average energy of the background noise of the far-end audio signal.
In a possible implementation manner, the threshold updating module is further configured to:
updating the first threshold value by the following formula:
where [threshold symbol not reproduced] is the first threshold value; $\gamma$ is a set coefficient; $E[A_k(m_0)]$ is the energy corresponding to the far-end frequency amplitude value; $E[M_k(m_0)]$ is the energy corresponding to the near-end frequency amplitude value; or
updating the second threshold value by the following formula:
where [threshold symbol not reproduced] is the second threshold value.
In a third aspect, the present disclosure also provides an electronic device, comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the computer program, when executed by the processor, causes the processor to implement the steps of any of the double-talk detection methods of the first aspect.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of any of the double-talk detection methods of the first aspect.
The double-talk detection method provided by the embodiment of the disclosure has at least the following beneficial effects:
In the scheme provided by the embodiments of the present disclosure, a near-end audio signal and a far-end audio signal are first acquired, and whether a near-end voice signal exists in the near-end audio signal is determined by detecting whether a near-end frequency amplitude value exists in a target low-frequency band of the near-end audio signal, so as to obtain a near-end detection result; whether a far-end voice signal exists in the far-end audio signal is then detected to obtain a far-end detection result; and finally, whether a double-talk state exists is determined based on the near-end detection result and the far-end detection result. Detecting whether the near-end frequency amplitude value exists in the target low-frequency band of the near-end audio signal makes it possible to determine whether nonlinear distortion exists in that band, and hence whether a near-end voice signal exists in the near-end audio signal. In this way, the embodiments of the present disclosure can accurately detect the near-end voice signal, and the double-talk state is then determined from the detection results of the near-end and far-end voice signals.
Additional features and advantages of the disclosure will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the disclosure. The objectives and other advantages of the disclosure will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
FIG. 1 is a schematic diagram of an application scenario of a double-talk detection method according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of an AEC based on a double-talk detection module in the related art;
FIG. 3 is a flowchart of a double-talk detection method provided in an embodiment of the present disclosure;
FIG. 4 is a flowchart of another double-talk detection method provided by an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an AEC based on a double-talk detection module according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a double-talk detection apparatus according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of another double-talk detection apparatus according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
For the purpose of promoting an understanding of the principles and advantages of the disclosure, reference will now be made in detail to the drawings, in which it is apparent that the embodiments described are only some, but not all embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments that a person of ordinary skill in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
It should be noted that the terms "first," "second," and the like in the description and in the claims of the present disclosure are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, some of the concepts involved in the embodiments of the present disclosure are explained below.
Echo signal (Echo): in a communication system, sound played by the local end travels along the acoustic path and is picked up again by a sensor at the local end; for example, sound played by the speaker of a terminal device is picked up again by its microphone.
Double-talk state: a scenario in which the terminal devices at both ends input voice signals simultaneously during real-time communication. The two ends may be referred to as the near end and the far end.
Current AEC methods typically use an adaptive linear filter to estimate the echo signal and then cancel the echo signal in the communication system based on the estimate. To improve the performance of the adaptive linear filter in the AEC method, a double-talk detection (DTD) module is usually added to cooperate with the adaptive linear filter. The structure of an AEC based on a double-talk detection module is described below with reference to FIG. 2.
As shown in FIG. 2, x(n) represents the far-end signal transmitted from the far end to the local end, d(n) represents the near-end signal acquired by the local microphone, v(n) represents the near-end voice signal, and y(n) represents the echo signal; d(n) may contain both v(n) and y(n), or only one of them. The double-talk detection module 21 determines the speaking states of the two ends by detecting the far-end voice signal in x(n) and the near-end voice signal in d(n): if both ends speak simultaneously, the state is a double-talk state; if only one end speaks, the state is a single-talk state. The double-talk detection module 21 sends the speaking state to the adaptive algorithm module 22, which determines the filtering parameters of the adaptive filter 23 according to the speaking state and sends them to the adaptive filter 23, so that the adaptive filter 23 filters according to these parameters to suppress the echo signal.
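To make the role of the speaking state concrete, the sketch below shows one common way an adaptive algorithm can use it: a normalized LMS (NLMS) update that is frozen during double-talk so that near-end speech does not corrupt the echo-path estimate. The text does not specify the adaptive algorithm or how the filtering parameters depend on the speaking state, so NLMS, the freeze rule, and all names and default values here are assumptions.

```python
def nlms_step(weights, far_history, mic_sample, double_talk, mu=0.5, eps=1e-8):
    """Run one NLMS iteration and return (error, new_weights).

    weights:     current filter taps estimating the echo path.
    far_history: most recent far-end samples x(n), x(n-1), ... (same length as weights).
    mic_sample:  current near-end sample d(n).
    double_talk: speaking state reported by the double-talk detector.
    """
    # Echo estimate: filter output for the current far-end history.
    echo_estimate = sum(w * x for w, x in zip(weights, far_history))
    # Error signal: microphone sample with the estimated echo removed.
    error = mic_sample - echo_estimate
    if double_talk:
        # Assumption: adaptation is frozen while both ends are talking.
        return error, list(weights)
    norm = sum(x * x for x in far_history) + eps
    new_weights = [w + mu * error * x / norm for w, x in zip(weights, far_history)]
    return error, new_weights
```

In this sketch the filter still produces an echo-cancelled output during double-talk; only the coefficient update is skipped.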
As can be seen from the above, the double-talk detection result is important for the filtering performance of the adaptive linear filter. In double-talk detection, the far-end voice signal in the far-end signal is generally easy to detect, but because the near-end signal collected by the local microphone may include not only the near-end voice signal but also an echo of the far-end voice signal, it is difficult to accurately detect the near-end voice signal in the near-end signal.
In view of this, the embodiments of the present disclosure provide a double-talk detection method, apparatus, device, and medium, which first acquire a near-end audio signal and a far-end audio signal, and determine whether a near-end voice signal exists in the near-end audio signal by detecting whether a near-end frequency amplitude value exists in a target low-frequency band of the near-end audio signal, so as to obtain a near-end detection result; then detect whether a far-end voice signal exists in the far-end audio signal to obtain a far-end detection result; and finally determine whether a double-talk state exists based on the near-end detection result and the far-end detection result. The scheme of the embodiments of the present disclosure can accurately detect the near-end voice signal and thereby determine the double-talk state.
The application scenario of the embodiments of the present disclosure is described below with reference to the accompanying drawings.
Referring to FIG. 1, an application scenario of a double-talk detection method provided by an embodiment of the present disclosure is shown. The application scenario comprises a first terminal device 100 and a second terminal device 200. The first terminal device 100 and the second terminal device 200 may be connected through a communication network to implement a voice call, a video call, or the like. The first terminal device 100 and the second terminal device 200 each include, but are not limited to, a desktop computer, a mobile phone, a mobile computer, a tablet computer, a media player, a smart wearable device, a smart television, a vehicle-mounted device, a Personal Digital Assistant (PDA), and the like.
When a first user uses the first terminal device 100 to hold a voice call or a video call with a second user, the first terminal device 100 may collect the voice signal of the first user through a microphone and transmit it to the second terminal device 200 through the communication network. The second terminal device 200 may play the received voice signal through a speaker, and the voice signal played by the speaker may be picked up by the microphone of the second terminal device 200, thereby forming an echo signal. In this case, if the second terminal device 200 is taken as the local-end device, the first terminal device 100 is the far-end device; the second terminal device 200 may suppress the echo signal through an adaptive filter, and before suppressing the echo signal, may first detect the speaking state of the two ends through a double-talk detection module.
The double-talk detection method according to an exemplary embodiment of the present disclosure is described below in conjunction with the application scenario of FIG. 1. The above application scenario is shown only to facilitate understanding of the spirit and principles of the present disclosure, and embodiments of the present disclosure are not limited in any way in this respect. Rather, embodiments of the present disclosure may be applied to any applicable scenario.
Referring to FIG. 3, an embodiment of the present disclosure provides a double-talk detection method, which may be applied to a terminal device, such as the second terminal device 200 shown in FIG. 1. The double-talk detection method may include the following steps:
In step S301, a near-end audio signal and a far-end audio signal are acquired.
In the embodiment of the disclosure, the terminal device at the local end may acquire the near-end audio signal collected by the microphone, receive the far-end audio signal sent by the far-end terminal device through the communication network, and play the far-end audio signal through the speaker after receiving it.
In a specific implementation, during a call between the terminal device at the local end and the terminal device at the far end, the near-end audio signal and the far-end audio signal may be acquired in real time; for example, the near-end audio signal may be the frame of audio currently collected by the microphone, and the far-end audio signal may be the frame of audio currently received.
In step S302, it is detected whether the near-end frequency amplitude value exists in the target low-frequency band of the near-end audio signal.
The near-end audio signal collected through the microphone by the terminal device at the local end may include one or both of a near-end voice signal and an echo signal, where the echo signal is formed when the far-end voice signal played by the speaker is picked up by the microphone.
Speakers of current terminal devices generally exhibit low-frequency loss, that is, audio played by the speaker is nonlinearly distorted in a low-frequency band. The echo signal is therefore nonlinearly distorted in the low-frequency band relative to the far-end voice signal, whereas the near-end voice signal collected by the microphone is not. For this reason, embodiments of the present disclosure select a target low-frequency band of the near-end audio signal for detection.
The target low-frequency band can be set as required. Optionally, it is determined according to the low-frequency band in which the speaker of the terminal device produces nonlinear distortion. For example, if the speaker is nonlinearly distorted below 300 Hz, the target low-frequency band may be a band below 300 Hz, such as 90 Hz-300 Hz or 50 Hz-300 Hz; embodiments of the present disclosure are not limited in this respect.
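Once a frame length and sampling rate are fixed, a band such as 90 Hz-300 Hz corresponds to a range of DFT frequency points. The sketch below shows that mapping; the function name is hypothetical, and the 16 kHz sampling rate and 512-sample frame used in the usage note are illustrative assumptions that the text does not specify.

```python
import math


def band_bin_range(sample_rate, frame_len, f_lo, f_hi):
    """Return (first_bin, last_bin) of the DFT bins whose center frequencies
    lie inside [f_lo, f_hi] for a frame of frame_len samples."""
    resolution = sample_rate / frame_len           # Hz per frequency point
    first_bin = math.ceil(f_lo / resolution)       # first bin at or above f_lo
    last_bin = math.floor(f_hi / resolution)       # last bin at or below f_hi
    last_bin = min(last_bin, frame_len // 2)       # never beyond the Nyquist bin
    return first_bin, last_bin
```

With the assumed 16 kHz sampling rate and 512-sample frames, the resolution is 31.25 Hz per frequency point, so `band_bin_range(16000, 512, 90, 300)` selects bins 3 through 9 (93.75 Hz to 281.25 Hz).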
The near-end frequency amplitude value may also be set as desired, which is used to characterize the presence of a speech signal in the near-end audio signal, e.g., if a near-end frequency amplitude value is present, it may be determined that one or both of a near-end speech signal and an echo signal are present in the near-end audio signal. Alternatively, the near-end frequency amplitude value may be a near-end frequency amplitude peak.
In step S302, firstly, a time domain signal of a near-end audio signal is converted into a frequency domain signal, the frequency domain signal is a spectrogram, the abscissa thereof is frequency, the ordinate thereof is frequency amplitude, and then each frequency point in a target low frequency band of the frequency domain signal is sequentially detected to determine whether a near-end frequency amplitude value exists at the target frequency point.
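The conversion described for step S302 can be illustrated with a minimal Python sketch. It is not the implementation of the disclosure: the function names `bin_energies` and `band_bins`, the use of a plain discrete Fourier transform, the squared magnitude as the "energy" of a frequency point, and the 1600 Hz sample rate with 64-sample frames are all assumptions chosen for illustration.

```python
import cmath
import math

def bin_energies(frame):
    """Return the energy |X(m)|^2 of each DFT bin m = 0..N/2 of one real-valued frame."""
    n = len(frame)
    energies = []
    for m in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * cmath.pi * m * t / n)
        energies.append(abs(acc) ** 2)
    return energies

def band_bins(low_hz, high_hz, sample_rate, frame_len):
    """Return the DFT bin indices whose centre frequency lies in [low_hz, high_hz]."""
    resolution = sample_rate / frame_len  # Hz per bin
    first = math.ceil(low_hz / resolution)
    last = math.floor(high_hz / resolution)
    return list(range(first, last + 1))

# Illustrative values only: a 200 Hz tone, 1600 Hz sample rate, 64-sample frame (25 Hz per bin).
SAMPLE_RATE = 1600
FRAME_LEN = 64
frame = [math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(FRAME_LEN)]
energies = bin_energies(frame)
target_bins = band_bins(90, 300, SAMPLE_RATE, FRAME_LEN)  # 90 Hz-300 Hz target band
strongest = max(target_bins, key=lambda m: energies[m])   # bin with the largest energy
```

A production system would normally use a windowed FFT instead of the direct sum shown here; the direct sum only keeps the sketch free of external dependencies.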
In some embodiments, taking the near-end frequency amplitude peak as an example, whether a frequency amplitude peak exists in the target low-frequency band of the near-end audio signal can be detected as follows:
Each frequency point of the target low-frequency band is examined according to the frequency-domain signal of the near-end audio signal. If the energy corresponding to the frequency amplitude value of the target frequency point is greater than both the energy corresponding to the frequency amplitude value of the previous frequency point and that of the next frequency point, and is also greater than a set energy value, the frequency amplitude value of the target frequency point is determined to be a frequency amplitude peak. The set energy value is determined according to the average energy of the background noise of the near-end audio signal and the energy threshold coefficient of the near-end audio signal, for example, as shown in formulas (1) and (2).
Here, k represents the frame number, that is, the k-th frame; $m_0$ denotes a frequency point; $M_k(m_0)$, $M_k(m_0-1)$ and $M_k(m_0+1)$ denote the frequency amplitude values of the near-end audio signal at points $m_0$, $m_0-1$ and $m_0+1$; $E[M_k(m_0)]$, $E[M_k(m_0-1)]$ and $E[M_k(m_0+1)]$ denote the energies corresponding to those frequency amplitude values; $T_0$ represents the set energy value; the output of formula (1) is the judgment result for a frequency amplitude peak at point $m_0$; $E[\text{noise floor NearEnd}]$ represents the average energy of the background noise of the near-end audio signal; and $\alpha$ represents the energy threshold coefficient of the near-end audio signal. $T_0$ is determined by $E[\text{noise floor NearEnd}]$ and $\alpha$; $\alpha$ may be set as required and is not limited herein.
As can be seen from formula (1), if the judgment result is 1, a frequency amplitude peak exists in the target low-frequency band; if it is 0, no frequency amplitude peak exists in the target low-frequency band.
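Formulas (1) and (2) themselves are not reproduced in this text. A plausible reading, reconstructed only from the verbal description and the symbol definitions, is given below. Two points are assumptions: the name $P_k(m_0)$ for the judgment result is a placeholder, since the original symbol did not survive, and the product form of $T_0$ is inferred by analogy with inequality (7), because the text states only that $T_0$ is determined by $E[\text{noise floor NearEnd}]$ and $\alpha$.

```latex
P_k(m_0) =
\begin{cases}
1, & E[M_k(m_0)] > E[M_k(m_0-1)],\ E[M_k(m_0)] > E[M_k(m_0+1)],\ \text{and}\ E[M_k(m_0)] > T_0 \\
0, & \text{otherwise}
\end{cases}
\qquad (1)

T_0 = \alpha \, E[\text{noise floor NearEnd}] \qquad (2)
```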
It should be noted that the average energy $E[\text{noise floor NearEnd}]$ refers to the average energy of the background noise in the near-end audio signal. The background noise may be, for example, device noise or ambient noise during playback; it can be detected in the near-end audio signal, and its average energy can then be obtained.
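The peak test described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the method of the disclosure as filed: the function name `find_low_band_peak` is hypothetical, the set energy value is assumed to be the product of the threshold coefficient and the noise-floor energy, and returning the first peak found is an assumed reading of examining the frequency points in turn.

```python
def find_low_band_peak(energies, band, noise_floor_energy, alpha):
    """Return the first bin in `band` holding a frequency amplitude peak, or None."""
    t0 = alpha * noise_floor_energy  # assumed form of the set energy value T0
    for m in band:
        if m - 1 < 0 or m + 1 >= len(energies):
            continue  # a peak needs both a previous and a next frequency point
        e = energies[m]
        # Peak: larger than both neighbours and larger than the set energy value.
        if e > energies[m - 1] and e > energies[m + 1] and e > t0:
            return m
    return None

# Illustrative per-bin energies: a clear peak at bin 3, and a frame with no peak above T0.
peak = find_low_band_peak([0.1, 0.2, 0.3, 5.0, 0.4, 0.2, 0.1],
                          band=range(1, 6), noise_floor_energy=0.2, alpha=3.0)
quiet = find_low_band_peak([0.1, 0.2, 0.3, 0.5, 0.4, 0.2, 0.1],
                           band=range(1, 6), noise_floor_energy=0.2, alpha=3.0)
```

In the second call the local maximum at bin 3 is rejected because its energy does not exceed the set energy value, so the frame is treated as containing no near-end frequency amplitude peak.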
Step S303, based on the detection result of the near-end frequency amplitude value, it is determined whether the near-end voice signal exists in the near-end audio signal, so as to obtain a near-end detection result.
In the step, if the detection result of the near-end frequency amplitude value is that the near-end frequency amplitude value does not exist, it can be determined that the near-end voice signal does not exist in the near-end audio signal; if the near-end frequency amplitude value exists, whether nonlinear distortion exists in the target low-frequency band can be determined according to the near-end frequency amplitude value. If the target low-frequency band has nonlinear distortion, the near-end voice signal does not exist in the near-end audio signal; if the target low-frequency band does not have nonlinear distortion, a near-end voice signal exists in the near-end audio signal.
Step S304, detecting whether a far-end voice signal exists in the far-end audio signal to obtain a far-end detection result.
In one possible implementation, a voice activity detection (VAD) algorithm may be employed to detect a far-end voice signal in the far-end audio signal. The VAD algorithm detects the energy of the far-end audio signal and determines voice regions and silent regions from the energy detection result. If a voice region exists, a far-end voice signal is present in the far-end audio signal; if no voice region exists, no far-end voice signal is present.
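As a minimal illustration of energy-based voice activity detection, the sketch below classifies a single frame by its mean energy. The function name `far_end_has_speech`, the per-frame decision and the fixed threshold are assumptions; the source does not specify a particular VAD algorithm, and practical detectors usually add smoothing and adaptive thresholds.

```python
def far_end_has_speech(frame, energy_threshold):
    """Energy-based VAD: True when the frame's mean energy exceeds the threshold."""
    if not frame:
        return False  # an empty frame carries no speech
    mean_energy = sum(x * x for x in frame) / len(frame)
    return mean_energy > energy_threshold

# Illustrative frames (sample values are made up).
silence = [0.001, -0.002, 0.001, 0.0]
speech = [0.3, -0.4, 0.5, -0.2]
speech_detected = far_end_has_speech(speech, energy_threshold=0.01)
silence_detected = far_end_has_speech(silence, energy_threshold=0.01)
```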
Step S305, based on the near-end detection result and the far-end detection result, it is determined whether the call is in a double-talk state.
This step may specifically include the following four cases.
First case: if a far-end voice signal exists in the far-end audio signal and a near-end voice signal exists in the near-end audio signal, it is determined that the call is currently in the double-talk state.
Second case: if a far-end voice signal exists in the far-end audio signal and no near-end voice signal exists in the near-end audio signal, it is determined that the call is currently in the far-end single-talk state.
Third case: if no far-end voice signal exists in the far-end audio signal and a near-end voice signal exists in the near-end audio signal, it is determined that the call is currently in the near-end single-talk state.
Fourth case: if no far-end voice signal exists in the far-end audio signal and no near-end voice signal exists in the near-end audio signal, it is determined that nobody is currently speaking at either the near end or the far end.
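The four cases amount to a two-input decision table, sketched below in Python. The function name `talk_state` and the returned label strings are hypothetical and chosen only for readability.

```python
def talk_state(far_end_speech, near_end_speech):
    """Map the far-end and near-end detection results to one of the four cases."""
    if far_end_speech and near_end_speech:
        return "double-talk"           # first case
    if far_end_speech:
        return "far-end single-talk"   # second case
    if near_end_speech:
        return "near-end single-talk"  # third case
    return "silence"                   # fourth case: nobody is speaking

states = [talk_state(f, n) for f in (True, False) for n in (True, False)]
```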
In the embodiment of the disclosure, whether the target low frequency band of the near-end audio signal has nonlinear distortion is determined by detecting whether the target low frequency band of the near-end audio signal has a near-end frequency amplitude value, so that whether the near-end audio signal has a near-end voice signal can be determined. Thus, the embodiment of the disclosure can accurately detect the near-end voice signal. Further, the double talk state may be determined according to the detection result of the near-end voice signal and the detection result of the far-end voice signal.
An exemplary description is given below of the implementation of step S303 in the above-described embodiments of the present disclosure.
In some embodiments, as shown in fig. 4, the determining whether the near-end audio signal has the near-end speech signal in step S303 based on the detection result of the near-end frequency amplitude value may include the following steps:
Step S3031, if the near-end frequency amplitude value exists, the far-end frequency amplitude value corresponding to the target frequency point is detected in the target low-frequency band of the far-end audio signal; the target frequency point is the frequency point corresponding to the near-end frequency amplitude value.
In this step, if a near-end frequency amplitude value exists in the target low-frequency band of the near-end audio signal, this indicates that a speech signal is present in the near-end audio signal, and the speech signal may include one or both of a near-end speech signal and an echo signal.
If the near-end audio signal includes only an echo signal, then when the near-end frequency amplitude value is a near-end frequency amplitude peak, the far-end frequency amplitude value corresponding to the target frequency point is also a far-end frequency amplitude peak. If the near-end audio signal includes a near-end speech signal, then when the near-end frequency amplitude value is a near-end frequency amplitude peak, the far-end frequency amplitude value corresponding to the target frequency point may or may not be a far-end frequency amplitude peak.
In step S3032, the energy corresponding to the near-end frequency amplitude value is compared with the energy corresponding to the far-end frequency amplitude value, and whether the near-end voice signal exists in the near-end audio signal is determined according to the comparison result.
In this step, to determine whether a near-end speech signal is present in the speech signal, a near-end frequency amplitude value of the near-end audio signal may be compared with a far-end frequency amplitude value of the far-end audio signal at a target frequency point to determine whether a nonlinear distortion is present in the speech signal in the near-end audio signal. If nonlinear distortion exists, it can be determined that the near-end speech signal does not exist in the near-end audio signal; if no nonlinear distortion is present, it may be determined that a near-end speech signal is present in the near-end audio signal.
Further, if the target low-frequency band of the near-end audio signal does not have the near-end frequency amplitude value, determining that the near-end audio signal does not have the near-end voice signal.
Based on the above steps, according to the embodiment of the disclosure, by comparing the energy corresponding to the near-end frequency amplitude value of the target low-frequency band with the energy corresponding to the far-end frequency amplitude value of the target low-frequency band, whether the near-end audio signal has nonlinear distortion in the target low-frequency band can be determined, and further, the near-end speech signal can be accurately detected.
In the embodiment of the present disclosure, the step S3032 may be one of the following two implementations.
A first possible implementation: step S3032 may include the following steps a-c:
a. The difference between the energy corresponding to the near-end frequency amplitude value and the energy corresponding to the far-end frequency amplitude value is compared with a first threshold value.
b. If the difference is not greater than the first threshold value, it is determined that no near-end speech signal exists in the near-end audio signal.
c. If the difference is greater than the first threshold value, it is determined that a near-end speech signal exists in the near-end audio signal.
In the above steps, the first threshold value represents the degree of nonlinear distortion and may be set according to the actual application scenario. If the difference between the energy corresponding to the near-end frequency amplitude value and the energy corresponding to the far-end frequency amplitude value is not greater than the first threshold value, this indicates that nonlinear distortion exists in the target low-frequency band of the near-end audio signal, that is, no near-end speech signal exists in the near-end audio signal; if the difference is greater than the first threshold value, this indicates that the target low-frequency band of the near-end audio signal has no nonlinear distortion, that is, a near-end speech signal exists in the near-end audio signal.
Illustratively, taking point $m_0$ in formula (1) above as the frequency point corresponding to the near-end frequency amplitude value, steps a-c can be implemented by formula (3).
Here, the output of formula (3) represents the result of comparing the difference with the first threshold value, and the first threshold value may be set as required. Optionally, provided that the near-end audio signal and the far-end audio signal are not affected by ambient noise or by the performance of the hardware (microphone, speaker, etc.) of the terminal device, the first threshold value may be set to 0; other values, such as 0.5, 1 or 1.5, may of course also be used, without limitation.
As can be seen from formula (3), an output of 1 indicates that no near-end voice signal exists in the near-end audio signal, and an output of 0 indicates that a near-end voice signal exists in the near-end audio signal.
In this embodiment, by computing the difference between the energy corresponding to the near-end frequency amplitude value and the energy corresponding to the far-end frequency amplitude value and comparing that difference with the first threshold value, the near-end voice signal in the near-end audio signal can be accurately detected, and hence the double-talk state can be accurately detected.
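Steps a-c of the first implementation can be sketched in Python as follows. The function name `near_speech_by_difference` is hypothetical, and taking the difference as near-end energy minus far-end energy is an assumption based on the order in which the two energies are named in the text. The default threshold of 0 is the value the text suggests for the case without ambient-noise or hardware influence.

```python
def near_speech_by_difference(near_energy, far_energy, first_threshold=0.0):
    """Steps a-c: near-end speech is present only if (near - far) > first_threshold."""
    difference = near_energy - far_energy  # assumed orientation: near minus far
    return difference > first_threshold

# Illustrative energies at the target frequency point (values are made up).
echo_only = near_speech_by_difference(near_energy=2.0, far_energy=5.0)    # attenuated echo
with_speech = near_speech_by_difference(near_energy=6.0, far_energy=5.0)  # extra near-end energy
```

Note that the return value has the opposite sense to the output of formula (3) as described: the function returns True when near-end speech is present, whereas formula (3) outputs 1 when it is absent.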
A second possible implementation: step S3032 may include the following steps a-c:
a. The ratio of the energy corresponding to the near-end frequency amplitude value to the energy corresponding to the far-end frequency amplitude value is compared with a second threshold value.
b. If the ratio is not greater than the second threshold value, it is determined that no near-end speech signal exists in the near-end audio signal.
c. If the ratio is greater than the second threshold value, it is determined that a near-end speech signal exists in the near-end audio signal.
In the above steps, the second threshold value represents the degree of nonlinear distortion and may be set according to the actual application scenario. If the ratio of the energy corresponding to the near-end frequency amplitude value to the energy corresponding to the far-end frequency amplitude value is not greater than the second threshold value, this indicates that nonlinear distortion exists in the target low-frequency band of the near-end audio signal, that is, no near-end speech signal exists in the near-end audio signal; if the ratio is greater than the second threshold value, this indicates that the target low-frequency band of the near-end audio signal has no nonlinear distortion, that is, a near-end speech signal exists in the near-end audio signal.
Illustratively, taking point $m_0$ in formula (1) above as the frequency point corresponding to the near-end frequency amplitude value, steps a-c can be implemented by formula (4).
Here, the output of formula (4) represents the result of comparing the ratio with the second threshold value, and the second threshold value may be set as required. Optionally, provided that the near-end audio signal and the far-end audio signal are not affected by ambient noise or by the performance of the hardware (microphone, speaker, etc.) of the terminal device, the second threshold value may be set to 1; other values, such as 1.1 or 1.2, may also be used, without limitation.
As can be seen from formula (4), an output of 1 indicates that no near-end voice signal exists in the near-end audio signal, and an output of 0 indicates that a near-end voice signal exists in the near-end audio signal.
In this embodiment, by comparing the ratio of the energy corresponding to the near-end frequency amplitude value to the energy corresponding to the far-end frequency amplitude value with the second threshold value, the near-end voice signal in the near-end audio signal can be accurately detected, and hence the double-talk state can be accurately detected.
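The ratio-based variant can be sketched in the same way. The function name `near_speech_by_ratio` is hypothetical, and the handling of a zero far-end energy is an added assumption that the source does not discuss; it is needed only to keep the division well defined.

```python
def near_speech_by_ratio(near_energy, far_energy, second_threshold=1.0):
    """Steps a-c: near-end speech is present only if near / far > second_threshold."""
    if far_energy <= 0.0:
        # Assumption: with no far-end energy at this frequency point,
        # a near-end peak cannot be echo, so it is treated as near-end speech.
        return near_energy > 0.0
    return near_energy / far_energy > second_threshold

# Illustrative energies at the target frequency point (values are made up).
echo_only = near_speech_by_ratio(near_energy=2.0, far_energy=5.0)    # ratio 0.4
with_speech = near_speech_by_ratio(near_energy=6.0, far_energy=5.0)  # ratio 1.2
```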
Further, in an actual call the near-end audio signal and the far-end audio signal may be affected by environmental noise or by the performance of the hardware (such as the microphone and speaker) of the terminal device, which affects the judgment of nonlinear distortion of the near-end audio signal. To detect the low-frequency distortion characteristic of the near-end audio signal more accurately, the first threshold value or the second threshold value may be updated according to the actual call scene; that is, the degree of nonlinear distortion that the near-end audio signal needs to reach is updated. Judging whether a near-end voice signal exists in the near-end audio signal based on the updated first or second threshold value can further improve the detection accuracy of the near-end voice signal.
In some embodiments, updating the first threshold value or the second threshold value may include the following steps (1) and (2):
(1) Determining whether an update condition of the first threshold value or the second threshold value is reached based on one or more of: the energy corresponding to the near-end frequency amplitude value, the energy corresponding to the far-end frequency amplitude value, and the energy mean square error of the far-end audio signal. The energy mean square error of the far-end audio signal is the mean square error of the energies, at the target frequency point, of the far-end audio signals obtained within a set time period.
Each far-end audio signal may be a frame of audio acquired in real time, and the far-end audio signals are those obtained within a set time period that includes the current time. The update condition of the first threshold value and the update condition of the second threshold value may be the same.
In an alternative embodiment, the update condition of the first threshold value or the second threshold value may include one or more of the following inequality (5), inequality (6), inequality (7):
$E[A_k(m_0)] > E[M_k(m_0)]$ (5)
$\mathrm{STD}\{E[A_{k-n}(m_0)], E[A_{k-n+1}(m_0)], \ldots, E[A_k(m_0)]\} < E_{std}$ (6)
$E[A_k(m_0)] > \beta E[\text{noise floor FarEnd}]$ (7)
Here, $m_0$ is the target frequency point; $k-n, k-n+1, \ldots, k$ are the frame numbers corresponding to the far-end audio signals, where k is the current frame number, n is an integer greater than or equal to 0, and k is greater than n; $A_k(m_0)$ is the far-end frequency amplitude value of the k-th frame of the far-end audio signal at $m_0$; $M_k(m_0)$ is the near-end frequency amplitude value of the k-th frame of the near-end audio signal at $m_0$; $E[A_k(m_0)]$ is the energy corresponding to the far-end frequency amplitude value of the k-th frame of the far-end audio signal at $m_0$; $E[M_k(m_0)]$ is the energy corresponding to the near-end frequency amplitude value of the k-th frame of the near-end audio signal at $m_0$; $\mathrm{STD}\{\cdot\}$ denotes the mean square error (that is, the standard deviation); $E_{std}$ is the energy mean square error threshold of the far-end audio signal; $\beta$ is the energy threshold coefficient of the far-end audio signal; and $E[\text{noise floor FarEnd}]$ is the average energy of the background noise of the far-end audio signal.
It should be noted that $E_{std}$ and $\beta$ may be set as required and are not limited herein. The average energy $E[\text{noise floor FarEnd}]$ refers to the average energy of the background noise in the far-end audio signal.
When one or more of the inequality (5), inequality (6) and inequality (7) is satisfied, in order to more accurately detect whether the nonlinear distortion exists in the target low frequency band of the near-end audio signal, the first threshold value or the second threshold value may be updated to update the degree of nonlinear distortion that needs to be achieved in the target low frequency band of the near-end audio signal.
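The three update conditions can be evaluated per frame as in the sketch below. The function name `update_conditions` and its argument names are hypothetical. Using the population standard deviation (`statistics.pstdev`) for STD is an assumption, and so is returning the three results separately, since the text says only that "one or more" of the inequalities may form the update condition and leaves the combination open.

```python
import statistics

def update_conditions(far_history, near_energy, far_noise_floor, beta, e_std):
    """Evaluate inequalities (5), (6) and (7) for the current frame.

    far_history holds the far-end energies at the target frequency point for
    frames k-n ... k, oldest first, so its last entry belongs to frame k.
    """
    far_energy = far_history[-1]
    cond5 = far_energy > near_energy                # (5): far end stronger than near end
    cond6 = statistics.pstdev(far_history) < e_std  # (6): far-end energy is stable
    cond7 = far_energy > beta * far_noise_floor     # (7): far end is above its noise floor
    return cond5, cond6, cond7

# Illustrative values only.
stable = update_conditions([4.0, 4.2, 4.1], near_energy=3.0,
                           far_noise_floor=0.5, beta=2.0, e_std=0.5)
unstable = update_conditions([1.0, 9.0], near_energy=3.0,
                             far_noise_floor=0.5, beta=2.0, e_std=0.5)
```

A caller that requires all three conditions could test `all(stable)`, while one that accepts any single condition could test `any(stable)`.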
(2) And if the updating condition of the first threshold value or the second threshold value is met, updating the first threshold value or the second threshold value.
In an alternative embodiment, the updating of the first threshold value may be implemented as follows.
Updating the first threshold value by the following equation (8):
where the first threshold value is the quantity being updated; $\gamma$ is a set coefficient that can be chosen as required; $E[A_k(m_0)]$ is the energy corresponding to the far-end frequency amplitude value; and $E[M_k(m_0)]$ is the energy corresponding to the near-end frequency amplitude value.
In this embodiment, the updating of the first threshold value may be implemented by the formula (8) to update the nonlinear distortion degree that needs to be achieved by the target low frequency band of the near-end audio signal, so as to adapt to a complex call scenario, and more accurately detect whether nonlinear distortion exists in the target low frequency band of the near-end audio signal.
In another alternative embodiment, the updating of the second threshold value may be achieved as follows.
The second threshold value is updated by the following expression (9):
where the second threshold value is the quantity being updated; $\gamma$ is a set coefficient that can be chosen as required; $E[A_k(m_0)]$ is the energy corresponding to the far-end frequency amplitude value; and $E[M_k(m_0)]$ is the energy corresponding to the near-end frequency amplitude value.
In this embodiment, the second threshold value may be updated by the formula (9), and the nonlinear distortion degree to be achieved by the target low frequency band of the near-end audio signal may be updated, so as to adapt to a complex call scenario, and more accurately detect whether nonlinear distortion exists in the target low frequency band of the near-end audio signal.
The following describes a double-talk detection module that implements the double-talk detection method of the embodiments of the present disclosure, with reference to the schematic structural diagram, shown in fig. 5, of acoustic echo cancellation (AEC) based on double-talk detection.
As shown in fig. 5, x(n) represents the far-end signal transmitted from the far end to the local end, d(n) represents the near-end signal collected by the local microphone, v(n) represents the near-end voice signal, and y(n) represents the echo signal; d(n) may contain both v(n) and y(n), or only one of them. The double-talk detection module 51 includes a far-end voice detection module 511, a decision module 512, and a near-end voice detection module 513; it detects the far-end voice signal in x(n) through the far-end voice detection module 511 and the near-end voice signal in d(n) through the near-end voice detection module 513.
The far-end voice detection module 511 and the near-end voice detection module 513 each send their detection results to the decision module 512, and the decision module 512 determines the speaking states of the two ends from those results. For example, if both ends speak at the same time, the call is in the double-talk state; if only one end speaks, it is in a single-talk state. The decision module 512 sends the speaking states of the two ends to the adaptive algorithm module 52. The adaptive algorithm module 52 determines the filtering parameters of the adaptive filter 53 according to the speaking states and sends them to the adaptive filter 53, which then filters according to these parameters to suppress the echo signal.
The far-end voice detection module 511 may detect the far-end voice signal by using a voice activity detection algorithm, and the near-end voice detection module 513 may use the above-mentioned method for detecting the near-end voice signal according to the embodiments of the present disclosure.
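How the speaking state might steer the adaptive filter can be illustrated with a normalized least-mean-squares (NLMS) update. The source does not say which adaptive algorithm or which policy the adaptive algorithm module 52 uses, so everything here is an assumption: NLMS is swapped in as a common choice, the policy of adapting only during far-end single-talk is a common echo-canceller practice rather than something stated in the text, and the function names and state labels are hypothetical.

```python
def nlms_step_size(state, mu_adapt=0.5):
    """Pick an NLMS step size from the talk state (an assumed policy, not from the source)."""
    # Adapt only while the far end talks alone, so near-end speech cannot
    # pull the echo-path estimate away from the true echo path.
    return mu_adapt if state == "far-end single-talk" else 0.0

def nlms_update(weights, far_taps, mic_sample, mu, eps=1e-8):
    """One NLMS iteration; returns (new_weights, echo-cancelled error sample)."""
    echo_estimate = sum(w * x for w, x in zip(weights, far_taps))
    error = mic_sample - echo_estimate
    norm = sum(x * x for x in far_taps) + eps  # eps avoids division by zero
    new_weights = [w + mu * error * x / norm for w, x in zip(weights, far_taps)]
    return new_weights, error

# Illustrative single update with a two-tap filter (values are made up).
w_adapted, err = nlms_update([0.0, 0.0], [1.0, 0.0], mic_sample=0.5,
                             mu=nlms_step_size("far-end single-talk"))
w_frozen, _ = nlms_update([0.0, 0.0], [1.0, 0.0], mic_sample=0.5,
                          mu=nlms_step_size("double-talk"))
```

With a step size of zero the weights are left unchanged, which corresponds to freezing the filter while both ends are speaking.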
Tests of the double-talk detection method of the embodiment of the disclosure on terminal devices in the field of real-time communication (such as mobile phones and notebook computers) show that the accuracy of double-talk detection is greatly improved and that both the false-detection rate and the missed-detection rate are low.
In addition, the double-talk detection method of the embodiment of the disclosure can be applied to various complex scenes and various terminal equipment, so that the robustness of double-talk detection is improved, and the double-talk detection method is particularly applicable to the field of real-time communication; and the method has small calculation cost and is more suitable for being applied to personal terminal equipment.
Furthermore, by combining the accurate double-talk detection result with the adaptive filter for echo suppression, both the echo suppression effect and the clarity of speech during double talk can be greatly improved.
Based on the same inventive concept, the embodiments of the present disclosure further provide a double-talk detection device. Because the device solves the problem on a principle similar to that of the method in the above embodiments, its implementation may refer to the implementation of the method, and repeated description is omitted. Referring to fig. 6, a double-talk detection device provided in an embodiment of the present disclosure includes an audio acquisition module 61, a low-frequency band detection module 62, a near-end voice detection module 63, a far-end voice detection module 64, and a double-talk detection module 65.
The audio acquisition module 61 is configured to acquire a near-end audio signal and a far-end audio signal;
The low-frequency band detection module 62 is configured to detect whether a near-end frequency amplitude value exists in a target low-frequency band of the near-end audio signal;
The near-end voice detection module 63 is configured to determine whether a near-end voice signal exists in the near-end audio signal based on the detection result of the near-end frequency amplitude value, so as to obtain a near-end detection result;
The far-end voice detection module 64 is configured to detect whether a far-end voice signal exists in the far-end audio signal, so as to obtain a far-end detection result;
The double-talk detection module 65 is configured to determine whether the call is in a double-talk state based on the near-end detection result and the far-end detection result.
In an alternative embodiment, as shown in fig. 7, the near-end voice detection module 63 may further include:
the detection sub-module 631 is configured to detect, if a near-end frequency amplitude value exists, a far-end frequency amplitude value corresponding to a target frequency point in a target low-frequency band of the far-end audio signal; the target frequency point is a frequency point corresponding to the near-end frequency amplitude value;
the comparing sub-module 632 is configured to compare the energy corresponding to the near-end frequency amplitude value with the energy corresponding to the far-end frequency amplitude value, and determine whether the near-end voice signal exists in the near-end audio signal according to the comparison result.
In an alternative embodiment, the comparison sub-module 632 may also be configured to:
Comparing the difference value between the energy corresponding to the near-end frequency amplitude value and the energy corresponding to the far-end frequency amplitude value with a first threshold value;
if the difference value is not greater than the first threshold value, determining that the near-end voice signal does not exist in the near-end audio signal;
If the difference value is larger than the first threshold value, determining that a near-end voice signal exists in the near-end audio signal; or alternatively
Comparing the ratio of the energy corresponding to the near-end frequency amplitude value to the energy corresponding to the far-end frequency amplitude value with a second threshold value;
If the ratio is not greater than the second threshold value, determining that the near-end voice signal does not exist in the near-end audio signal;
if the ratio is greater than the second threshold, it is determined that a near-end speech signal is present in the near-end audio signal.
In an alternative embodiment, the apparatus may further include a threshold updating module configured to:
Determining whether an update condition of the first threshold value or the second threshold value is reached based on one or more of energy corresponding to the near-end frequency amplitude value, energy corresponding to the far-end frequency amplitude value, and energy mean square error of the far-end audio signal; the energy mean square error of the far-end audio signals is the energy mean square error of each obtained far-end audio signal at the target frequency point in a set time period;
and if the updating condition of the first threshold value or the second threshold value is met, updating the first threshold value or the second threshold value.
In an alternative embodiment, the update condition of the first threshold value or the second threshold value comprises one or more of the following:
$E[A_k(m_0)] > E[M_k(m_0)]$
$\mathrm{STD}\{E[A_{k-n}(m_0)], E[A_{k-n+1}(m_0)], \ldots, E[A_k(m_0)]\} < E_{std}$
$E[A_k(m_0)] > \beta E[\text{noise floor FarEnd}]$
Here, $m_0$ is the target frequency point; $k-n, k-n+1, \ldots, k$ are the frame numbers corresponding to the far-end audio signals, where k is the current frame number, n is an integer greater than or equal to 0, and k is greater than n; $A_k(m_0)$ is the far-end frequency amplitude value; $M_k(m_0)$ is the near-end frequency amplitude value; $E[A_k(m_0)]$ is the energy corresponding to the far-end frequency amplitude value; $E[M_k(m_0)]$ is the energy corresponding to the near-end frequency amplitude value; $\mathrm{STD}\{\cdot\}$ denotes the mean square error; $E_{std}$ is the energy mean square error threshold of the far-end audio signal; $\beta$ is the energy threshold coefficient of the far-end audio signal; and $E[\text{noise floor FarEnd}]$ is the average energy of the background noise of the far-end audio signal.
In an alternative embodiment, the threshold update module may be further configured to:
The first threshold value is updated by the following formula:
where the quantity being updated is the first threshold value; $\gamma$ is a set coefficient; $E[A_k(m_0)]$ is the energy corresponding to the far-end frequency amplitude value; and $E[M_k(m_0)]$ is the energy corresponding to the near-end frequency amplitude value; or alternatively
The second threshold value is updated by the following formula:
where the quantity being updated is the second threshold value.
Based on the same inventive concept, the embodiments of the present disclosure further provide an electronic device, which has a similar principle of solving the problem as the method of the foregoing embodiments, so that the implementation of the electronic device may refer to the implementation of the method, and the repetition is omitted. Fig. 8 shows a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Referring to fig. 8, an electronic device may include a processor 802 and a memory 801. The memory 801 provides program instructions and data stored in the memory 801 to the processor 802. In the disclosed embodiment, the memory 801 may be used to store the program of the double talk detection in the disclosed embodiment.
The processor 802 is configured to execute the method of any of the above method embodiments, such as the double-talk detection method provided by the embodiment shown in fig. 2, by invoking the program instructions stored in the memory 801.
The specific connection medium between the memory 801 and the processor 802 is not limited in the embodiments of the present disclosure. In fig. 8, the memory 801 and the processor 802 are connected by a bus 803, which is drawn as a thick line; the connections between other components are merely illustrative and not limiting. The bus 803 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean that there is only one bus or only one type of bus.
The memory may include read-only memory (ROM) and random access memory (RAM), and may also include non-volatile memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit, a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
The disclosed embodiments also provide a computer-readable storage medium storing a computer program. A processor of a computer device reads the computer program from the computer-readable storage medium and executes it, so that the computer device performs the double-talk detection method in any of the above-described method embodiments.
In a specific implementation, the computer storage medium may include any medium that can store program code, such as a universal serial bus (USB) flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In some possible embodiments, aspects of the double-talk detection method provided by the present disclosure may also be implemented in the form of a program product that includes program code. When the program product is run on a computer device, the program code causes the computer device to perform the steps of the double-talk detection method according to the various exemplary embodiments of the present disclosure described above; for example, the computer device may perform the double-talk detection flow of steps S301 to S305 shown in fig. 3.
It will be apparent to those skilled in the art that embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present disclosure without departing from the spirit or scope of the disclosure. Thus, the present disclosure is intended to include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (12)

1. A double-talk detection method, comprising:
Acquiring a near-end audio signal and a far-end audio signal;
Detecting whether a near-end frequency amplitude value exists in a target low-frequency band of the near-end audio signal; the target low frequency band is determined according to the low frequency band of the loudspeaker with nonlinear distortion;
If the near-end frequency amplitude value exists, detecting a far-end frequency amplitude value corresponding to a target frequency point in the target low-frequency band of the far-end audio signal; the target frequency point is a frequency point corresponding to the near-end frequency amplitude value;
comparing the energy corresponding to the near-end frequency amplitude value with the energy corresponding to the far-end frequency amplitude value, and determining whether nonlinear distortion exists in a target low-frequency band of the near-end audio signal according to a comparison result;
If nonlinear distortion exists, determining that a near-end voice signal does not exist in the near-end audio signal, and if nonlinear distortion does not exist, determining that a near-end voice signal exists in the near-end audio signal so as to obtain a near-end detection result;
Detecting whether a far-end voice signal exists in the far-end audio signal or not to obtain a far-end detection result;
determining whether a double-talk state currently exists based on the near-end detection result and the far-end detection result;
The comparing the energy corresponding to the near-end frequency amplitude value with the energy corresponding to the far-end frequency amplitude value, and determining whether the nonlinear distortion exists in the target low-frequency band of the near-end audio signal according to the comparison result comprises the following steps:
comparing the difference value of the energy corresponding to the near-end frequency amplitude value and the energy corresponding to the far-end frequency amplitude value with a first threshold value;
if the difference value is not greater than the first threshold value, determining that nonlinear distortion exists in a target low-frequency band of the near-end audio signal;
if the difference value is larger than the first threshold value, determining that nonlinear distortion does not exist in the target low-frequency band of the near-end audio signal; or alternatively
Comparing the ratio of the energy corresponding to the near-end frequency amplitude value to the energy corresponding to the far-end frequency amplitude value with a second threshold value;
if the ratio is not greater than the second threshold value, determining that nonlinear distortion exists in the target low-frequency band of the near-end audio signal;
If the ratio is greater than the second threshold value, determining that nonlinear distortion does not exist in the target low-frequency band of the near-end audio signal;
the method further comprises the steps of:
Determining whether an updating condition of the first threshold value or the second threshold value is reached based on the energy corresponding to the near-end frequency amplitude value, the energy corresponding to the far-end frequency amplitude value and the energy mean square error of the far-end audio signal; the energy mean square error of the far-end audio signals is the energy mean square error of each obtained far-end audio signal at the target frequency point in a set time period;
and if the updating condition of the first threshold value or the second threshold value is met, updating the first threshold value or the second threshold value.
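The comparison recited in claim 1 can be illustrated with a minimal sketch. It is not the disclosed implementation: the function names, the use of squared magnitude as the "energy" of a frequency bin, the direction of the difference (near-end minus far-end), and the handling of a zero far-end energy in the ratio test are all assumptions made for illustration.

```python
from typing import Optional


def bin_energy(amplitude: float) -> float:
    """Energy of one frequency bin; squared magnitude is an assumption."""
    return amplitude * amplitude


def has_nonlinear_distortion(near_amp: float, far_amp: float,
                             threshold: float, mode: str = "difference") -> bool:
    """Return True when the comparison measure is not greater than the threshold."""
    near_e = bin_energy(near_amp)
    far_e = bin_energy(far_amp)
    if mode == "difference":
        # Assumed direction: near-end energy minus far-end energy.
        measure = near_e - far_e
    elif mode == "ratio":
        if far_e == 0.0:
            # Assumption: with no far-end energy the ratio is treated as
            # unbounded, so it exceeds any finite threshold.
            return False
        measure = near_e / far_e
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return measure <= threshold


def near_end_speech_present(near_amp: Optional[float], far_amp: float,
                            threshold: float, mode: str = "difference") -> bool:
    """Near-end decision: speech is present only if no distortion is found."""
    if near_amp is None:
        # No amplitude value found in the target low-frequency band (claim 4).
        return False
    return not has_nonlinear_distortion(near_amp, far_amp, threshold, mode)
```

With the difference test and a threshold of 1.0, `near_end_speech_present(2.0, 1.0, 1.0)` evaluates 4.0 - 1.0 = 3.0, which is greater than 1.0, so it reports near-end speech; `near_end_speech_present(1.0, 1.0, 1.0)` evaluates 0.0, which is not greater than 1.0, so it reports nonlinear distortion and no near-end speech.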
2. The method of claim 1, wherein the update condition of the first threshold value or the second threshold value comprises:
$E[A_k(m_0)] > E[M_k(m_0)]$
$\mathrm{STD}\{E[A_{k-n}(m_0)], E[A_{k-n+1}(m_0)], \ldots, E[A_k(m_0)]\} < E_{std}$
$E[A_k(m_0)] > \beta E[\text{noise floor FarEnd}]$
Wherein $m_0$ is the target frequency point; $k-n, k-n+1, \ldots, k$ are the frame numbers corresponding to the far-end audio signals, $k$ is the current frame number, $n$ is an integer greater than or equal to 0, and $k$ is greater than $n$; $A_k(m_0)$ is the far-end frequency amplitude value; $M_k(m_0)$ is the near-end frequency amplitude value; $E[A_k(m_0)]$ is the energy corresponding to the far-end frequency amplitude value; $E[M_k(m_0)]$ is the energy corresponding to the near-end frequency amplitude value; $\mathrm{STD}\{\cdot\}$ represents the mean square error; $E_{std}$ is the energy mean square error threshold of the far-end audio signal; $\beta$ is the energy threshold coefficient of the far-end audio signal; $E[\text{noise floor FarEnd}]$ is the average energy of the background noise of the far-end audio signal.
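As a hedged illustration of the three inequalities in claim 2, the sketch below checks them for a short history of far-end energies at the target frequency point. The function and parameter names are hypothetical, the use of the population standard deviation for STD{ } is an assumption, and so is the reading that all three inequalities must hold together.

```python
import statistics
from typing import Sequence


def threshold_update_due(far_energies: Sequence[float], near_energy: float,
                         e_std: float, beta: float,
                         far_noise_floor: float) -> bool:
    """Check the update condition for frames k-n .. k.

    far_energies holds the far-end energies at the target frequency point,
    oldest first; its last element belongs to the current frame k.
    """
    if not far_energies:
        raise ValueError("at least the current frame is required")
    current_far = far_energies[-1]
    # Far-end energy exceeds near-end energy at the target frequency point.
    far_dominates = current_far > near_energy
    # Far-end energy is stable over the window (assumed population std).
    is_stable = statistics.pstdev(far_energies) < e_std
    # Far-end energy is well above the far-end background noise.
    above_noise = current_far > beta * far_noise_floor
    return far_dominates and is_stable and above_noise
```

For instance, a window of far-end energies `[4.0, 4.1, 3.9]` with a near-end energy of 1.0, `e_std = 0.5`, `beta = 2.0`, and a noise-floor energy of 0.5 satisfies all three inequalities, so the function returns `True`.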
3. The method according to claim 1 or 2, wherein said updating of said first threshold value or said second threshold value comprises:
updating the first threshold value by the following formula:
wherein, Is the first threshold value; $\gamma$ is a set coefficient; $E[A_k(m_0)]$ is the energy corresponding to the far-end frequency amplitude value; $E[M_k(m_0)]$ is the energy corresponding to the near-end frequency amplitude value; or alternatively
Updating the second threshold value by the following formula:
wherein, Is the second threshold value.
4. The method of claim 1, wherein determining whether a near-end speech signal is present in the near-end audio signal based on the detection of the near-end frequency amplitude value further comprises:
and if the near-end frequency amplitude value does not exist in the target low-frequency band of the near-end audio signal, determining that the near-end voice signal does not exist in the near-end audio signal.
5. The method according to claim 1 or 2, wherein the determining whether a double-talk state currently exists based on the near-end detection result and the far-end detection result comprises:
if the far-end voice signal exists in the far-end audio signal and the near-end voice signal exists in the near-end audio signal, determining that a double-talk state currently exists; or alternatively
if the far-end voice signal exists in the far-end audio signal and the near-end voice signal does not exist in the near-end audio signal, determining that a far-end single-talk state currently exists; or alternatively
if the far-end voice signal does not exist in the far-end audio signal and the near-end voice signal exists in the near-end audio signal, determining that a near-end single-talk state currently exists.
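The three cases of claim 5 amount to a small decision table, sketched below. The enum and function names are hypothetical, and the fourth case (no voice signal at either end) is not recited in the claim; the `NO_TALK` label for it is an assumption added only so that the function is total.

```python
from enum import Enum


class TalkState(Enum):
    DOUBLE_TALK = "double-talk"
    FAR_END_SINGLE_TALK = "far-end single-talk"
    NEAR_END_SINGLE_TALK = "near-end single-talk"
    NO_TALK = "no talk"  # assumption: this case is not recited in claim 5


def classify_talk_state(far_speech: bool, near_speech: bool) -> TalkState:
    """Map the far-end and near-end detection results to a talk state."""
    if far_speech and near_speech:
        return TalkState.DOUBLE_TALK
    if far_speech:
        return TalkState.FAR_END_SINGLE_TALK
    if near_speech:
        return TalkState.NEAR_END_SINGLE_TALK
    return TalkState.NO_TALK
```

Keeping the decision in one pure function makes it easy to feed it with the near-end and far-end detection results of each frame, whatever detectors produce them.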
6. A double-talk detection device, comprising:
the audio acquisition module is used for acquiring a near-end audio signal and a far-end audio signal;
The low-frequency band detection module is used for detecting whether a near-end frequency amplitude value exists in a target low-frequency band of the near-end audio signal; the target low frequency band is determined according to the low frequency band of the loudspeaker with nonlinear distortion;
the near-end voice detection module is used for determining whether a near-end voice signal exists in the near-end audio signal or not based on the detection result of the near-end frequency amplitude value so as to obtain a near-end detection result;
the far-end voice detection module is used for detecting whether a far-end voice signal exists in the far-end audio signal or not so as to obtain a far-end detection result;
the double-talk detection module is used for determining whether a double-talk state currently exists based on the near-end detection result and the far-end detection result;
The near-end voice detection module further comprises:
the detection sub-module is used for detecting a far-end frequency amplitude value corresponding to a target frequency point in the target low-frequency band of the far-end audio signal if the near-end frequency amplitude value exists; the target frequency point is a frequency point corresponding to the near-end frequency amplitude value;
the comparison sub-module is used for comparing the energy corresponding to the near-end frequency amplitude value with the energy corresponding to the far-end frequency amplitude value, and determining whether nonlinear distortion exists in a target low-frequency band of the near-end audio signal according to a comparison result; if nonlinear distortion exists, determining that a near-end voice signal does not exist in the near-end audio signal, and if nonlinear distortion does not exist, determining that a near-end voice signal exists in the near-end audio signal;
The comparing sub-module is further configured to:
comparing the difference value of the energy corresponding to the near-end frequency amplitude value and the energy corresponding to the far-end frequency amplitude value with a first threshold value;
if the difference value is not greater than the first threshold value, determining that nonlinear distortion exists in a target low-frequency band of the near-end audio signal;
if the difference value is larger than the first threshold value, determining that nonlinear distortion does not exist in the target low-frequency band of the near-end audio signal; or alternatively
Comparing the ratio of the energy corresponding to the near-end frequency amplitude value to the energy corresponding to the far-end frequency amplitude value with a second threshold value;
if the ratio is not greater than the second threshold value, determining that nonlinear distortion exists in the target low-frequency band of the near-end audio signal;
If the ratio is greater than the second threshold value, determining that nonlinear distortion does not exist in the target low-frequency band of the near-end audio signal;
the apparatus further comprises a threshold updating module configured to:
Determining whether an updating condition of the first threshold value or the second threshold value is reached based on the energy corresponding to the near-end frequency amplitude value, the energy corresponding to the far-end frequency amplitude value and the energy mean square error of the far-end audio signal; the energy mean square error of the far-end audio signals is the energy mean square error of each obtained far-end audio signal at the target frequency point in a set time period;
and if the updating condition of the first threshold value or the second threshold value is met, updating the first threshold value or the second threshold value.
7. The apparatus of claim 6, wherein the update condition of the first threshold value or the second threshold value comprises:
$E[A_k(m_0)] > E[M_k(m_0)]$
$\mathrm{STD}\{E[A_{k-n}(m_0)], E[A_{k-n+1}(m_0)], \ldots, E[A_k(m_0)]\} < E_{std}$
$E[A_k(m_0)] > \beta E[\text{noise floor FarEnd}]$
Wherein $m_0$ is the target frequency point; $k-n, k-n+1, \ldots, k$ are the frame numbers corresponding to the far-end audio signals, $k$ is the current frame number, $n$ is an integer greater than or equal to 0, and $k$ is greater than $n$; $A_k(m_0)$ is the far-end frequency amplitude value; $M_k(m_0)$ is the near-end frequency amplitude value; $E[A_k(m_0)]$ is the energy corresponding to the far-end frequency amplitude value; $E[M_k(m_0)]$ is the energy corresponding to the near-end frequency amplitude value; $\mathrm{STD}\{\cdot\}$ represents the mean square error; $E_{std}$ is the energy mean square error threshold of the far-end audio signal; $\beta$ is the energy threshold coefficient of the far-end audio signal; $E[\text{noise floor FarEnd}]$ is the average energy of the background noise of the far-end audio signal.
8. The apparatus of claim 6 or 7, wherein the threshold updating module is further configured to:
updating the first threshold value by the following formula:
wherein, Is the first threshold value; $\gamma$ is a set coefficient; $E[A_k(m_0)]$ is the energy corresponding to the far-end frequency amplitude value; $E[M_k(m_0)]$ is the energy corresponding to the near-end frequency amplitude value; or alternatively
Updating the second threshold value by the following formula:
wherein, Is the second threshold value.
9. The apparatus of claim 6, wherein the near-end voice detection module is further configured to:
and if the near-end frequency amplitude value does not exist in the target low-frequency band of the near-end audio signal, determining that the near-end voice signal does not exist in the near-end audio signal.
10. The apparatus of claim 6 or 7, wherein the double-talk detection module is further configured to:
if the far-end voice signal exists in the far-end audio signal and the near-end voice signal exists in the near-end audio signal, determine that a double-talk state currently exists; or alternatively
if the far-end voice signal exists in the far-end audio signal and the near-end voice signal does not exist in the near-end audio signal, determine that a far-end single-talk state currently exists; or alternatively
if the far-end voice signal does not exist in the far-end audio signal and the near-end voice signal exists in the near-end audio signal, determine that a near-end single-talk state currently exists.
11. An electronic device comprising a processor and a memory, wherein the memory stores program code that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1-5.
12. A computer readable storage medium, characterized in that it comprises a program code for causing an electronic device to perform the steps of the method according to any of claims 1-5, when said program code is run on the electronic device.
CN202110478318.0A 2021-04-30 2021-04-30 Double-talk detection method, device, equipment and medium Active CN113223547B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110478318.0A CN113223547B (en) 2021-04-30 2021-04-30 Double-talk detection method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110478318.0A CN113223547B (en) 2021-04-30 2021-04-30 Double-talk detection method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN113223547A CN113223547A (en) 2021-08-06
CN113223547B true CN113223547B (en) 2024-05-24

Family

ID=77090204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110478318.0A Active CN113223547B (en) 2021-04-30 2021-04-30 Double-talk detection method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113223547B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5353348A (en) * 1993-05-14 1994-10-04 Jrc International, Inc. Double echo cancelling system
US5978824A (en) * 1997-01-29 1999-11-02 Nec Corporation Noise canceler
JP2001067094A (en) * 1999-08-30 2001-03-16 Mitsubishi Electric Corp Voice recognizing device and its method
US6570985B1 (en) * 1998-01-09 2003-05-27 Ericsson Inc. Echo canceler adaptive filter optimization
WO2003096031A2 (en) * 2002-03-05 2003-11-20 Aliphcom Voice activity detection (vad) devices and methods for use with noise suppression systems
CN103561184A (en) * 2013-11-05 2014-02-05 武汉烽火众智数字技术有限责任公司 Frequency-convertible echo cancellation method based on near-end audio signal calibration and correction
CN108353107A (en) * 2015-11-13 2018-07-31 伯斯有限公司 The double talk detection eliminated for acoustic echo
CN110634496A (en) * 2019-10-22 2019-12-31 广州视源电子科技股份有限公司 Double-talk detection method and device, computer equipment and storage medium
CN111161748A (en) * 2020-02-20 2020-05-15 百度在线网络技术(北京)有限公司 Double-talk state detection method and device and electronic equipment
CN112017679A (en) * 2020-08-05 2020-12-01 海尔优家智能科技(北京)有限公司 Method, device and equipment for updating adaptive filter coefficient
CN112292844A (en) * 2019-05-22 2021-01-29 深圳市汇顶科技股份有限公司 Double-end call detection method, double-end call detection device and echo cancellation system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2449720A (en) * 2007-05-31 2008-12-03 Zarlink Semiconductor Inc Detecting double talk conditions in a hands free communication system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
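Research on Acoustic Echo Cancellation Based on Microphone Arrays (基于麦克风阵列的声学回声消除研究); Rao Ding (饶鼎); China Excellent Master's Theses Electronic Journal Network (《中国优秀硕士论文电子期刊网》); 20200815; I136-71 *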

Also Published As

Publication number Publication date
CN113223547A (en) 2021-08-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210926

Address after: 310052 Room 408, building 3, No. 399, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Hangzhou Netease Zhiqi Technology Co.,Ltd.

Address before: 310052 Room 301, Building No. 599, Changhe Street Network Business Road, Binjiang District, Hangzhou City, Zhejiang Province

Applicant before: HANGZHOU LANGHE TECHNOLOGY Ltd.

GR01 Patent grant