CN113345459A - Method and device for detecting double-talk state, computer equipment and storage medium

Info

Publication number: CN113345459A
Application number: CN202110805408.6A
Authority: CN
Inventors: 秦永红; 付贤会; 刘武钊
Original assignee: Beijing Rongxun Technology Co ltd
Current assignee: Beijing Rongxun Technology Co ltd
Priority date: 2021-07-16
Filing date: 2021-07-16
Publication date: 2021-09-03
Anticipated expiration: 2041-07-16
Also published as: CN113345459B

Abstract

The embodiment of the invention discloses a method and a device for detecting a double-talk state, computer equipment and a storage medium. The method comprises the following steps: acquiring a far-end voice reference signal and a near-end signal; determining a near-end microphone input signal according to the far-end speech reference signal and the near-end signal; the far-end voice reference signal is processed by a preset self-adaptive filter to obtain an estimated echo signal; determining a residual echo output signal from the near-end microphone input signal and the estimated echo signal; calculating a double-talk detection decision value according to the estimated echo signal, the near-end signal and the residual echo output signal; and determining whether the current state is the double-talk state or not according to the double-talk detection judgment value. The method for detecting the double-talk state improves the accuracy of double-talk state detection, can adapt to various speaking contexts or scenes, has higher robustness compared with the prior art, and reduces the problem of voice interruption caused by false detection.

Description

Method and device for detecting double-talk state, computer equipment and storage medium

Technical Field

The embodiment of the invention relates to the technical field of audio signal processing, in particular to a method and a device for detecting a double-talk state, computer equipment and a storage medium.

Background

With the continuous development of information technology, various distributed intelligent hardware is increasingly widely applied in various fields, and echo suppression becomes a hot spot for research of technicians in related fields. Real-time transmission of speech over the internet has become widespread, and one of the key factors affecting speech quality is the problem of echo. One key indicator of the echo cancellation algorithm is double-talk detection, and if the double-talk detection is not accurate in the echo cancellation algorithm, voice interruption occurs. Therefore, in echo cancellation processing, double talk detection is crucial to speech quality.

At present, double-talk detection is mostly implemented with traditional time-domain or frequency-domain calculations, for example by using the cross-correlation coefficient between far-end speech and near-end speech, or spectral calculations. However, the prior art has at least the following problems: in Voice over Internet Protocol (VoIP) voice communication, the causes of echo are complex, with complex echo sources, large echo path delays, variable call scenes, and variable call device types, while the iteration factors and parameters of conventional double-talk detection algorithms are essentially fixed. Such algorithms are therefore limited and cannot be adjusted effectively as the speaking context or scene changes.

Disclosure of Invention

The embodiment of the invention provides a method and a device for detecting a double-talk state, computer equipment and a storage medium, which are used for improving the accuracy of double-talk state detection and are suitable for various scenes so as to reduce the problem of voice interruption caused by false detection.

In a first aspect, an embodiment of the present invention provides a method for detecting a double-talk state, where the method includes:

acquiring a far-end voice reference signal and a near-end signal;

determining a near-end microphone input signal according to the far-end speech reference signal and the near-end signal;

the far-end voice reference signal is processed by a preset self-adaptive filter to obtain an estimated echo signal;

determining a residual echo output signal from the near-end microphone input signal and the estimated echo signal;

calculating a double-talk detection decision value according to the estimated echo signal, the near-end signal and the residual echo output signal;

and determining whether the current state is the double-talk state or not according to the double-talk detection judgment value.

In a second aspect, an embodiment of the present invention further provides a device for detecting a double-talk state, where the device includes:

the signal acquisition module is used for acquiring a far-end voice reference signal and a near-end signal;

a near-end microphone input signal determination module, configured to determine a near-end microphone input signal according to the far-end speech reference signal and the near-end signal;

an estimated echo signal obtaining module, configured to pass the far-end speech reference signal through a preset adaptive filter to obtain an estimated echo signal;

a residual echo output signal determining module, configured to determine a residual echo output signal according to the near-end microphone input signal and the estimated echo signal;

the double-talk detection decision value calculation module is used for calculating a double-talk detection decision value according to the estimated echo signal, the near-end signal and the residual echo output signal;

and the double-talk state determining module is used for determining whether the current state is the double-talk state according to the double-talk detection decision value.

In a third aspect, an embodiment of the present invention further provides a computer device, where the computer device includes:

one or more processors;

a memory for storing one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors implement the method for detecting a double talk state provided by any embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for detecting a double talk state provided in any embodiment of the present invention.

The embodiment of the invention provides a method for detecting a double-talk state. A far-end voice reference signal and a near-end signal are first acquired, and a near-end microphone input signal is determined according to the two signals. Meanwhile, the far-end voice reference signal is passed through a preset adaptive filter to obtain an estimated echo signal. A residual echo output signal is then determined according to the near-end microphone input signal and the estimated echo signal. Finally, a double-talk detection decision value is calculated according to the estimated echo signal, the near-end signal and the residual echo output signal, and whether the current state is the double-talk state is determined according to that value. Because the double-talk detection decision value is recalculated adaptively each time from the estimated echo signal, the near-end signal and the residual echo output signal, the method improves the accuracy of double-talk state detection, adapts to various speaking contexts or scenes, is more robust than the prior art, and reduces the voice interruption caused by false detection.

Drawings

Fig. 1 is a flowchart of a method for detecting a double talk state according to an embodiment of the present invention;

fig. 2 is a schematic structural diagram of a double-talk state detection apparatus according to a second embodiment of the present invention;

fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.

Example one

Fig. 1 is a flowchart of a method for detecting a double talk state according to an embodiment of the present invention. The embodiment is applicable to the case of eliminating echo in a microphone collected signal, and the method can be executed by the device for detecting a double-talk state provided by the embodiment of the invention, and the device can be realized by hardware and/or software, and can be generally integrated in a computer device. As shown in fig. 1, the method specifically comprises the following steps:

S11: acquiring a far-end voice reference signal and a near-end signal.

Specifically, during real-time transmission of voice over the internet, the local computer device collects a near-end signal through the microphone and transmits it outward, and plays the received audio signal through the speaker. While collecting the near-end signal, the microphone may therefore also pick up the audio signal played by the speaker, which produces an echo. The audio signal played by the speaker can be used as the far-end voice reference signal.

S12: determining a near-end microphone input signal according to the far-end voice reference signal and the near-end signal.

Specifically, after the far-end speech reference signal and the near-end signal are acquired, a signal actually acquired by the microphone, that is, a near-end microphone input signal, may be calculated according to the two signals.

Optionally, the determining a near-end microphone input signal according to the far-end speech reference signal and the near-end signal includes:

$$y(n) = z(n) + \xi * x(n)$$

where $y(n)$ represents the near-end microphone input signal, $z(n)$ represents the near-end signal, $x(n)$ represents the far-end speech reference signal, and $\xi * x(n)$ represents a superposition of the direct and reflected sounds of the far-end speech reference signal.
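As an illustration of the signal model above, the following minimal Python sketch builds a near-end microphone input signal from a near-end signal and a far-end reference. It assumes that ξ can be treated as a short echo-path impulse response and that `*` denotes convolution, which is one possible reading of "a superposition of direct and reflected sounds". The function name `mic_input` and the sample values are hypothetical.

```python
def mic_input(z, x, xi):
    """Return y(n) = z(n) + (xi * x)(n), with * read as causal convolution.

    z  : near-end signal samples
    x  : far-end speech reference samples (same length as z)
    xi : assumed echo-path impulse response (direct sound plus reflections)
    """
    if len(z) != len(x):
        raise ValueError("z and x must have the same length")
    y = []
    for n in range(len(x)):
        # Echo at time n: sum over taps of xi[k] * x[n - k]
        echo = sum(xi[k] * x[n - k] for k in range(len(xi)) if n - k >= 0)
        y.append(z[n] + echo)
    return y


# Hypothetical data: direct path with gain 0.5 and one reflection
# with gain 0.25 arriving one sample later.
x = [1.0, 0.0, 0.0, 2.0]
z = [0.0, 0.0, 1.0, 0.0]
y = mic_input(z, x, xi=[0.5, 0.25])
```

In this toy case the microphone signal contains the near-end sample at n = 2 and echo contributions from the far-end samples at n = 0 and n = 3.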

S13: passing the far-end voice reference signal through a preset adaptive filter to obtain an estimated echo signal.

Specifically, after the far-end speech reference signal is obtained, the far-end speech reference signal may be input to a preset adaptive filter, so as to obtain an estimated echo signal. Wherein the estimated echo signal obtained by the adaptive filter may be

Where N may be the number of partitioned subbands.

S14: determining a residual echo output signal according to the near-end microphone input signal and the estimated echo signal.

Specifically, after obtaining the near-end microphone input signal and the estimated echo signal, a residual echo output signal may be calculated according to the two signals, that is, the signal after echo removal.

Wherein, optionally, the determining a residual echo output signal from the near-end microphone input signal and the estimated echo signal comprises:

$$e(n) = y(n) - \hat{y}(n)$$

wherein $e(n)$ represents the residual echo output signal, $y(n)$ represents the near-end microphone input signal, and $\hat{y}(n)$ represents the estimated echo signal.
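The relationship between the estimated echo and the residual echo output can be sketched in Python as follows. The text only refers to a "preset adaptive filter" without naming an algorithm, so this sketch substitutes a plain time-domain normalized LMS (NLMS) filter as a stand-in; the function name `nlms_echo_canceller`, the tap count, the step size, and the test signal are all assumptions made for illustration.

```python
import random


def nlms_echo_canceller(x, y, taps=4, step=0.5, eps=1e-8):
    """Return (y_hat, e): estimated echo and residual e(n) = y(n) - y_hat(n).

    NLMS is an assumed stand-in for the "preset adaptive filter".
    """
    w = [0.0] * taps      # adaptive filter coefficients
    buf = [0.0] * taps    # most recent far-end samples, newest first
    y_hat, e = [], []
    for xn, yn in zip(x, y):
        buf = [xn] + buf[:-1]
        est = sum(wk * bk for wk, bk in zip(w, buf))  # estimated echo
        err = yn - est                                # residual echo output
        norm = sum(b * b for b in buf) + eps
        # Normalized LMS coefficient update
        w = [wk + step * err * bk / norm for wk, bk in zip(w, buf)]
        y_hat.append(est)
        e.append(err)
    return y_hat, e


# Hypothetical far-end-only case: the microphone picks up only an echo of x.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
y = [0.5 * x[n] + (0.25 * x[n - 1] if n > 0 else 0.0) for n in range(len(x))]
y_hat, e = nlms_echo_canceller(x, y)
```

With no near-end speech present, the residual shrinks as the filter adapts, which is the situation in which the residual echo output is far smaller than the adaptive filter output.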

S15: calculating a double-talk detection decision value according to the estimated echo signal, the near-end signal and the residual echo output signal.

Specifically, after the estimated echo signal, the near-end signal and the residual echo output signal are obtained, a double-talk detection decision value can be calculated from the three signals, so that whether the current state is the double-talk state can be determined according to the double-talk detection decision value.

Optionally, the calculating a double-talk detection decision value according to the estimated echo signal, the near-end signal, and the residual echo output signal includes:

wherein $\lambda$ represents the double-talk detection decision value, $e(n)$ represents the residual echo output signal, $z(n)$ represents the near-end signal, and $\mu$ represents an amplification factor, which may specifically satisfy $\mu \geq 1$. The remaining terms represent the estimated echo signal and the mean square errors of the estimated echo signal, of the residual echo output signal, and of the near-end signal, respectively. Because the double-talk detection decision value is calculated from these mean square errors, which are statistical characteristics of the time-domain signals, the method has higher robustness.
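The exact expression for the decision value is not reproduced in this text, so the Python sketch below does not restate it. It only illustrates the kind of statistic described: a value built from the mean squares of the estimated echo, the residual echo output, and the near-end signal, with an amplification factor μ ≥ 1. The ratio used here is a hypothetical stand-in chosen to match the behaviour described in the text (close to 1 when the near-end signal is close to 0, clearly below 1 otherwise); the names `mean_square` and `decision_value` and all sample values are assumptions.

```python
def mean_square(frame):
    """Mean of the squared samples over one frame."""
    return sum(v * v for v in frame) / len(frame)


def decision_value(y_hat, e, z, mu=2.0, eps=1e-12):
    """Hypothetical decision value in (0, 1]; NOT the formula from the source.

    Assumed form: ms(y_hat) / (ms(y_hat) + ms(e) + mu * ms(z)), with mu >= 1.
    """
    if mu < 1:
        raise ValueError("mu is expected to satisfy mu >= 1")
    p_hat, p_e, p_z = mean_square(y_hat), mean_square(e), mean_square(z)
    return (p_hat + eps) / (p_hat + p_e + mu * p_z + eps)


# Hypothetical frames: a well-cancelled echo with and without near-end speech.
y_hat = [0.5, -0.5, 0.5, -0.5]
e = [0.01, -0.01, 0.01, -0.01]
lam_far_end_only = decision_value(y_hat, e, z=[0.0, 0.0, 0.0, 0.0])
lam_near_end_active = decision_value(y_hat, e, z=[0.5, -0.5, 0.5, -0.5])
```

Under this assumed form, a larger μ pushes the value further below 1 whenever near-end energy is present, which is one way an amplification factor could increase the separation between the two cases.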

S16: determining whether the current state is the double-talk state according to the double-talk detection decision value.

Optionally, the double-talk detection decision value includes a double-talk detection decision value for each sub-band, and determining whether the current state is the double-talk state according to the double-talk detection decision value includes: comparing the double-talk detection decision value of each sub-band with a preset decision value threshold, and counting the number of sub-bands whose double-talk detection decision value is greater than or equal to the preset decision value threshold; and comparing the number of sub-bands with a preset number, determining that the current state is a single-talk state if the number of sub-bands is greater than or equal to the preset number, and otherwise determining that the current state is a double-talk state. Optionally, the preset decision value threshold is between 0 and 1. Because the residual echo output signal is far smaller than the output signal of the adaptive filter, the double-talk detection decision value is close to 1 when the near-end signal is close to 0 and is clearly smaller than 1 when the near-end signal is not 0. The preset decision value threshold can therefore be set between 0 and 1, and the amplification factor μ gives the double-talk detection a high degree of discrimination.

Specifically, after the double-talk detection decision value of each sub-band is calculated, the number of sub-bands whose decision value is greater than or equal to the preset decision value threshold is counted and compared with the preset number. If at least the preset number of sub-bands have a decision value greater than or equal to the threshold, the current state is judged to be the single-talk state; otherwise, it is judged to be the double-talk state. The preset number adjusts the sensitivity of double-talk detection, and its setting is not restricted by environmental factors.
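The sub-band counting rule described above can be sketched in Python as follows. The function name `is_double_talk`, the default threshold of 0.5, the default preset number of 3, and the sample decision values are hypothetical; the text only states that the threshold lies between 0 and 1 and that the preset number controls sensitivity.

```python
def is_double_talk(decision_values, threshold=0.5, preset_number=3):
    """Classify one frame from its per-sub-band decision values.

    Returns True for the double-talk state, False for the single-talk state.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold is expected to lie in [0, 1]")
    # Count sub-bands whose decision value reaches the threshold
    count = sum(1 for lam in decision_values if lam >= threshold)
    # Enough such sub-bands -> single-talk; otherwise double-talk
    return count < preset_number


# Hypothetical frames with four sub-bands each
single_talk_frame = is_double_talk([0.9, 0.95, 0.8, 0.3])
double_talk_frame = is_double_talk([0.9, 0.2, 0.1, 0.3])
```

Raising `preset_number` makes the rule demand more sub-bands near 1 before it declares single-talk, so frames are classified as double-talk more readily.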

On the basis of the foregoing technical solution, optionally, after determining whether the current state is the double-talk state according to the double-talk detection decision value, the method further includes: if the current state is determined to be the double-talk state, taking the residual echo output signal as the final output signal to complete echo suppression. Specifically, if the current state is determined to be the double-talk state, the echo needs to be cancelled, which may include: acquiring the far-end voice reference signal $x(n)$; passing $x(n)$ through the adaptive filter to obtain the output signal $\hat{y}(n)$; obtaining the near-end microphone input signal $y(n) = z(n) + \xi * x(n)$, where $z(n)$ is the near-end signal and $\xi * x(n)$ is the superposition of the direct sound and the reflected sound of the far-end voice reference signal; and calculating the output signal $e(n) = y(n) - \hat{y}(n)$, which completes echo suppression.

According to the technical solution provided by the embodiment of the invention, a far-end voice reference signal and a near-end signal are first acquired, and a near-end microphone input signal is determined according to the two signals. Meanwhile, the far-end voice reference signal is passed through a preset adaptive filter to obtain an estimated echo signal. A residual echo output signal is then determined according to the near-end microphone input signal and the estimated echo signal. Finally, a double-talk detection decision value is calculated according to the estimated echo signal, the near-end signal and the residual echo output signal, and whether the current state is the double-talk state is determined according to that value. Because the double-talk detection decision value is recalculated adaptively each time from the estimated echo signal, the near-end signal and the residual echo output signal, the accuracy of double-talk state detection is improved, the method adapts to various speaking contexts or scenes, is more robust than the prior art, and reduces the voice interruption caused by false detection.

Example two

Fig. 2 is a schematic structural diagram of a double-talk state detection apparatus according to a second embodiment of the present invention. The apparatus may be implemented in hardware and/or software, and may generally be integrated in a computer device to execute the double-talk state detection method according to any embodiment of the present invention. As shown in fig. 2, the apparatus includes:

a signal obtaining module 21, configured to obtain a far-end speech reference signal and a near-end signal;

a near-end microphone input signal determining module 22, configured to determine a near-end microphone input signal according to the far-end speech reference signal and the near-end signal;

an estimated echo signal obtaining module 23, configured to pass the far-end speech reference signal through a preset adaptive filter to obtain an estimated echo signal;

a residual echo output signal determining module 24, configured to determine a residual echo output signal according to the near-end microphone input signal and the estimated echo signal;

a double-talk detection decision value calculation module 25, configured to calculate a double-talk detection decision value according to the estimated echo signal, the near-end signal, and the residual echo output signal;

and a double-talk state determining module 26, configured to determine whether the current state is the double-talk state according to the double-talk detection decision value.

On the basis of the above technical solution, optionally, the double-talk state determining module 26 includes:

the sub-band number counting unit is used for respectively comparing the double-talk detection judgment value of each sub-band with a preset judgment value threshold value and counting the number of the sub-bands of which the corresponding double-talk detection judgment value is greater than or equal to the preset judgment value threshold value;

and the sub-band quantity comparison unit is used for comparing the sub-band quantity with a preset quantity, if the sub-band quantity is greater than or equal to the preset quantity, the current single-talk state is determined, and if not, the current double-talk state is determined.

On the basis of the above technical solution, optionally, the preset judgment value threshold is 0-1.

On the basis of the above technical solution, optionally, the near-end microphone input signal determining module 22 is specifically configured to:

$$y(n) = z(n) + \xi * x(n)$$

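On the basis of the above technical solution, optionally, the residual echo output signal determining module 24 is specifically configured to:

$$e(n) = y(n) - \hat{y}(n)$$

wherein $e(n)$ represents the residual echo output signal, $y(n)$ represents the near-end microphone input signal, and $\hat{y}(n)$ represents the estimated echo signal.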

On the basis of the above technical solution, optionally, the dual-talk detection decision value calculation module 25 is specifically configured to:

wherein $\lambda$ represents the double-talk detection decision value, $\mu$ represents the amplification factor, and the remaining terms represent the mean square errors of the estimated echo signal, of the residual echo output signal, and of the near-end signal, respectively.

On the basis of the above technical solution, optionally, the apparatus for detecting a double-talk state further includes:

an echo suppression module, configured to, after whether the current state is the double-talk state has been determined according to the double-talk detection decision value, take the residual echo output signal as the final output signal to complete echo suppression if the current state is determined to be the double-talk state.

The device for detecting the double-talk state provided by the embodiment of the invention can execute the method for detecting the double-talk state provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.

It should be noted that, in the embodiment of the apparatus for detecting a double-talk state, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

Example three

Fig. 3 is a schematic structural diagram of a computer device provided in the third embodiment of the present invention, and shows a block diagram of an exemplary computer device suitable for implementing the embodiment of the present invention. The computer device shown in fig. 3 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present invention. As shown in fig. 3, the computer apparatus includes a processor 31, a memory 32, an input device 33, and an output device 34; the number of the processors 31 in the computer device may be one or more, one processor 31 is taken as an example in fig. 3, the processor 31, the memory 32, the input device 33 and the output device 34 in the computer device may be connected by a bus or in other ways, and the connection by the bus is taken as an example in fig. 3.

The memory 32 is used as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the detection method of the double-talk state in the embodiment of the present invention (for example, the signal acquisition module 21, the near-end microphone input signal determination module 22, the estimated echo signal acquisition module 23, the residual echo output signal determination module 24, the double-talk detection determination value calculation module 25, and the double-talk state determination module 26 in the detection device of the double-talk state). The processor 31 executes various functional applications and data processing of the computer device by executing software programs, instructions and modules stored in the memory 32, that is, the above-mentioned detection method of the double talk state is realized.

The memory 32 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 32 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 32 may further include memory located remotely from the processor 31, which may be connected to a computer device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 33 may be used to acquire a far-end speech reference signal and a near-end signal, and to generate key signal inputs and the like relating to user settings and function control of the computer apparatus. The output device 34 may be used to output the processed target audio data and the like.

Example four

An embodiment of the present invention further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a method for detecting a double talk state, where the method includes:

acquiring a far-end voice reference signal and a near-end signal;

The storage medium may be any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media such as CD-ROM, floppy disk, or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory such as flash memory, magnetic media (e.g., hard disk or optical storage); registers or other similar types of memory elements, etc. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in the computer system in which the program is executed, or may be located in a different second computer system connected to the computer system through a network (such as the internet). The second computer system may provide the program instructions to the computer for execution. The term "storage medium" may include two or more storage media that may reside in different locations, such as in different computer systems that are connected by a network. The storage medium may store program instructions (e.g., embodied as a computer program) that are executable by one or more processors.

Of course, the storage medium containing the computer-executable instructions provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the method for detecting a double talk state provided by any embodiment of the present invention.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method for detecting a double talk state is characterized by comprising the following steps:

acquiring a far-end voice reference signal and a near-end signal;

2. The method according to claim 1, wherein the double-talk detection decision value includes the double-talk detection decision value of each sub-band, and the determining whether the current state is the double-talk state according to the double-talk detection decision value includes:

respectively comparing the double-talk detection judgment value of each sub-band with a preset judgment value threshold, and counting the number of the sub-bands of which the corresponding double-talk detection judgment value is greater than or equal to the preset judgment value threshold;

and comparing the number of the sub-bands with a preset number, if the number of the sub-bands is more than or equal to the preset number, determining that the current state is a single-talk state, and otherwise, determining that the current state is a double-talk state.

3. The method according to claim 2, wherein the preset decision value threshold is between 0 and 1.

4. The method of claim 1, wherein the determining a near-end microphone input signal from the far-end speech reference signal and the near-end signal comprises:

$$y(n) = z(n) + \xi * x(n)$$

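5. The method of claim 1, wherein determining a residual echo output signal based on the near-end microphone input signal and the estimated echo signal comprises:

$$e(n) = y(n) - \hat{y}(n)$$

wherein $e(n)$ represents the residual echo output signal, $y(n)$ represents the near-end microphone input signal, and $\hat{y}(n)$ represents the estimated echo signal.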

6. The method of claim 1, wherein said calculating a double talk detection decision value based on said estimated echo signal, said near-end signal and said residual echo output signal comprises:

wherein λ represents the double talk detection determination value,

representing said estimated echo signal, e(n) representing said residual echo output signal, z(n) representing said near-end signal,

representing the mean square error of the estimated echo signal,

representing the mean square error of the residual echo output signal,

7. The method for detecting a double talk state according to claim 1, further comprising, after determining whether the double talk state is currently present according to the double talk detection determination value:

and if the current state is determined to be the double-talk state, taking the residual echo output signal as a final output signal to finish echo suppression.

8. A device for detecting a double-talk state, comprising:

9. A computer device, comprising:

one or more processors;

a memory for storing one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for detecting a double-talk state as recited in any one of claims 1-7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method for detecting a double talk state according to any one of claims 1 to 7.