WO2021114779A1

WO2021114779A1 - Echo cancellation method, apparatus, and system employing double-talk detection

Info

Publication number: WO2021114779A1
Application number: PCT/CN2020/114168
Authority: WO
Inventors: 潘思伟; 罗本彪; 雍雅琴; 董斐; 林福辉
Original assignee: 展讯通信（上海）有限公司
Priority date: 2019-12-13
Filing date: 2020-09-09
Publication date: 2021-06-17
Also published as: CN110995951B; CN110995951A

Abstract

An echo cancellation method, device, and system employing double-talk detection. The method comprises: acquiring an input sound signal from a sound collection device; performing adaptive filtering on the input sound signal to obtain a near-end speech estimation signal; determining the current sound production status according to the near-end speech estimation signal; acquiring preset mappings between sound production statuses and processing procedures, and acquiring, according to the mappings, a processing procedure corresponding to the current sound production status; processing the near-end speech estimation signal according to the processing procedure; and outputting the processed near-end speech estimation signal to obtain an output signal. The method improves echo cancellation, and improves the two-way conversation experience for hands-free speech communication terminals.

Description

Echo cancellation method, device and system based on double-ended vocalization detection

This application claims priority to Chinese patent application No. 201911284296.3, filed with the Chinese Patent Office on December 13, 2019 and entitled "Echo cancellation method, device and system based on double-ended sound detection", the entire content of which is incorporated herein by reference.

Technical field

This application relates to the field of voice communication, and in particular to an echo cancellation method, device and system based on double-ended voice detection.

Background

In a telephone terminal, acoustic echo arises from the coupling between the loudspeaker and the terminal's microphone, so the microphone signal contains not only the useful voice signal but also echo. If the microphone signal is not processed, the echo is transmitted together with the near-end voice signal to the far-end speaker for playback, and the far-end caller hears a delayed copy of his or her own voice, which is uncomfortable and degrades the call. When the echo is loud, the call may not be able to proceed normally at all. Effective measures must therefore be taken to suppress the echo and eliminate its impact in order to improve the quality of voice communication.

Acoustic echo is present to varying degrees in systems such as conference calls and hands-free phones, and echo cancellation has been an engineering problem ever since Bell invented the telephone. In recent years, with the rapid development of information technology, communication methods and application scenarios have become increasingly diverse and communication terminals increasingly compact. The coupling between speakers and microphones has grown stronger, and the echo channel has become more complex and changeable. This poses a great challenge for acoustic echo cancellation in voice communication systems.

Acoustic echo generally arises in hands-free communication systems and is shaped by how the sound waves propagate. It can be divided into direct echo and indirect echo. Direct echo is the sound played by the loudspeaker that travels straight to the microphone without any reflection and is picked up. It has the shortest delay and depends on factors such as the voice energy of the far-end talker, the distance and angle between the loudspeaker and the microphone, the playback volume of the loudspeaker, and the pickup sensitivity of the microphone. Indirect echo is the set of echoes produced when the sound played by the loudspeaker reaches the microphone after one or more reflections along different paths. It is characterized by long delay, large delay jitter, and an echo level that depends strongly on the environment.

In the prior art, an adaptive acoustic echo canceller (AEC) is usually used to cancel the echo. The basic principle of AEC is to adaptively estimate the echo and subtract the estimated echo from the signal picked up by the microphone. In a telephone circuit, AEC can shield the callers from echo regardless of distance; in a hands-free phone, AEC can minimize the echo. When there is no near-end sound, the echo cancellation achieved by AEC meets current needs. When there is significant near-end sound, however, the performance of AEC based on existing adaptive filtering algorithms deteriorates, and convergence of the adaptive filtering algorithm cannot even be guaranteed. This is the key problem that must be solved in practical echo cancellation, and it is usually called the double-talk (DT) problem. To reduce or avoid the impact of double-talk on AEC performance, a double-talk detector (DTD) can be used. A typical application of DTD is to freeze AEC updates during double-talk periods to prevent the adaptive filtering algorithm from diverging.
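
The principle described above (adaptively estimate the echo, subtract it from the microphone signal, and optionally freeze adaptation when a DTD reports double-talk) can be illustrated with a normalized LMS (NLMS) filter. NLMS is a common textbook choice and is not named in this application; the function name, the tap count, the step size, the `freeze` flag list, and the four-tap echo path are all illustrative assumptions.

```python
import random

def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8, freeze=None):
    """Return the residual mic[n] - estimated_echo[n] for every sample.

    freeze: optional list of booleans; True at index n skips the
    coefficient update, as a DTD would request during double-talk.
    """
    w = [0.0] * taps          # adaptive estimate of the echo path
    x_buf = [0.0] * taps      # recent far-end samples, newest first
    residual = []
    for n, (x, d) in enumerate(zip(far_end, mic)):
        x_buf = [x] + x_buf[:-1]
        echo_hat = sum(wi * xi for wi, xi in zip(w, x_buf))
        e = d - echo_hat      # microphone signal minus estimated echo
        residual.append(e)
        if freeze is None or not freeze[n]:
            norm = sum(xi * xi for xi in x_buf) + eps
            w = [wi + mu * e * xi / norm for wi, xi in zip(w, x_buf)]
    return residual

def power(samples):
    return sum(s * s for s in samples) / len(samples)

# Far-end-only scenario: the microphone hears nothing but echo.
random.seed(0)
far = [random.uniform(-1.0, 1.0) for _ in range(4000)]
echo_path = [0.6, 0.3, -0.2, 0.1]   # hypothetical room response
mic = [sum(h * far[n - j] for j, h in enumerate(echo_path) if n - j >= 0)
       for n in range(len(far))]
residual = nlms_echo_cancel(far, mic)
```

In this echo-only scenario the residual power decays toward zero as the filter converges. With near-end speech mixed into `mic`, the same update would be driven by the speech as well as the echo, which is the divergence risk that freezing the update is meant to avoid.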

A DTD is built on a double-ended utterance detection algorithm. Such algorithms include energy-based algorithms, algorithms based on signal correlation characteristics, and algorithms based on spectral characteristics. All of them rely on a fixed threshold: the utterance state is judged by comparing a calculated statistic with the threshold. Because real channels and call conditions vary, a fixed threshold cannot detect the double-ended utterance state accurately. This not only reduces the robustness of echo cancellation but also causes severe speech clipping in subsequent processing, so that the sound transmitted to the far-end user is intermittent.

The main influencing factor in hands-free communication equipment is the signal-to-return ratio of the signal received by the microphone, that is, the amplitude (power) ratio of the near-end voice received by the microphone to the echo signal received from the loudspeaker. Compared with handheld calls, the signal-to-return ratio at the microphone is usually lower during hands-free calls, and it changes with the distance between the microphone and the near-end talker, the volume of the near-end talker, and the level of the echo. As a result, traditional double-ended utterance detection algorithms based on a fixed threshold often fail, and it is difficult to balance duplex performance and echo removal in hands-free calls.

To sum up, the echo cancellation technology in the prior art cannot accurately filter out echo interference under double-ended utterance, especially in hands-free calls and conference calls, and call quality suffers as a result.

Summary of the invention

The technical problem solved by this application is how to better eliminate echo and improve the duplex call experience of the hands-free voice communication terminal.

To solve the above technical problem, embodiments of the present application provide an echo cancellation method, device, and system based on double-ended utterance detection. The method may include: acquiring an input sound signal from a sound collection device; performing adaptive filtering on the input sound signal to obtain a near-end speech estimation signal; determining the current utterance state according to the near-end speech estimation signal; obtaining a preset mapping relationship between utterance states and processing modes, and obtaining, according to the mapping relationship, the processing mode corresponding to the current utterance state; processing the near-end speech estimation signal according to the processing mode; and outputting the processed near-end speech estimation signal to obtain an output signal.

Optionally, determining the current utterance state according to the near-end speech estimation signal includes: calculating the average value of the double-ended utterance state statistics of the current frame in the input sound signal according to the input sound signal and the near-end speech estimation signal; obtaining the dual-talk judgment threshold corresponding to the current frame, the dual-talk judgment threshold being obtained according to the signal-to-return ratio of the input sound signal and the near-end interference signal; and determining the current utterance state according to the magnitude relationship between the average value of the double-ended utterance state statistics of the current frame and the dual-talk judgment threshold.

Optionally, calculating the average value of the double-ended utterance state statistics of the current frame in the input sound signal according to the input sound signal and the near-end speech estimation signal includes: calculating the average value of the double-ended utterance state statistics of the current frame according to a formula, where the result is the average value of the double-ended utterance state statistics of the current frame, P_s(k, n) represents the power of the near-end speech estimation signal at the n-th sample point of the k-th frame, P_m(k, n) represents the power of the input sound signal at the n-th sample point of the k-th frame, and mean() represents the average of the values in brackets.

Optionally, obtaining the dual-talk judgment threshold corresponding to the current frame, the dual-talk judgment threshold being obtained according to the signal-to-return ratio of the input sound signal and the near-end interference signal, includes: estimating the signal-to-return ratio of the input sound signal in real time to obtain the average signal-to-return ratio of the current frame in the input sound signal; acquiring multiple preset thresholds of the signal-to-return ratio, and constructing multiple signal-to-return ratio intervals according to the multiple thresholds; and determining the signal-to-return ratio interval to which the average signal-to-return ratio of the current frame belongs, and taking the dual-talk judgment threshold corresponding to that interval as the dual-talk judgment threshold of the current frame.

Optionally, estimating the signal-to-return ratio of the input sound signal in real time to obtain the average signal-to-return ratio of the current frame in the input sound signal includes: acquiring a near-end interference signal, where the near-end interference signal is the sound signal generated by the sound-producing device at the same end as the sound collection device; and calculating the average signal-to-return ratio of the current frame in the input sound signal according to a formula, where the result is the estimated average signal-to-return ratio of the k-th frame in dB, P_m(k, n) represents the power of the input sound signal at the n-th sample point of the k-th frame, P_x(k, n) represents the power of the near-end interference signal at the n-th sample point of the k-th frame, and mean() represents the average of the values in brackets.

Optionally, acquiring the multiple preset thresholds of the signal-to-return ratio and constructing the multiple signal-to-return ratio intervals according to the thresholds includes: using each pair of adjacent thresholds among the acquired thresholds as the boundary values of one signal-to-return ratio interval, to obtain the multiple signal-to-return ratio intervals.

Optionally, the utterance state includes two states, far-end utterance only and not only far-end utterance, and the preset mapping relationship between the utterance state and the processing mode includes: when the utterance state is far-end utterance only, zeroing the near-end speech estimation signal or suppressing it until it is inaudible; and when the utterance state is not only far-end utterance, retaining the near-end speech estimation signal.

Optionally, the state of not only far-end utterance includes two states: near-end utterance only and double-ended utterance.

Optionally, the performing adaptive filtering on the input sound signal to obtain the near-end speech estimation signal includes: performing linear filtering and non-linear filtering on the input sound signal, respectively, to obtain the near-end speech estimation signal.

An embodiment of the present application also provides an echo cancellation device based on double-ended utterance detection. The device includes: an input sound signal acquisition module, configured to acquire an input sound signal from a sound collection device; a filtering module, configured to perform adaptive filtering on the input sound signal to obtain the near-end speech estimation signal; a current utterance state determination module, configured to determine the current utterance state according to the near-end speech estimation signal; a processing mode acquisition module, configured to obtain the preset mapping relationship between utterance states and processing modes and to obtain, according to the mapping relationship, the processing mode corresponding to the current utterance state; a near-end processing module, configured to process the near-end speech estimation signal according to the processing mode; and an output module, configured to output the processed near-end speech estimation signal to obtain an output signal.

An embodiment of the present application also provides an echo cancellation system based on double-ended utterance detection, including a sound collection device, a same-end sound-producing device, and an echo cancellation device, where the echo cancellation device executes the steps of any one of the above-mentioned methods.

Compared with the prior art, the technical solutions of the embodiments of the present application have the following beneficial effects:

An embodiment of the present application provides an echo cancellation method based on double-ended utterance detection. The method includes: acquiring an input sound signal from a sound collection device; performing adaptive filtering on the input sound signal to obtain a near-end speech estimation signal; determining the current utterance state according to the near-end speech estimation signal; obtaining a preset mapping relationship between utterance states and processing modes, and obtaining, according to the mapping relationship, the processing mode corresponding to the current utterance state; processing the near-end speech estimation signal according to the processing mode; and outputting the processed near-end speech estimation signal to obtain an output signal. In existing communication schemes, the input sound signal in a voice call such as a telephone call is transmitted to the peer device either directly or after adaptive echo cancellation alone. By contrast, the technical solution of this method customizes different processing modes for the different utterance states of the input sound signal and uses the characteristics of double-ended utterance to filter the echo out of the input sound signal accurately. In call systems that are strongly affected by double-ended utterance, such as hands-free calls and voice conferences, the call quality can be significantly improved.

Further, the utterance state is judged in real time for each frame of the input sound signal, so that the processing mode applied to the near-end speech estimation signal is updated in real time. The echo in the input sound signal can thus be cancelled accurately and completely, and the stability of the call is ensured.

Further, the signal-to-return ratio of the input sound signal, with the near-end interference signal as the echo source, is calculated in real time by sampling, and different dual-talk judgment thresholds are set for different degrees of influence of the near-end interference signal on the input sound signal. The current utterance state can thus be determined more accurately, improving the accuracy of echo cancellation for the input sound signal.

Furthermore, two utterance states are defined, and processing methods corresponding to the two utterance states are specified, which can basically meet the requirements of real-time echo cancellation in common voice calls.

Further, the adaptive filtering of the input sound signal includes two operations of linear filtering and non-linear filtering, which can further suppress the echo of the input sound signal.

The echo cancellation system based on double-ended utterance detection provided by the embodiments of the present application detects the acoustic echo generated during communication in real time and cancels it according to the detection result, which improves echo cancellation and therefore call quality when the voice communication terminal is in hands-free mode. For hands-free voice communication terminals in particular, the echo cancellation method, device and system provided in the embodiments of the present application can distinguish in real time between far-end utterance only and the other cases (near-end utterance only or double-ended utterance). When it is judged that only the far end is speaking, the time-domain output is zeroed or suppressed until it is inaudible, so that echo is eliminated to the greatest extent while duplex call performance is preserved. Echo cancellation and duplex performance are thereby improved together, which improves the duplex call experience of the hands-free voice communication terminal.

Description of the drawings

FIG. 1 is a schematic flowchart of an echo cancellation method based on double-ended vocalization detection according to an embodiment of the present application;

FIG. 2 is a schematic diagram of the application of an echo cancellation method based on double-ended vocalization detection according to an embodiment of the present application;

FIG. 3 is a schematic flowchart of step S103 in FIG. 1 in an embodiment of the present application;

FIG. 4 is a schematic flowchart of step S302 in FIG. 3 in an embodiment of the present application;

FIG. 5 is a schematic diagram of signal-to-return ratio intervals according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of an echo cancellation device based on double-ended vocalization detection according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of an echo cancellation system based on double-ended vocalization detection according to an embodiment of the present application.

Detailed description

As mentioned in the background, the echo cancellation technology in the prior art cannot accurately filter out echo interference under double-ended utterance, especially in hands-free calls and conference calls, and call quality suffers as a result.

To solve the above technical problem, an embodiment of the present application provides an echo cancellation method based on double-ended utterance detection. The method includes: acquiring an input sound signal from a sound collection device; performing adaptive filtering on the input sound signal to obtain a near-end speech estimation signal; determining the current utterance state according to the near-end speech estimation signal; obtaining a preset mapping relationship between utterance states and processing modes, and obtaining, according to the mapping relationship, the processing mode corresponding to the current utterance state; processing the near-end speech estimation signal according to the processing mode; and outputting the processed near-end speech estimation signal to obtain an output signal.

The solution described in this embodiment can filter out the interfering signal under double-ended utterance and significantly improve call quality.

Please refer to Figure 1 for details. Figure 1 provides a schematic flow chart of an echo cancellation method based on double-ended vocalization detection; the method may specifically include the following steps:

S101: Obtain an input sound signal from a sound collection device.

The input sound signal is the sound signal collected by the sound collection device. The sound collection device may be a microphone or another device; for a telephone call or similar, it is the sound collection device built into a terminal such as a mobile phone, a landline phone, or a computer.

During a telephone call, a terminal such as a telephone collects local sound in real time through the sound collection device and transmits it to the other party over the communication line. In this method, after the local sound collection device collects the input sound signal, the signal is not transmitted to the other party directly. Instead, echo cancellation is applied to it through the following steps S102 to S106 to improve the quality of the voice call.

S102: Perform adaptive filtering on the input sound signal to obtain a near-end speech estimation signal.

After the input sound signal is acquired from the sound collection device, it is filtered to remove the locally generated echo signal that interferes with the call, yielding the near-end speech estimation signal. The adaptive filtering may use an adaptive echo canceller (AEC) to filter the input sound signal and obtain the near-end speech estimation signal.

S103: Determine the current utterance state according to the near-end speech estimation signal.

The utterance state can include different states such as far-end utterance only, double-ended utterance, and near-end utterance only. Each utterance state corresponds to a different processing mode for the obtained near-end speech estimation signal; the states can be set as needed and are not limited to the examples above. The current utterance state is the utterance state determined in real time for the near-end speech estimation signal obtained this time.

After obtaining the near-end speech estimation signal, the corresponding utterance state can be determined according to the waveform, channel and other attributes of the speech signal.

S104: Obtain a preset mapping relationship between the utterance state and the processing mode, and obtain the processing mode corresponding to the current utterance state according to the mapping relationship.

The processing mode is the processing applied to the near-end speech estimation signal in each utterance state, and may include setting the near-end speech estimation signal to zero, retaining it in full, or retaining part of it. The mapping relationship between utterance states and processing modes can be set in advance, so that once the current utterance state is determined, the corresponding processing mode is obtained automatically from the mapping relationship.

S105: Process the near-end speech estimation signal according to the processing manner.

After obtaining the processing mode corresponding to the current utterance state in step S104, the near-end speech estimation signal is processed according to this processing mode.

S106: Output the processed near-end speech estimation signal to obtain an output signal.

The processed near-end voice estimation signal can correctly reflect the call information of the local end, and this output signal can be transmitted to the call peer through the communication link.
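
Steps S101 to S106 can be sketched as one per-frame function. This is a minimal illustration under stated assumptions, not the implementation of this application: the names `UtteranceState`, `PROCESSING_BY_STATE`, and `process_frame` are hypothetical, the adaptive filter and the state detector are passed in as callables because the steps above leave their internals open, and only the two-state mapping (zero the signal for far-end utterance only, retain it otherwise) is shown.

```python
from enum import Enum
from typing import Callable, Dict, List

Frame = List[float]

class UtteranceState(Enum):
    FAR_END_ONLY = "far_end_only"
    NOT_FAR_END_ONLY = "not_far_end_only"   # near-end only, or double-ended

def zero_out(frame: Frame) -> Frame:
    """Processing mode: set the near-end speech estimate to zero."""
    return [0.0] * len(frame)

def keep(frame: Frame) -> Frame:
    """Processing mode: retain the near-end speech estimate unchanged."""
    return list(frame)

# S104: preset mapping between utterance states and processing modes.
PROCESSING_BY_STATE: Dict[UtteranceState, Callable[[Frame], Frame]] = {
    UtteranceState.FAR_END_ONLY: zero_out,
    UtteranceState.NOT_FAR_END_ONLY: keep,
}

def process_frame(
    mic_frame: Frame,                                     # S101: input sound signal
    adaptive_filter: Callable[[Frame], Frame],
    detect_state: Callable[[Frame, Frame], UtteranceState],
) -> Frame:
    near_end_estimate = adaptive_filter(mic_frame)        # S102
    state = detect_state(mic_frame, near_end_estimate)    # S103
    processing = PROCESSING_BY_STATE[state]               # S104
    return processing(near_end_estimate)                  # S105 and S106
```

Keeping the mapping in a dictionary means a further state or processing mode (for example, partial retention) could be added without touching `process_frame`.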

In existing communication schemes, the input sound signal in a voice call such as a telephone call is transmitted to the peer device either directly or after adaptive echo cancellation alone. With the method of the above embodiment, by contrast, different processing modes can be customized for the different utterance states of the input sound signal, and the interference or echo in the input sound signal can be filtered out accurately by using the characteristics of double-ended utterance. In call systems that are strongly affected by double-ended utterance, such as hands-free calls and voice conferences, the call quality can be significantly improved.

Please refer to FIG. 2, which provides a schematic diagram of the application of an echo cancellation method based on double-ended utterance detection. In the application scenario shown in FIG. 2, the call parties include a far-end device 200 and a near-end device 210, where the far-end device 200 includes a far-end microphone 201 and a far-end speaker 202, and the near-end device 210 includes a near-end speaker 203 and a near-end microphone 204.

During communication, the far-end microphone 201 sends the downlink signal S1 to the near-end speaker 203. The direct echo S2 is the sound signal emitted by the near-end speaker 203 and picked up directly by the near-end microphone 204, and the indirect echo S3 is the sound signal emitted by the near-end speaker 203, reflected by the environment, and picked up indirectly by the near-end microphone 204. While the echoes (direct echo S2 and indirect echo S3) are being picked up, a person (not shown) speaks into the near-end microphone 204 (marked "voice" in the figure); the near-end microphone 204 picks up the voice and generates an uplink signal S4, which is sent to the far-end speaker 202 to be played out.

The echo cancellation method of FIG. 1 can be applied on the near-end microphone 204 side in FIG. 2: after the near-end microphone 204 obtains the input sound signal, and before that signal (that is, the sound signal obtained from the voice in FIG. 2) is sent to the far-end device 200, the input sound signal is first processed by the echo cancellation method of FIG. 1.

In an embodiment, please continue to refer to FIG. 1. Step S103 in FIG. 1 determines the current utterance state according to the near-end speech estimation signal, which may specifically include steps S301 to S303 in FIG. 3.

S301: Calculate the average value of the double-ended utterance state statistics of the current frame in the input sound signal according to the input sound signal and the near-end speech estimation signal.

The double-ended utterance state statistics of the current frame are obtained by taking the current frame of the input sound signal as the reference point, sampling the input sound signal and the near-end speech estimation signal before the reference point, and comparing the two sampled signals; the result reflects the current utterance state of the input sound signal. The average value is the average of the double-ended utterance state statistics over several sample points. It may be obtained by feeding the input sound signal and the near-end speech estimation signal into a double-ended utterance detector.

S302: Acquire a dual-talk judgment threshold corresponding to the current frame, where the dual-talk judgment threshold is obtained according to the signal-to-return ratio of the input sound signal and the near-end interference signal.

The signal-to-return ratio of the input sound signal is the energy ratio of the signal to the echo in the input sound signal, and it can be obtained by calculation from the input sound signal.

The near-end interference signal is the interference that the sound produced by the same-end sound-producing device causes in the signal received by the sound collection device, and it can be obtained from that sound-producing device. The sound-producing device can be, for example, the speaker that accompanies the local microphone in telephone communication.

The dual-talk judgment threshold is the threshold used to determine which utterance state the average value of the double-ended utterance state statistics of the current frame corresponds to. Multiple such thresholds are set for the average value of the double-ended utterance state statistics. The dual-talk judgment threshold is set based on two factors: the signal-to-return ratio of the input sound signal and the near-end interference signal.

S303: Determine the current utterance state according to the magnitude relationship between the average value of the double-ended utterance state statistics of the current frame and the double-talk determination threshold.

According to the magnitude relationship between the average value of the double-ended utterance state statistics of the current frame in the input sound signal and the dual-talk judgment threshold obtained in step S302, it is determined which utterance state's threshold interval the average value falls within, and thereby the current utterance state.

In this embodiment, the utterance state is judged in real time for each frame of the input sound signal, so that the processing mode applied to the near-end speech estimation signal is updated in real time. The echo in the input sound signal can thus be cancelled accurately and completely, and the stability of the call is ensured.
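
The per-frame judgment of step S303 can be sketched as a comparison applied to every frame in turn. The direction of the comparison is an assumption made here for illustration: the sketch treats a small statistic as meaning that little signal survived adaptive filtering, hence far-end utterance only. The names `UtteranceState`, `decide_state`, and `classify_frames` are hypothetical, and only two states are modeled.

```python
from enum import Enum

class UtteranceState(Enum):
    FAR_END_ONLY = 1
    NOT_FAR_END_ONLY = 2

def decide_state(avg_statistic, dt_threshold):
    """Compare one frame's averaged statistic with its dual-talk threshold.

    Assumption: values below the threshold indicate far-end utterance only.
    """
    if avg_statistic < dt_threshold:
        return UtteranceState.FAR_END_ONLY
    return UtteranceState.NOT_FAR_END_ONLY

def classify_frames(avg_statistics, dt_thresholds):
    """Judge every frame independently, each against its own threshold."""
    return [decide_state(s, t) for s, t in zip(avg_statistics, dt_thresholds)]
```

Because each frame carries its own threshold, the decision can follow a threshold that changes from frame to frame rather than a single fixed value.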

In an embodiment, please continue to refer to FIG. 3. In step S301, the average value of the double-ended utterance state statistics of the current frame in the input sound signal can be calculated according to the following formula:

where the result of the formula is the average value of the double-ended utterance state statistics of the current frame, P_s(k, n) represents the power of the near-end speech estimation signal at the n-th sample point of the k-th frame, P_m(k, n) represents the power of the input sound signal at the n-th sample point of the k-th frame, and mean() takes the average of the values in brackets.
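
The formula itself is not reproduced in this text, so the following is only a sketch built from the definitions above. It assumes that the statistic is the ratio P_s(k, n) / P_m(k, n) averaged over the sample points of frame k, and that the per-sample power is the squared amplitude; both are assumptions, as are the function name and the small `eps` guard against division by zero.

```python
def average_dt_statistic(near_end_estimate, mic, eps=1e-12):
    """Assumed form: mean over n of P_s(k, n) / P_m(k, n) for one frame k.

    near_end_estimate: samples of the near-end speech estimation signal
    mic:               samples of the input sound signal (same frame)
    """
    if len(near_end_estimate) != len(mic) or not mic:
        raise ValueError("frames must be non-empty and of equal length")
    ratios = [(s * s) / (m * m + eps) for s, m in zip(near_end_estimate, mic)]
    return sum(ratios) / len(ratios)
```

Under these assumptions the value is close to 0 when adaptive filtering removes almost everything the microphone picked up, and close to 1 when the near-end speech estimate is nearly the whole microphone signal.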

In one embodiment, please continue to refer to FIG. 3. Step S302 in FIG. 3, obtaining the dual-talk judgment threshold corresponding to the current frame, where the dual-talk judgment threshold is obtained according to the signal-to-return ratio of the input sound signal and the near-end interference signal, may include steps S401 to S403 in FIG. 4, where:

S401: Estimate the signal-to-return ratio of the input sound signal in real time to obtain an average signal-to-return ratio of the current frame in the input sound signal.

The input sound signal is passed to a signal-to-return ratio calculator, which obtains the signal-to-return ratio of the input sound signal in real time. Taking the current frame as the reference point, values quantifying the influence of the near-end interference signal on the input sound signal before the reference point are calculated and averaged to obtain the average signal-to-return ratio of the current frame.

S402: Acquire multiple preset signal-to-return ratio thresholds, and construct multiple signal-to-return ratio intervals according to the multiple thresholds.

The preset thresholds are values obtained from experience or set by technical personnel. The thresholds serve as boundary values that define multiple signal-to-return ratio intervals, and a corresponding dual-talk judgment threshold is set for each signal-to-return ratio interval.

S403: Determine the signal-to-return ratio interval to which the average signal-to-return ratio of the current frame belongs, and obtain the dual-talk judgment threshold corresponding to that interval as the dual-talk judgment threshold of the current frame.

The signal-to-return ratio interval to which the average signal-to-return ratio of the current frame belongs is determined, and the dual-talk judgment threshold set for that interval is used as the dual-talk judgment threshold of the current frame, which completes the operation of step S302 in FIG. 3.

In this embodiment, the signal-to-return ratio of the input sound signal, with the near-end interference signal as the echo source, is calculated in real time by sampling. Different dual-talk judgment thresholds are set according to how strongly the near-end interference signal affects the input sound signal, so that the current utterance state is determined more accurately and the accuracy of echo cancellation for the input sound signal is improved.

Optionally, the average signal-to-return ratio of the current frame in step S401 in FIG. 4 is calculated as follows:

Acquire a near-end interference signal, where the near-end interference signal is a sound signal generated by a sound-producing device at the same end as the sound collection device.

Calculate the average signal-to-return ratio of the current frame in the input sound signal according to the following formula:

where the quantity on the left-hand side of the formula represents the estimated average signal-to-return ratio of the k-th frame, in dB; P_m(k, n) represents the power of the input sound signal at the n-th sample point of the k-th frame; P_x(k, n) represents the power of the near-end interference signal at the n-th sample point of the k-th frame; and mean() takes the average of the values in the brackets.

P_m(k, n) and P_x(k, n) are the power values at the sample points obtained by sampling the input sound signal and the near-end interference signal frame by frame. In the sampling process, n sample points are acquired from the input sound signal and from the near-end interference signal respectively, and the signal frame containing each sample point is the k-th frame. Here, n and k are variable count values.
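The per-frame calculation in step S401 can be sketched as below. Because the formula is not reproduced in the text, the form 10 * log10(mean(P_m(k, n)) / mean(P_x(k, n))) is an assumption that is merely consistent with the stated symbols and the dB unit. The function name `average_ser_db`, the constant `eps`, and the use of squared samples as power are also illustrative.

```python
import math


def average_ser_db(mic, interference, eps=1e-12):
    """Assumed form: 10 * log10(mean(P_m(k, n)) / mean(P_x(k, n))) for frame k.

    mic:          samples of the input sound signal in the frame
    interference: samples of the near-end interference signal in the same frame
    """
    if len(mic) != len(interference) or not mic:
        raise ValueError("frames must be non-empty and of equal length")
    p_m = sum(m * m for m in mic) / len(mic)                    # mean input power
    p_x = sum(x * x for x in interference) / len(interference)  # mean interference power
    return 10.0 * math.log10((p_m + eps) / (p_x + eps))


# Interference ten times larger in amplitude than the input: about -20 dB.
ser = average_ser_db([1.0, -1.0, 1.0, -1.0], [10.0, -10.0, 10.0, -10.0])
```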

In one embodiment, please continue to refer to FIG. 4. Step S402 in FIG. 4, which obtains multiple preset signal-to-return ratio thresholds and constructs multiple signal-to-return ratio intervals according to the thresholds, may include: using each two adjacent thresholds among the acquired signal-to-return ratio thresholds as the boundary values of one signal-to-return ratio interval, to obtain multiple signal-to-return ratio intervals.

Multiple preset signal-to-return ratio thresholds can be stored in advance. When the signal-to-return ratio intervals need to be constructed, the stored thresholds are read from the storage area and sorted by numerical value, and each two adjacent thresholds after sorting are used as the boundary values of one signal-to-return ratio interval, to obtain multiple signal-to-return ratio intervals.

For example, the preset signal-to-return ratio (SER) thresholds are SER_thr_1, SER_thr_2, SER_thr_3, ..., SER_thr_k; FIG. 5 is a schematic diagram of the signal-to-return ratio intervals formed by these thresholds.

The signal-to-return ratio intervals can be denoted as signal-to-return ratio interval 501, signal-to-return ratio interval 502, ..., signal-to-return ratio interval 50k, where k is a variable that indexes the k-th interval 50k; k+1 signal-to-return ratio thresholds define k signal-to-return ratio intervals.

Next, a corresponding dual-talk judgment threshold is set for each signal-to-return ratio interval, namely the dual-talk judgment thresholds m1, m2, ..., mk in FIG. 5. After the interval to which the average signal-to-return ratio of the current frame belongs is determined, the corresponding dual-talk judgment threshold is obtained, which is step S403.

In this embodiment, the signal-to-return ratio intervals are constructed automatically, using the preset signal-to-return ratio thresholds as the interval boundary values.
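The interval construction of step S402 and the lookup of step S403 can be sketched as follows. The numeric thresholds are hypothetical, and two details are assumptions because the text does not specify them: intervals are treated as half-open [low, high), and values outside the outermost thresholds are clamped to the first or last interval.

```python
from bisect import bisect_right


def build_intervals(thresholds):
    """Sort the thresholds and pair adjacent values into (low, high) intervals."""
    ordered = sorted(thresholds)
    return list(zip(ordered[:-1], ordered[1:]))


def select_dt_threshold(avg_ser_db, thresholds, dt_thresholds):
    """Return the dual-talk judgment threshold of the interval containing avg_ser_db.

    Assumptions: intervals are half-open [low, high), and out-of-range values
    are clamped to the first or last interval.
    """
    ordered = sorted(thresholds)
    if len(dt_thresholds) != len(ordered) - 1:
        raise ValueError("k+1 thresholds are needed for k dual-talk thresholds")
    index = bisect_right(ordered, avg_ser_db) - 1
    index = min(max(index, 0), len(dt_thresholds) - 1)
    return dt_thresholds[index]


ser_thresholds = [-30.0, -20.0, -10.0, 0.0]  # hypothetical SER thresholds, in dB
dt_thresholds = [0.1, 0.2, 0.4]              # hypothetical m1..m3
intervals = build_intervals([0.0, -20.0, -30.0, -10.0])
chosen = select_dt_threshold(-15.0, ser_thresholds, dt_thresholds)
```

Sorting once and using a binary search keeps the per-frame lookup cheap even when many thresholds are stored.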

In one embodiment, the utterance state includes two states: only far-end utterance and not only far-end utterance. The preset mapping relationship between the utterance state and the processing mode includes: when the utterance state is only far-end utterance, the near-end speech estimation signal is zeroed or suppressed to be inaudible; when the utterance state is not only far-end utterance, the near-end speech estimation signal is retained.

According to the influence of the near-end speech estimation signal on the actual call, two utterance states can be set, namely only far-end utterance and not only far-end utterance. When it is determined from the near-end speech estimation signal that the current utterance state is only far-end utterance, the near-end speech estimation signal needs to be zeroed or suppressed to be inaudible; that is, the near-end speech estimation signal is filtered out and a mute signal is transmitted to the peer device of the call as the transmission signal of the local end. When it is determined from the near-end speech estimation signal that the current utterance state is not only far-end utterance, the near-end speech estimation signal needs to be retained and is transmitted to the peer device of the call as the transmission signal of the local end.

In this embodiment, two utterance states are defined, and processing methods corresponding to the two utterance states are specified, which can basically meet the requirements of real-time echo cancellation in common voice calls.

The utterance state of not only far-end utterance can be further subdivided into two states: only near-end utterance and double-ended utterance. Only near-end utterance means that the sound collection device collects only the local transmission signal and no near-end interference signal; double-ended utterance means that the sound collection device collects both the local transmission signal and the near-end interference signal. A processing mode can be further specified for each of these two states. For example, for only near-end utterance, no processing is performed and the sound signal is transmitted directly to the opposite end.
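The preset mapping between utterance state and processing mode can be sketched as a lookup table. The names `UtteranceState`, `zero_out`, `retain`, and `PROCESSING_BY_STATE` are hypothetical, and only the two-state variant with zeroing (rather than suppression to inaudible) is shown.

```python
from enum import Enum


class UtteranceState(Enum):
    FAR_END_ONLY = "only far-end utterance"
    NOT_FAR_END_ONLY = "not only far-end utterance"


def zero_out(frame):
    """Replace the frame with silence."""
    return [0.0] * len(frame)


def retain(frame):
    """Pass the frame through unchanged."""
    return list(frame)


# Preset mapping between utterance state and processing mode.
PROCESSING_BY_STATE = {
    UtteranceState.FAR_END_ONLY: zero_out,
    UtteranceState.NOT_FAR_END_ONLY: retain,
}


def process_frame(state, near_end_est):
    """Look up the processing mode for the state and apply it to the frame."""
    return PROCESSING_BY_STATE[state](near_end_est)


silenced = process_frame(UtteranceState.FAR_END_ONLY, [0.3, -0.2])
kept = process_frame(UtteranceState.NOT_FAR_END_ONLY, [0.3, -0.2])
```

Keeping the mapping in a table means that a finer subdivision, such as separate entries for only near-end utterance and double-ended utterance, would only require new table entries.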

In one embodiment, please continue to refer to FIG. 1. Step S102 in FIG. 1 performs adaptive filtering on the input sound signal to obtain a near-end speech estimation signal, which may specifically include two filtering operations, namely linear filtering and non-linear filtering.

The input sound signal is first linearly filtered, for example by an AEC filter, to eliminate part of the echo. After linear filtering, however, the signal still contains linear residual echo and nonlinear echo and, when the near end is speaking, near-end speech as well. Non-linear filtering is then applied to the signal containing residual echo to achieve further echo suppression.

In this embodiment, the adaptive filtering of the input sound signal includes two operations of linear filtering and non-linear filtering, which can further suppress the echo of the input sound signal.
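For illustration only, the linear stage can be sketched with a normalized least-mean-squares (NLMS) filter. The application does not name the adaptive algorithm, so NLMS is a swapped-in, commonly used choice. The function name, the tap count, the step size `mu`, and the two-tap echo path in the demonstration are all hypothetical. The non-linear stage is not shown.

```python
import random


def nlms_linear_stage(mic, ref, taps=4, mu=0.5, eps=1e-8):
    """Return the residual e(n) = mic(n) - w . x(n) of an NLMS adaptive filter.

    mic: input sound signal samples (microphone)
    ref: near-end interference signal samples (loudspeaker reference)
    """
    w = [0.0] * taps  # adaptive filter coefficients
    x = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for d, r in zip(mic, ref):
        x = [r] + x[:-1]
        y = sum(wi * xi for wi, xi in zip(w, x))  # linear echo estimate
        e = d - y                                 # signal after linear filtering
        norm = sum(xi * xi for xi in x) + eps     # reference energy for normalization
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        residual.append(e)
    return residual


rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
# Hypothetical two-tap echo path; the microphone picks up echo only.
mic = [0.5 * ref[n] - (0.3 * ref[n - 1] if n > 0 else 0.0) for n in range(len(ref))]
residual = nlms_linear_stage(mic, ref)
```

In this noise-free, purely linear demonstration the residual decays toward zero once the filter has converged. On a real terminal, nonlinear echo remains after this stage, which is why the text adds a non-linear filtering step.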

The embodiment of the application also provides an echo cancellation device based on double-ended vocalization detection. Please refer to FIG. 6. The device may include an input sound signal acquisition module 601, a filtering module 602, an utterance state determination module 603, a processing mode obtaining module 604, a near-end processing module 605, and an output module 606, where:

The input sound signal acquisition module 601 is used to acquire the input sound signal from the sound collection device.

The filtering module 602 is configured to perform adaptive filtering on the input sound signal to obtain a near-end speech estimation signal.

The utterance state determination module 603 is configured to determine the current utterance state according to the near-end speech estimation signal.

The processing mode obtaining module 604 is configured to obtain a preset mapping relationship between the utterance state and the processing mode, and obtain the processing mode corresponding to the current utterance state according to the mapping relationship.

The near-end processing module 605 is configured to process the near-end speech estimation signal according to the processing mode.

The output module 606 is configured to output the processed near-end speech estimation signal to obtain an output signal.

In an embodiment, the utterance state determination module 603 may include:

A real-time utterance state acquisition unit, configured to calculate the average value of the double-ended utterance state statistics of the current frame in the input sound signal according to the input sound signal and the near-end speech estimation signal;

A threshold obtaining unit, configured to obtain a dual-talk judgment threshold corresponding to the current frame, the dual-talk judgment threshold being obtained according to the signal-to-return ratio of the input sound signal and the near-end interference signal;

The utterance state determination unit is configured to determine the current utterance state according to the magnitude relationship between the average value of the double-ended utterance state statistics of the current frame and the dual-talk judgment threshold.

In an embodiment, the threshold obtaining unit includes:

A current signal-to-return ratio obtaining subunit, configured to estimate the signal-to-return ratio of the input sound signal in real time to obtain the average signal-to-return ratio of the current frame in the input sound signal;

A signal-to-return ratio interval construction subunit, configured to obtain multiple preset signal-to-return ratio thresholds and construct multiple signal-to-return ratio intervals according to the multiple signal-to-return ratio thresholds;

A threshold judging subunit, configured to determine the signal-to-return ratio interval to which the average signal-to-return ratio of the current frame belongs, and obtain the dual-talk judgment threshold corresponding to that interval as the dual-talk judgment threshold of the current frame.

In an embodiment, the above-mentioned signal-to-return ratio interval construction subunit is further configured to use each two adjacent thresholds among the obtained signal-to-return ratio thresholds as the boundary values of one signal-to-return ratio interval, to obtain multiple signal-to-return ratio intervals.

In one embodiment, please continue to refer to FIG. 6. The filtering module 602 in FIG. 6 is further configured to perform linear filtering and non-linear filtering on the input sound signal to obtain the near-end speech estimation signal.

For more details on the working principle and working mode of the echo cancellation device based on double-ended vocalization detection, reference may be made to the related descriptions in Figs. 1 to 5 above, which will not be repeated here.

The embodiment of the present application also provides an echo cancellation system based on double-ended vocalization detection, including a sound collection device, a same-end sounding device, and an echo cancellation device. The echo cancellation device performs the steps of the echo cancellation method based on double-ended vocalization detection provided in FIGS. 1 to 5.

Please refer to FIG. 7, which is a schematic diagram of an echo cancellation system based on double-ended vocalization detection. The system includes a sound collection device 701, an echo cancellation device 702, and a same-end sounding device 703. The sound collection device 701 may be a microphone in telephone communication, used to collect the input sound signal A1. The same-end sounding device 703 may be a speaker located at the same end as the microphone in telephone communication; the sound signal it generates may interfere with the input sound signal A1 and is therefore treated as the interference sound signal A6. The echo cancellation device 702 is a device for implementing the echo cancellation method based on double-ended vocalization detection in FIGS. 1 to 5 of this application. The function of the echo cancellation device can be realized by physical or logic circuits, software programming, and the like.

Optionally, as shown in FIG. 7, the echo cancellation device 702 may include a linear AEC filter 7021, an NLP filter 7022, a double-ended utterance detector 7023, a signal-to-return ratio estimator 7024, a threshold determiner 7025 and a processor 7026.

The echo cancellation device 702 processes the sound signals received from the sound collection device 701 and the same-end sounding device 703 during communication as follows:

The input sound signal A1 is obtained from the sound collection device 701 and subjected to echo cancellation. The input sound signal A1 is linearly filtered by the linear AEC filter 7021 to obtain the linearly filtered sound signal A2, and A2 is then non-linearly filtered by the NLP filter 7022 to obtain the near-end speech estimation signal A3, which is used as one input signal of the double-ended utterance detector 7023. The input sound signal A1 is used directly as the other input signal of the double-ended utterance detector 7023. The linear AEC filter 7021 uses the interference sound signal A6 as the filtering reference to linearly filter the input sound signal A1.

In addition, the input sound signal A1 is input to the signal-to-return ratio estimator 7024, which calculates the average signal-to-return ratio A4 of the current frame of the input sound signal in real time and transmits it to the threshold determiner 7025. Based on the multiple signal-to-return ratio intervals constructed from the multiple preset signal-to-return ratio thresholds, the threshold determiner 7025 determines the dual-talk judgment threshold A5 corresponding to the average signal-to-return ratio A4 of the current frame and sends the dual-talk judgment threshold A5 to the double-ended utterance detector 7023 as the basis for judging the current utterance state. The signal-to-return ratio estimator 7024 samples the input sound signal A1 and the interference sound signal A6, and calculates the average signal-to-return ratio A4 of the current frame according to the following formula:

where the quantity on the left-hand side of the formula represents the estimated average signal-to-return ratio of the k-th frame, in dB; P_m(k, n) represents the power of the input sound signal A1 at the n-th sample point of the k-th frame; P_x(k, n) represents the power of the near-end interference signal A6 at the n-th sample point of the k-th frame; and mean() takes the average of the values in the brackets.

The double-ended utterance detector 7023 acquires the first input signal (that is, the near-end speech estimation signal A3), the second input signal (that is, the input sound signal A1), and the dual-talk judgment threshold A5, and determines the current utterance state A7 in real time based on this information. The current utterance state is obtained from the average value of the double-ended utterance state statistics of the current frame. The double-ended utterance detector 7023 samples the near-end speech estimation signal A3 and the input sound signal A1, and calculates the average value of the double-ended utterance state statistics of the current frame according to the following formula:

where the quantity on the left-hand side of the formula is the average value of the double-ended utterance state statistics of the current frame, P_S(k, n) represents the power of the near-end speech estimation signal A3 at the n-th sample point of the k-th frame, P_m(k, n) represents the power of the input sound signal A1 at the n-th sample point of the k-th frame, and mean() takes the average of the values in the brackets.

The double-ended utterance detector 7023 sends the obtained current utterance state A7 to the processor 7026, and the processor 7026 processes the near-end voice estimation signal A3 according to the current utterance state A7. The processing method is: when the utterance state is only the far-end utterance, the near-end speech estimation signal A3 is zeroed or suppressed to inaudible; when the utterance state is not only the far-end utterance, the near-end speech estimation signal A3 is retained. The processed near-end speech estimation signal is output to obtain an output signal A8, and the output signal A8 can be transmitted to the device of the communication opposite end via the communication link.
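The decision made by the detector 7023 and the gating done by the processor 7026 can be sketched per frame as below. The text here states only that the state follows from the magnitude relationship between the statistic and the threshold, so the direction of the comparison (a statistic below the threshold means only far-end utterance) is an assumption. The function names and string labels are hypothetical.

```python
def judge_utterance_state(stat_mean, dt_threshold):
    """Assumed rule: a statistic below the threshold means only the far end is speaking."""
    return "far_end_only" if stat_mean < dt_threshold else "not_far_end_only"


def produce_output(near_end_est, stat_mean, dt_threshold):
    """Gate the near-end speech estimation signal A3 to form the output signal A8."""
    state = judge_utterance_state(stat_mean, dt_threshold)
    if state == "far_end_only":
        return state, [0.0] * len(near_end_est)  # zeroed output
    return state, list(near_end_est)             # retained output


frame_a3 = [0.2, -0.1, 0.05]
state_low, out_low = produce_output(frame_a3, stat_mean=0.05, dt_threshold=0.2)
state_high, out_high = produce_output(frame_a3, stat_mean=0.60, dt_threshold=0.2)
```

Because the threshold A5 is re-selected for every frame from the average signal-to-return ratio A4, the same statistic value can lead to different decisions at different echo levels.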

Please refer to the application scenario of the echo cancellation method based on double-ended vocalization detection in FIG. 2. The system in FIG. 7 can be applied to the near-end device side in this scenario to eliminate the direct echo S2 and the indirect echo S3 in FIG. 2.

The above-mentioned echo cancellation system based on double-ended vocalization detection performs real-time detection based on the acoustic echo generated in the communication process, and eliminates it according to the detection result, so that the echo cancellation effect can be improved when the voice communication terminal is in the hands-free mode to improve the call quality.

Especially for hands-free voice communication terminals, the echo cancellation method, device and system based on double-ended utterance detection provided in the embodiments of the present application can distinguish in real time between only far-end utterance on the one hand and only near-end utterance or double-ended utterance on the other. When it is judged that only the far end is speaking, the time-domain output is zeroed or suppressed to be inaudible, so that the echo is eliminated to the greatest extent while duplex call performance is preserved. Echo cancellation and duplex performance are thus improved at the same time, which improves the duplex call experience of the hands-free voice communication terminal.

Although this application is disclosed as above, this application is not limited to this. Any person skilled in the art can make various changes and modifications without departing from the spirit and scope of this application. Therefore, the protection scope of this application shall be subject to the scope defined by the claims.

Claims

An echo cancellation method based on double-ended utterance detection, characterized in that the method includes:

Obtaining an input sound signal from a sound collection device;

Performing adaptive filtering on the input sound signal to obtain a near-end speech estimation signal;

Judging the current utterance state according to the near-end speech estimation signal;

Acquiring a preset mapping relationship between a utterance state and a processing manner, and acquiring a processing manner corresponding to the current utterance state according to the mapping relationship;

Processing the near-end speech estimation signal according to the processing manner;

Outputting the processed near-end speech estimation signal to obtain an output signal.
The method according to claim 1, wherein the determining the current utterance state according to the near-end speech estimation signal comprises:

Calculating an average value of the double-ended utterance state statistics of the current frame in the input sound signal according to the input sound signal and the near-end speech estimation signal;

Acquiring a dual-talk judgment threshold corresponding to the current frame, where the dual-talk judgment threshold is obtained according to the signal-to-return ratio of the input sound signal and the near-end interference signal;

Determining the current utterance state according to the magnitude relationship between the average value of the double-ended utterance state statistics of the current frame and the dual-talk judgment threshold.
The method according to claim 2, wherein the calculating the average value of the double-ended utterance state statistics of the current frame in the input sound signal according to the input sound signal and the near-end speech estimation signal comprises:

Calculating the average value of the double-ended utterance state statistics of the current frame according to the following formula:

where the quantity on the left-hand side of the formula is the average value of the double-ended utterance state statistics of the current frame, P_S(k, n) represents the power of the near-end speech estimation signal at the n-th sample point of the k-th frame, P_m(k, n) represents the power of the input sound signal at the n-th sample point of the k-th frame, and mean() takes the average of the values in the brackets.
The method according to claim 2, wherein said obtaining a dual-talk judgment threshold corresponding to the current frame, the dual-talk judgment threshold being obtained according to the signal-to-return ratio of the input sound signal and the near-end interference signal, comprises:

Estimating the signal-to-return ratio of the input sound signal in real time to obtain an average signal-to-return ratio of the current frame in the input sound signal;

Acquiring multiple preset signal-to-return ratio thresholds, and constructing multiple signal-to-return ratio intervals according to the multiple thresholds;

Determining the signal-to-return ratio interval to which the average signal-to-return ratio of the current frame belongs, and obtaining the dual-talk judgment threshold corresponding to that interval as the dual-talk judgment threshold of the current frame.
The method according to claim 4, wherein the real-time estimation of the signal-to-return ratio of the input sound signal to obtain the average signal-to-return ratio of the current frame in the input sound signal comprises:

Acquiring a near-end interference signal, where the near-end interference signal is a sound signal generated by a sound-producing device at the same end as the sound collection device;

Calculating the average signal-to-return ratio of the current frame in the input sound signal according to the following formula:

where the quantity on the left-hand side of the formula represents the estimated average signal-to-return ratio of the k-th frame, in dB; P_m(k, n) represents the power of the input sound signal at the n-th sample point of the k-th frame; P_x(k, n) represents the power of the near-end interference signal at the n-th sample point of the k-th frame; and mean() takes the average of the values in the brackets.
The method according to claim 4, wherein the acquiring multiple preset signal-to-return ratio thresholds, and constructing multiple signal-to-return ratio intervals according to the multiple thresholds, comprises:

Using each two adjacent thresholds among the acquired signal-to-return ratio thresholds as the boundary values of one signal-to-return ratio interval, to obtain multiple signal-to-return ratio intervals.
The method according to claim 1, wherein the utterance state includes two states: only the far-end utterance and not only the far-end utterance, and the mapping relationship between the preset utterance state and the processing mode includes:

When the utterance state is that only the far-end is uttered, zeroing the near-end speech estimation signal or suppressing it to be inaudible;

When the utterance state is not only the far-end utterance, the near-end speech estimation signal is retained.
The method according to claim 7, wherein the non-only far-end utterance includes two states: only near-end utterance and double-ended utterance.
The method according to claim 1, wherein said performing adaptive filtering on said input sound signal to obtain a near-end speech estimation signal comprises:

Performing linear filtering and non-linear filtering on the input sound signal to obtain the near-end speech estimation signal.
An echo cancellation device based on double-ended vocalization detection, characterized in that the device includes:

The input sound signal acquisition module is used to acquire the input sound signal from the sound collection device;

The filtering module is used to perform adaptive filtering on the input sound signal to obtain a near-end speech estimation signal;

The utterance state determination module is configured to determine the current utterance state according to the near-end speech estimation signal;

A processing method acquisition module, configured to acquire a preset mapping relationship between a utterance state and a processing method, and acquire the processing method corresponding to the current utterance state according to the mapping relationship;

The near-end processing module is configured to process the near-end speech estimation signal according to the processing mode;

The output module is used to output the processed near-end speech estimation signal to obtain an output signal.
An echo cancellation system based on double-ended voice detection, comprising a sound collection device, a same-end voice device, and an echo cancellation device, characterized in that the echo cancellation device executes the steps of the method according to any one of claims 1 to 9.