WO2021114779A1 - Echo cancellation method, apparatus, and system employing double-talk detection - Google Patents

Publication number
WO2021114779A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
input sound
utterance
sound signal
double
Prior art date
Application number
PCT/CN2020/114168
Other languages
French (fr)
Chinese (zh)
Inventor
潘思伟
罗本彪
雍雅琴
董斐
林福辉
Original Assignee
展讯通信(上海)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 展讯通信(上海)有限公司
Publication of WO2021114779A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M9/00 Arrangements for interconnection not involving centralised switching
    • H04M9/08 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • H04M9/082 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic using echo cancellers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M9/00 Arrangements for interconnection not involving centralised switching
    • H04M9/08 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Definitions

  • This application relates to the field of voice communication, and in particular to an echo cancellation method, device, and system based on double-talk detection.
  • Acoustic echo is caused by coupling between the loudspeaker and the microphone of a terminal, so that the signal picked up by the telephone microphone contains not only the useful voice signal but also echo. If the microphone signal is not processed, the echo signal is transmitted together with the near-end voice signal to the far-end loudspeaker for playback, and the far-end caller hears a delayed copy of his or her own voice, which is uncomfortable and degrades the call. When the echo is loud, the call cannot even proceed normally. Effective measures must therefore be taken to suppress the echo and eliminate its impact in order to improve the quality of voice communication.
  • Echo cancellation has been an engineering problem ever since Bell invented the telephone.
  • Communication methods and application scenarios have become increasingly diversified, and communication terminals have become more and more compact, so the coupling between loudspeaker and microphone has grown stronger and the echo path has become more complex and variable. This poses a great challenge for acoustic echo cancellation in voice communication systems.
  • Acoustic echo is generally produced in hands-free communication systems. It arises from sound-wave propagation and can generally be divided into two cases: direct echo and indirect echo.
  • Direct echo means that the sound played by the loudspeaker enters the microphone directly, without any reflection, and is picked up. This echo has the shortest delay and depends on factors such as the speech energy of the far-end talker, the distance and angle between the loudspeaker and the microphone, the playback volume of the loudspeaker, and the pickup sensitivity of the microphone.
  • Indirect echo is the set of echoes produced when the sound played by the loudspeaker enters the microphone after being reflected one or more times along different paths. This echo is characterized by a long delay, large delay jitter, and an echo level that is strongly affected by the environment.
  • An adaptive acoustic echo canceller (AEC) is usually used to cancel the echo.
  • The basic principle of AEC can be summarized as adaptively estimating the echo and subtracting the estimated echo from the signal picked up by the microphone.
  • AEC can avoid the influence of echo between the callers; in a hands-free phone, AEC can minimize the echo.
  • The echo cancellation effect of AEC can meet current needs; however, when there is obvious near-end sound, the performance of AEC based on the various existing adaptive filtering algorithms deteriorates, and convergence of the adaptive filtering algorithm cannot even be guaranteed.
  • A typical application of a double-talk detector (DTD) is to freeze AEC updates during double-talk periods to prevent the adaptive filtering algorithm from diverging.
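  • The two ideas above (estimate the echo adaptively and subtract it; stop adapting while a detector flags double talk) can be sketched as follows. This is a minimal illustration, not the filter used in this application: it assumes a normalized LMS (NLMS) update, and the function name `nlms_echo_canceller`, the `freeze` flag list, the tap count, and the two-tap echo path in the demo are all hypothetical.

```python
import random

def nlms_echo_canceller(far_end, mic, taps=4, mu=0.5, eps=1e-8, freeze=None):
    """Return (residual, weights): residual[n] = mic[n] - estimated echo."""
    w = [0.0] * taps      # adaptive estimate of the echo path
    buf = [0.0] * taps    # most recent far-end samples, newest first
    residual = []
    for n, (x, d) in enumerate(zip(far_end, mic)):
        buf = [x] + buf[:-1]
        echo_hat = sum(wi * xi for wi, xi in zip(w, buf))
        e = d - echo_hat  # near-end estimate after echo subtraction
        frozen = freeze is not None and freeze[n]
        if not frozen:    # skip adaptation while double talk is flagged
            norm = sum(xi * xi for xi in buf) + eps
            w = [wi + mu * e * xi / norm for wi, xi in zip(w, buf)]
        residual.append(e)
    return residual, w

# Toy signal: hypothetical two-tap echo path, no near-end speech.
rng = random.Random(0)
far = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.5 * far[n] + (0.25 * far[n - 1] if n else 0.0) for n in range(len(far))]
residual, weights = nlms_echo_canceller(far, mic)
```

  In the toy run the weights approach the assumed echo path; passing `freeze=[True] * len(far)` leaves the weights at their initial values, which is the behaviour a DTD would request during double talk.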
  • Double-talk detection algorithms include energy-based algorithms, algorithms based on signal correlation characteristics, and algorithms based on spectral characteristics.
  • These double-talk detection algorithms all rely on the selection of a fixed threshold: the utterance state is judged by comparing a calculated statistic with the threshold.
  • The fixed-threshold method cannot accurately detect the double-talk state. This not only affects the robustness of echo cancellation but also causes severe clipping of the sound during subsequent processing, that is, the sound transmitted to the remote user becomes intermittent.
  • The main influencing factor in hands-free communication equipment is the signal-to-echo ratio (SER) of the signal received by the microphone, that is, the amplitude (power) ratio of the near-end voice received by the microphone to the echo signal received from the loudspeaker.
  • The SER at the microphone is usually low during hands-free calls, and the distance between the microphone and the near-end talker, the volume of the near-end talker, and the level of the echo all change the SER. As a result, traditional double-talk detection algorithms based on a fixed threshold often fail, and it is difficult to balance duplex performance and echo removal performance in hands-free calling.
  • The echo cancellation technology in the prior art cannot accurately filter out echo interference during double talk, especially in hands-free calls and conference calls, so call quality is easily affected.
  • The technical problem solved by this application is how to better eliminate echo and improve the duplex call experience of a hands-free voice communication terminal.
  • Embodiments of the present application provide an echo cancellation method, device, and system based on double-talk detection. The echo cancellation method based on double-talk detection may include: acquiring an input sound signal from a sound collection device; performing adaptive filtering on the input sound signal to obtain a near-end speech estimation signal; determining the current utterance state according to the near-end speech estimation signal; obtaining a preset mapping relationship between utterance states and processing modes, and obtaining the processing mode corresponding to the current utterance state according to the mapping relationship; processing the near-end speech estimation signal according to the processing mode; and outputting the processed near-end speech estimation signal to obtain an output signal.
  • Determining the current utterance state according to the near-end speech estimation signal includes: calculating the average value of the double-talk state statistic of the current frame in the input sound signal according to the input sound signal and the near-end speech estimation signal; obtaining the double-talk decision threshold corresponding to the current frame, the threshold being obtained according to the SER of the input sound signal and the near-end interference signal; and determining the current utterance state according to the magnitude relationship between the average value of the double-talk state statistic of the current frame and the double-talk decision threshold.
  • Calculating the average value of the double-talk state statistic of the current frame in the input sound signal according to the input sound signal and the near-end speech estimation signal includes calculating it according to a formula (the formula itself is missing from this text) whose terms represent, respectively: the average value of the double-talk state statistic of the current frame; the power of the near-end speech estimation signal at the n-th sample point of the k-th frame; the power of the input sound signal at the n-th sample point of the k-th frame; and the average value of the values in the brackets.
  • Obtaining the double-talk decision threshold corresponding to the current frame includes: estimating the SER of the input sound signal in real time to obtain the average SER of the current frame in the input sound signal; obtaining multiple preset SER thresholds and constructing multiple SER intervals from them; and determining the SER interval to which the average SER of the current frame belongs, and taking the double-talk decision threshold corresponding to that interval as the double-talk decision threshold of the current frame.
  • Estimating the SER of the input sound signal in real time to obtain the average SER of the current frame in the input sound signal includes: acquiring a near-end interference signal, where the near-end interference signal is the sound signal generated by the sound-producing device at the same end as the sound collection device; and calculating the average SER of the current frame in the input sound signal according to a formula (the formula itself is missing from this text) whose terms represent, respectively: the estimated average SER of the k-th frame, in dB; the power of the input sound signal at the n-th sample point of the k-th frame; the power of the near-end interference signal at the n-th sample point of the k-th frame; and the average value of the values in the brackets.
  • Obtaining multiple preset SER thresholds and constructing multiple SER intervals from them includes: using each pair of adjacent thresholds among the acquired SER thresholds as the boundary values of one SER interval, thereby obtaining multiple SER intervals.
  • The utterance state includes two states: far-end utterance only, and not far-end utterance only. The preset mapping relationship between utterance states and processing modes includes: when the utterance state is far-end utterance only, setting the near-end speech estimation signal to zero or suppressing it until it is inaudible; when the utterance state is not far-end utterance only, retaining the near-end speech estimation signal.
  • The state "not far-end utterance only" covers two states: near-end utterance only and double talk.
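  • The two-state mapping described above can be written as a small lookup table. The sketch below is illustrative only; the state labels, the function names, and the use of a plain gain for "suppress to inaudible" are assumptions rather than details taken from this application.

```python
FAR_END_ONLY = "far_end_only"
NOT_FAR_END_ONLY = "not_far_end_only"  # near-end only, or double talk

def suppress(frame, gain=0.0):
    """Zero the frame (gain=0.0) or attenuate it with a small assumed gain."""
    return [gain * s for s in frame]

def keep(frame):
    """Retain the near-end speech estimate unchanged."""
    return list(frame)

# Preset mapping between utterance state and processing mode.
PROCESSING_BY_STATE = {
    FAR_END_ONLY: suppress,
    NOT_FAR_END_ONLY: keep,
}

def process_frame(state, frame):
    return PROCESSING_BY_STATE[state](frame)
```

  For example, `process_frame(FAR_END_ONLY, frame)` returns a frame of zeros, while `process_frame(NOT_FAR_END_ONLY, frame)` returns a copy of the input.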
  • Performing adaptive filtering on the input sound signal to obtain the near-end speech estimation signal includes: performing linear filtering and non-linear filtering on the input sound signal to obtain the near-end speech estimation signal.
  • An embodiment of the present application also provides an echo cancellation device based on double-talk detection.
  • The device includes: an input sound signal acquisition module, configured to acquire an input sound signal from a sound collection device; a filtering module, configured to perform adaptive filtering on the input sound signal to obtain the near-end speech estimation signal; a current utterance state determination module, configured to determine the current utterance state according to the near-end speech estimation signal; a processing mode acquisition module, configured to obtain the preset mapping relationship between utterance states and processing modes and to obtain the processing mode corresponding to the current utterance state according to the mapping relationship; a near-end processing module, configured to process the near-end speech estimation signal according to the processing mode; and an output module, configured to output the processed near-end speech estimation signal to obtain an output signal.
  • An embodiment of the present application also provides an echo cancellation system based on double-talk detection, including a sound collection device, a same-end sound-producing device, and an echo cancellation device, where the echo cancellation device executes the steps of any one of the above methods.
  • An embodiment of the present application provides an echo cancellation method based on double-talk detection.
  • The method includes: acquiring an input sound signal from a sound collection device; adaptively filtering the input sound signal to obtain a near-end speech estimation signal; determining the current utterance state according to the near-end speech estimation signal; obtaining the preset mapping relationship between utterance states and processing modes, and obtaining the processing mode corresponding to the current utterance state according to the mapping relationship; processing the near-end speech estimation signal according to the processing mode; and outputting the processed near-end speech estimation signal to obtain an output signal.
  • Unlike existing communication schemes, in which the input sound signal of a voice call such as a telephone call is transmitted to the peer device either directly or after adaptive echo cancellation only, the technical solution of this method customizes different processing modes for the different utterance states of the input sound signal and accurately filters out the echo in the input sound signal by taking the characteristics of double talk into account. Call quality can be significantly improved, especially in call systems that are strongly affected by double-talk interference, such as hands-free calls and voice conferences.
  • The utterance state is judged in real time for each frame of the input sound signal, so that the processing mode of the near-end speech estimation signal is updated in real time. The input sound signal can thus be echo-cancelled accurately and completely, and the stability of the call is ensured.
  • The SER of the input sound signal, with the near-end interference signal as the echo source, is calculated in real time from samples, and different double-talk decision thresholds are set for different degrees of influence of the near-end interference signal on the input sound signal. This determines the current utterance state more accurately and improves the accuracy of echo cancellation for the input sound signal.
  • Two utterance states are defined, and a processing mode is specified for each of them, which basically meets the requirements of real-time echo cancellation in common voice calls.
  • The adaptive filtering of the input sound signal includes two operations, linear filtering and non-linear filtering, which further suppresses the echo in the input sound signal.
  • The echo cancellation system based on double-talk detection provided by the embodiments of the present application detects the acoustic echo generated during communication in real time and eliminates it according to the detection result, so that the echo cancellation effect is improved when the voice communication terminal is in hands-free mode and call quality is improved.
  • The echo cancellation method, device, and system based on double-talk detection provided in the embodiments of the present application can distinguish in real time between far-end utterance only and the other cases (near-end utterance only or double talk).
  • When only the far end is talking, the time-domain output is set to zero or suppressed until it is inaudible, so that the echo is eliminated to the greatest extent while duplex call performance is preserved, improving echo cancellation and duplex performance at the same time.
  • The purpose is to improve the duplex call experience of the hands-free voice communication terminal.
  • FIG. 1 is a schematic flowchart of an echo cancellation method based on double-talk detection according to an embodiment of the present application;
  • FIG. 2 is a schematic diagram of an application of an echo cancellation method based on double-talk detection according to an embodiment of the present application;
  • FIG. 3 is a schematic flowchart of step S103 in FIG. 1 in an embodiment of the present application;
  • FIG. 4 is a schematic flowchart of step S302 in FIG. 3 in an embodiment of the present application;
  • FIG. 5 is a schematic diagram of SER intervals according to an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of an echo cancellation device based on double-talk detection according to an embodiment of the present application;
  • FIG. 7 is a schematic structural diagram of an echo cancellation system based on double-talk detection according to an embodiment of the present application.
  • The echo cancellation technology in the prior art cannot accurately filter out echo interference during double talk, especially in hands-free calls and conference calls, so call quality is easily affected.
  • To address this, an embodiment of the present application provides an echo cancellation method based on double-talk detection.
  • The method includes: acquiring an input sound signal from a sound collection device; performing adaptive filtering on the input sound signal to obtain a near-end speech estimation signal; determining the current utterance state according to the near-end speech estimation signal; obtaining the preset mapping relationship between utterance states and processing modes, and obtaining the processing mode corresponding to the current utterance state according to the mapping relationship; processing the near-end speech estimation signal according to the processing mode; and outputting the processed near-end speech estimation signal to obtain an output signal.
  • The solution described in this embodiment can filter out the interference signal during double talk and significantly improve call quality.
  • FIG. 1 provides a schematic flowchart of an echo cancellation method based on double-talk detection; the method may specifically include the following steps:
  • S101: Obtain an input sound signal from a sound collection device.
  • The input sound signal is the sound signal collected by the sound collection device.
  • The sound collection device may be a microphone or another device; for a telephone call or a telephone-like call, it is the sound collection device built into a terminal such as a mobile phone, a landline phone, or a computer.
  • A terminal such as a telephone collects the sound at the local end in real time through the sound collection device and transmits it to the opposite end of the call over the communication line.
  • After the sound collection device at the local end collects the input sound signal, the signal is not transmitted directly to the call peer. Instead, echo cancellation is applied to it through steps S102 to S106 below to improve the quality of the voice call.
  • S102: Perform adaptive filtering on the input sound signal to obtain a near-end speech estimation signal.
  • After the input sound signal is acquired from the sound collection device, it is filtered to remove the echo signal that is generated at the local end and interferes with the normal call; the result, with the echo signal filtered out, is the near-end speech estimation signal.
  • The adaptive filtering can use an adaptive echo canceller (that is, an AEC) to filter the input sound signal and obtain the near-end speech estimation signal.
  • S103: Determine the current utterance state according to the near-end speech estimation signal.
  • The utterance state can include different states such as far-end utterance only, double talk, and near-end utterance only.
  • Each utterance state corresponds to a different processing mode for the obtained near-end speech estimation signal. The utterance states can be set as needed and are not limited to the examples mentioned above.
  • The current utterance state is the utterance state determined in real time for the near-end speech estimation signal obtained this time.
  • The corresponding utterance state can be determined from attributes of the speech signal such as its waveform and channel.
  • S104: Obtain a preset mapping relationship between utterance states and processing modes, and obtain the processing mode corresponding to the current utterance state according to the mapping relationship.
  • A processing mode is the way the near-end speech estimation signal is processed in a given utterance state, and may include setting the near-end speech estimation signal to zero (0), retaining it in full, retaining part of it, and so on.
  • The mapping relationship between utterance states and processing modes can be set in advance. After the current utterance state is determined, the corresponding processing mode can be obtained automatically from the mapping relationship.
  • S105: Process the near-end speech estimation signal according to the processing mode.
  • After the processing mode is obtained, the near-end speech estimation signal is processed according to it.
  • S106: Output the processed near-end speech estimation signal to obtain an output signal.
  • The processed near-end speech estimation signal correctly reflects the call information of the local end, and this output signal can be transmitted to the call peer through the communication link.
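  • Steps S102 to S106 can be read as a per-frame pipeline. The sketch below only wires the steps together; `cancel_echo_frame` and its three arguments are hypothetical names, and the filter, detector, and processing table are passed in as stand-ins because their concrete forms are described separately.

```python
def cancel_echo_frame(mic_frame, adaptive_filter, detect_state, processing_by_state):
    """Run one frame of the input sound signal through S102 to S106."""
    near_est = adaptive_filter(mic_frame)        # S102: near-end speech estimate
    state = detect_state(mic_frame, near_est)    # S103: current utterance state
    process = processing_by_state[state]         # S104: look up processing mode
    return process(near_est)                     # S105 + S106: process and output

# Stand-in components, for illustration only.
processing = {
    "far_end_only": lambda frame: [0.0] * len(frame),
    "not_far_end_only": lambda frame: list(frame),
}
halve = lambda frame: [0.5 * s for s in frame]       # pretend echo canceller
always_far_end = lambda mic, est: "far_end_only"     # pretend detector
output = cancel_echo_frame([0.2, 0.4], halve, always_far_end, processing)
```

  Keeping the mapping as a table means a further utterance state can be added without changing the pipeline function itself.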
  • In this embodiment, unlike existing communication schemes in which the input sound signal of a voice call such as a telephone call is transmitted to the peer device either directly or after adaptive echo cancellation only, different processing modes can be customized for the different utterance states of the input sound signal, and the interference or echo in the input sound signal can be accurately filtered out by taking the characteristics of double talk into account, so that call quality can be significantly improved.
  • FIG. 2 provides a schematic diagram of an application of an echo cancellation method based on double-talk detection. In the application scenario shown in FIG. 2, the call parties include a far-end device 200 and a near-end device 210, where the far-end device 200 includes a far-end microphone 201 and a far-end speaker 202, and the near-end device 210 includes a near-end speaker 203 and a near-end microphone 204.
  • The far-end microphone 201 sends the downlink signal S1 to the near-end speaker 203. The direct echo S2 is the sound signal that is emitted by the near-end speaker 203 and picked up directly by the near-end microphone 204. The indirect echo S3 is the sound signal that is emitted by the near-end speaker 203, reflected by the environment, and picked up indirectly by the near-end microphone 204. While the echoes (direct echo S2 and indirect echo S3) are being picked up, a person (not shown) speaks into the near-end microphone 204 (marked "voice" in the figure); the near-end microphone 204 picks up the voice and generates an uplink signal S4, which is sent to the far-end speaker 202 to be played out.
  • The echo cancellation method based on double-talk detection in FIG. 1 can be applied on the near-end microphone 204 side in FIG. 2: after the near-end microphone 204 obtains the input sound signal (that is, the sound signal obtained from the voice in FIG. 2) and before that signal is sent to the far-end device 200, the input sound signal is first processed by the echo cancellation method of FIG. 1.
  • Step S103 in FIG. 1, determining the current utterance state according to the near-end speech estimation signal, may specifically include steps S301 to S303 in FIG. 3.
  • S301: Calculate the average value of the double-talk state statistic of the current frame in the input sound signal according to the input sound signal and the near-end speech estimation signal.
  • The double-talk state statistic of the current frame takes the current frame in the input sound signal as the reference point: the input sound signal and the near-end speech estimation signal before the reference point are each sampled, and the samples of the two signals are compared and calculated to give a quantity that reflects the current utterance state of the input sound signal.
  • The average value is the average of the double-talk state statistic over several sampling points.
  • The average value of the double-talk state statistic of the current frame may be obtained by feeding the input sound signal and the near-end speech estimation signal into the double-talk detector.
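  • One way to picture such a frame statistic is sketched below. The exact formula is not available in this text, so the sketch assumes the statistic is the per-frame mean of the ratio between the power of the near-end speech estimate and the power of the input sound signal, with instantaneous squared samples standing in for power. The function name and the `eps` guard are hypothetical.

```python
def double_talk_statistic(mic_frame, near_est_frame, eps=1e-12):
    """Assumed statistic: mean over the frame of P_e / P_m per sample point."""
    if len(mic_frame) != len(near_est_frame) or not mic_frame:
        raise ValueError("frames must be non-empty and of equal length")
    ratios = [
        (e * e) / (m * m + eps)   # eps avoids division by zero in silence
        for m, e in zip(mic_frame, near_est_frame)
    ]
    return sum(ratios) / len(ratios)

# If the filter removes half the amplitude everywhere, the ratio is about 0.25.
stat = double_talk_statistic([1.0, 1.0, 2.0, 2.0], [0.5, 0.5, 1.0, 1.0])
```

  Under this assumption the statistic stays small when the filter removes most of the microphone signal (echo only) and approaches 1 when most of the microphone signal survives filtering (near-end speech present).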
  • S302: Acquire the double-talk decision threshold corresponding to the current frame, where the threshold is obtained according to the SER of the input sound signal and the near-end interference signal.
  • The SER of the input sound signal is the energy ratio of the signal to the echo in the input sound signal, and can be obtained by calculation on the input sound signal.
  • The near-end interference signal is the interference that the sound produced by the same-end sound-producing device causes at the microphone, and can be obtained from the sound-producing device that corresponds to the sound collection device.
  • The sound-producing device can be, for example, the loudspeaker that corresponds to the local microphone in telephone communication.
  • The double-talk decision threshold is the threshold used to determine which utterance state the average value of the double-talk state statistic of the current frame corresponds to. Multiple utterance-state thresholds, that is, double-talk decision thresholds, are set for the average value of the double-talk state statistic.
  • The double-talk decision threshold is set on the basis of two factors: the SER of the input sound signal and the near-end interference signal.
  • S303: Determine the current utterance state according to the magnitude relationship between the average value of the double-talk state statistic of the current frame and the double-talk decision threshold.
  • According to the magnitude relationship between the average value of the double-talk state statistic of the current frame in the input sound signal and the double-talk decision threshold obtained in step S302, it is determined within which utterance state's threshold interval the average value lies, and the current utterance state is determined accordingly.
  • In this embodiment, the utterance state is judged in real time for each frame of the input sound signal, so that the processing mode of the near-end speech estimation signal is updated in real time. The input sound signal can thus be echo-cancelled accurately and completely, and the stability of the call is ensured.
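  • The per-frame decision of S303 can be sketched as a comparison between each frame's statistic and that frame's threshold. The direction of the comparison is an assumption here (a statistic below the threshold is taken to mean far-end utterance only); the text states only that the magnitude relationship decides the state. The function name and state labels are hypothetical.

```python
def classify_frames(statistics, thresholds):
    """Label each frame by comparing its statistic with its own threshold."""
    return [
        "far_end_only" if stat < thr else "not_far_end_only"
        for stat, thr in zip(statistics, thresholds)
    ]

# The threshold may differ from frame to frame, as selected in S302.
states = classify_frames([0.05, 0.70, 0.30], [0.20, 0.20, 0.40])
```

  Because the threshold is supplied per frame, the same statistic value can lead to different states in frames with different echo conditions.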
  • In step S301, the average value of the double-talk state statistic of the current frame in the input sound signal can be calculated according to a formula (the formula itself is missing from this text).
  • Step S302 in FIG. 3, obtaining the double-talk decision threshold corresponding to the current frame, where the threshold is obtained according to the SER of the input sound signal and the near-end interference signal, may include steps S401 to S403 in FIG. 4, where:
  • S401: Estimate the SER of the input sound signal in real time to obtain the average SER of the current frame in the input sound signal.
  • S402: Acquire multiple preset SER thresholds, and construct multiple SER intervals from the multiple thresholds.
  • The preset thresholds are values obtained from experience or determined by technical personnel.
  • The thresholds serve as boundary values that delimit multiple SER intervals.
  • A corresponding double-talk decision threshold is set for each SER interval.
  • S403: Determine the SER interval to which the average SER of the current frame belongs, and take the double-talk decision threshold corresponding to that SER interval as the double-talk decision threshold of the current frame.
  • In this embodiment, the SER of the input sound signal, with the near-end interference signal as the echo source, is calculated in real time from samples, and different double-talk decision thresholds are set for different degrees of influence of the near-end interference signal on the input sound signal. This determines the current utterance state more accurately and improves the accuracy of echo cancellation for the input sound signal.
  • The average SER of the current frame in step S401 in FIG. 4 is calculated according to a formula (the formula itself is missing from this text) whose terms are defined as follows:
  • The near-end interference signal is the sound signal generated by the sound-producing device at the same end as the sound collection device.
  • P_m(k, n) represents the power of the input sound signal at the n-th sample point of the k-th frame.
  • P_x(k, n) represents the power of the near-end interference signal at the n-th sample point of the k-th frame.
  • mean() represents the average value of the values in the brackets.
  • P_m(k, n) and P_x(k, n) are the power values at the sampling points obtained by sampling the input sound signal and the near-end interference signal, respectively, frame by frame.
  • The sampling process is as follows: n sample points are acquired from each of the input sound signal and the near-end interference signal, and the signal frame to which the sample points belong is the k-th frame. Here n and k are variable count values.
  • Step S402 in FIG. 4, which acquires multiple preset signal-to-echo ratio (SER) thresholds and constructs multiple SER intervals from them, may include: using each pair of adjacent SER thresholds among those acquired as the boundary values of one SER interval, to obtain multiple SER intervals.
  • Each threshold thus serves as a boundary value of an SER interval.
  • The preset SER thresholds are SER_thr_1, SER_thr_2, SER_thr_3, ..., SER_thr_k, and the SER intervals are formed by these thresholds.
  • The SER intervals can be denoted SER interval 501, SER interval 502, ..., SER interval 50k, where k is a variable index denoting the k-th SER interval 50k; k+1 SER thresholds yield k SER intervals.
  • A corresponding double-talk decision threshold is set for each SER interval, namely the double-talk decision thresholds m1, m2, ..., mk in FIG. 5.
  • The double-talk decision threshold corresponding to the interval is then obtained, that is, step S403.
  • The SER intervals are constructed automatically, using the preset SER thresholds as interval boundary values.
  • The utterance state includes two states, far-end-only utterance and not far-end-only utterance, and the preset mapping between utterance states and processing modes includes: when the utterance state is far-end-only utterance, the near-end speech estimation signal is zeroed or suppressed to inaudibility; when the utterance state is not far-end-only utterance, the near-end speech estimation signal is retained.
  • Two utterance states can be set, namely far-end-only utterance and not far-end-only utterance.
  • In the far-end-only state, the near-end speech estimation signal needs to be zeroed or suppressed to inaudibility; that is, the near-end speech estimation signal is filtered out, and a silent signal is transmitted to the peer device of the call as the local end's transmission signal.
  • When it is determined from the near-end speech estimation signal that the current utterance state is not far-end-only utterance, the near-end speech estimation signal needs to be retained and is transmitted to the peer device of the call as the local end's transmission signal.
  • Two utterance states are defined, and the processing mode for each is specified, which can basically meet the requirements of real-time echo cancellation in common voice calls.
  • The not far-end-only utterance state includes two states: near-end-only utterance and double-talk.
  • Near-end-only utterance means that the sound collection device collects only the local end's transmission signal and no near-end interference signal; double-talk means that the sound collection device collects both the local end's transmission signal and the near-end interference signal.
  • The processing mode can be further specified for these two states; for example, for near-end-only utterance, the speech signal is transmitted directly to the peer end without processing.
  • Step S102 in FIG. 1 performs adaptive filtering on the input sound signal to obtain a near-end speech estimation signal, which may specifically include two filtering operations: linear filtering and non-linear filtering.
  • The input sound signal is linearly filtered in a filter such as an AEC to eliminate part of the echo.
  • The linearly filtered signal still contains linear residual echo and non-linear echo.
  • When there is near-end utterance, it also contains near-end speech.
  • Subsequent non-linear processing (NLP) filtering of the sound signal containing residual echo achieves further echo suppression.
  • The adaptive filtering of the input sound signal includes two operations, linear filtering and non-linear filtering, which further suppresses the echo in the input sound signal.
  • An embodiment of the application also provides an echo cancellation apparatus based on double-talk detection; see FIG. 6.
  • The apparatus may include an input sound signal acquisition module 601, a filtering module 602, an utterance state determination module 603, a processing mode acquisition module 604, a near-end processing module 605, and an output module 606, where:
  • The input sound signal acquisition module 601 is configured to acquire the input sound signal from the sound collection device.
  • The filtering module 602 is configured to perform adaptive filtering on the input sound signal to obtain a near-end speech estimation signal.
  • The utterance state determination module 603 is configured to determine the current utterance state according to the near-end speech estimation signal.
  • The processing mode acquisition module 604 is configured to acquire a preset mapping between utterance states and processing modes, and to obtain the processing mode corresponding to the current utterance state according to the mapping.
  • The near-end processing module 605 is configured to process the near-end speech estimation signal according to the processing mode.
  • The output module 606 is configured to output the processed near-end speech estimation signal to obtain an output signal.
  • The utterance state determination module 603 may include:
  • A real-time utterance state acquisition unit, configured to calculate, from the input sound signal and the near-end speech estimation signal, the average value of the double-talk state statistic of the current frame in the input sound signal;
  • A threshold acquisition unit, configured to obtain the double-talk decision threshold corresponding to the current frame, the double-talk decision threshold being obtained from the signal-to-echo ratio (SER) of the input sound signal and the near-end interference signal;
  • An utterance state determination unit, configured to determine the current utterance state according to the magnitude relationship between the average value of the double-talk state statistic of the current frame and the double-talk decision threshold.
  • The threshold acquisition unit includes:
  • A current SER acquisition subunit, configured to estimate the SER of the input sound signal in real time to obtain the average SER of the current frame in the input sound signal;
  • An SER interval construction subunit, configured to acquire multiple preset SER thresholds and construct multiple SER intervals from them;
  • A threshold determination subunit, configured to determine the SER interval to which the average SER of the current frame belongs, and to take the double-talk decision threshold corresponding to that interval as the double-talk decision threshold of the current frame.
  • The SER interval construction subunit is further configured to use each pair of adjacent SER thresholds among those acquired as the boundary values of an SER interval, to obtain multiple SER intervals.
  • The filtering module 602 in FIG. 6 is further configured to perform linear filtering and non-linear filtering on the input sound signal to obtain the near-end speech estimation signal.
  • An embodiment of the present application also provides an echo cancellation system based on double-talk detection, including a sound collection device, a same-end sounding device, and an echo cancellation device.
  • The echo cancellation device performs the steps of the echo cancellation method based on double-talk detection shown in FIGS. 1 to 5.
  • FIG. 7 is a schematic diagram of an echo cancellation system based on double-talk detection; the system includes a sound collection device 701, an echo cancellation device 702, and a same-end sounding device 703.
  • The sound collection device 701 may be a microphone in telephone communication, used to collect the input sound signal A1.
  • The same-end sounding device 703 may be a loudspeaker at the same end as the microphone in telephone communication; the sound signal it generates may interfere with the input sound signal A1, so it is treated as the interference sound signal A6.
  • The echo cancellation device 702 is a device for implementing the echo cancellation method based on double-talk detection shown in FIGS. 1 to 5 of this application.
  • The function of the echo cancellation device can be realized by physical or logic circuits, software programming, and the like.
  • The echo cancellation device 702 may include a linear AEC filter 7021, an NLP filter 7022, a double-talk detector 7023, a signal-to-echo ratio (SER) estimator 7024, a threshold determiner 7025, and a processor 7026.
  • During communication, the echo cancellation device 702 processes the sound signals received from the sound collection device 701 and the same-end sounding device 703 as follows:
  • The input sound signal A1 is linearly filtered by the linear AEC filter 7021 to obtain the linearly filtered sound signal A2.
  • The NLP filter 7022 then applies non-linear filtering to A2 to obtain the near-end speech estimation signal A3, which is used as one input signal of the double-talk detector 7023.
  • The input sound signal A1 is used directly as the other input signal of the double-talk detector 7023.
  • The linear AEC filter 7021 uses the interference sound signal A6 as the filtering reference when linearly filtering the input sound signal A1.
  • The input sound signal A1 is input to the signal-to-echo ratio (SER) estimator 7024, which calculates the average SER A4 of the current frame of the input sound signal in real time and passes it to the threshold determiner 7025. Based on the SER intervals constructed from the preset SER thresholds, the threshold determiner 7025 determines the double-talk decision threshold A5 corresponding to the average SER A4 of the current frame and sends A5 to the double-talk detector 7023 as the basis for judging the current utterance state.
  • The SER estimator 7024 samples the input sound signal A1 and the interference sound signal A6, and calculates the average SER A4 of the current frame according to the following formula:
  • The double-talk detector 7023 acquires the first input signal (the near-end speech estimation signal A3), the second input signal (the input sound signal A1), and the double-talk decision threshold A5, and determines the current utterance state A7 in real time from this information.
  • The current utterance state is obtained from the average value of the double-talk state statistic of the current frame.
  • The double-talk detector 7023 samples the near-end speech estimation signal A3 and the input sound signal A1, and calculates the average value of the double-talk state statistic of the current frame according to the following formula:
  • The double-talk detector 7023 sends the current utterance state A7 to the processor 7026, and the processor 7026 processes the near-end speech estimation signal A3 according to the current utterance state A7.
  • The processing mode is: when the utterance state is far-end-only utterance, the near-end speech estimation signal A3 is zeroed or suppressed to inaudibility; when the utterance state is not far-end-only utterance, the near-end speech estimation signal A3 is retained.
  • The processed near-end speech estimation signal is output as the output signal A8, which can be transmitted over the communication link to the device at the other end of the call.
  • The above echo cancellation system based on double-talk detection detects the acoustic echo generated during communication in real time and eliminates it according to the detection result, which improves echo cancellation, and therefore call quality, when the voice communication terminal is in hands-free mode.
  • The echo cancellation method, apparatus, and system based on double-talk detection provided in the embodiments of the present application can distinguish in real time between far-end-only utterance on the one hand and near-end-only utterance or double-talk on the other.
  • In the far-end-only state, the time-domain output is zeroed or suppressed to inaudibility, so that echo is eliminated to the greatest extent while duplex call performance is preserved, improving echo cancellation and duplex performance at the same time.
  • The purpose is to improve the duplex call experience of hands-free voice communication terminals.
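
The per-frame signal flow through the echo cancellation device 702 can be sketched as below. This is a minimal Python sketch under stated assumptions, not the patent's implementation: the class name `EchoCancellationDevice`, its field names, and the convention that the detector returns `True` for far-end-only utterance are hypothetical. The stages are passed in as callables because the text does not specify their algorithms. The labels A1 to A8 follow FIG. 7.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]


@dataclass
class EchoCancellationDevice:
    """Hypothetical wiring of the FIG. 7 components; each stage is pluggable."""
    linear_filter: Callable[[Frame, Frame], Frame]      # stands in for linear AEC filter 7021
    nlp_filter: Callable[[Frame], Frame]                # stands in for NLP filter 7022
    ser_estimator: Callable[[Frame, Frame], float]      # stands in for SER estimator 7024
    threshold_determiner: Callable[[float], float]      # stands in for threshold determiner 7025
    dt_detector: Callable[[Frame, Frame, float], bool]  # stands in for detector 7023

    def process(self, a1: Frame, a6: Frame) -> Frame:
        """Map input sound A1 and interference sound A6 to output A8 for one frame."""
        a2 = self.linear_filter(a1, a6)              # linearly filtered signal A2
        a3 = self.nlp_filter(a2)                     # near-end speech estimate A3
        a4 = self.ser_estimator(a1, a6)              # average SER of the frame, A4
        a5 = self.threshold_determiner(a4)           # double-talk decision threshold A5
        far_end_only = self.dt_detector(a3, a1, a5)  # utterance state A7 (assumed boolean)
        # Processor 7026: zero the estimate for far-end-only frames, else retain it.
        return [0.0] * len(a3) if far_end_only else a3
```

Keeping each stage behind a callable lets the threshold logic be tested independently of whichever filter is used.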

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

An echo cancellation method, device, and system employing double-talk detection. The method comprises: acquiring an input sound signal from a sound collection device; performing adaptive filtering on the input sound signal to obtain a near-end speech estimation signal; determining the current sound production status according to the near-end speech estimation signal; acquiring preset mappings between sound production statuses and processing procedures, and acquiring, according to the mappings, a processing procedure corresponding to the current sound production status; processing the near-end speech estimation signal according to the processing procedure; and outputting the processed near-end speech estimation signal to obtain an output signal. The method improves echo cancellation, and improves the two-way conversation experience for hands-free speech communication terminals.

Description

Echo cancellation method, apparatus, and system based on double-talk detection
This application claims priority to Chinese patent application No. 201911284296.3, filed with the Chinese Patent Office on December 13, 2019 and entitled "Echo cancellation method, apparatus, and system based on double-talk detection", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of voice communication, and in particular to an echo cancellation method, apparatus, and system based on double-talk detection.
Background
In a telephone terminal, acoustic echo arises from the coupling between the loudspeaker and the terminal microphone, so that the microphone signal contains not only the useful speech signal but also echo. If the microphone signal is not processed, the echo signal is transmitted together with the near-end speech signal to the far-end loudspeaker and played back, and the far-end talker hears a delayed copy of their own voice. This is uncomfortable and degrades the call. When the echo is strong, the call may not even be able to proceed normally. Effective measures must therefore be taken to suppress the echo and eliminate its effects in order to improve voice communication quality.
For example, acoustic echo is present to varying degrees in systems such as teleconferencing and hands-free telephony. Echo cancellation has been an engineering problem ever since Bell invented the telephone. In recent years, with the rapid development of information technology, communication methods and application scenarios have become increasingly diverse and communication terminals increasingly small, so that the coupling between loudspeaker and microphone grows stronger and the echo path grows more complex and variable. This poses a great challenge for acoustic echo cancellation in voice communication.
Acoustic echo generally arises in hands-free communication systems and is echo produced by sound wave propagation. It can generally be divided into two kinds: direct echo and indirect echo. Direct echo is sound played by the loudspeaker that travels directly into the microphone without any reflection and is picked up again. This echo has the shortest delay and depends on factors such as the far-end talker's speech energy, the distance and angle between the loudspeaker and the microphone, the loudspeaker's playback volume, and the microphone's pickup sensitivity. Indirect echo is the set of echoes produced when sound played by the loudspeaker enters the microphone after one or more reflections along different paths. This echo is characterized by long delay, large delay jitter, and an echo level that depends strongly on the environment.
In the prior art, an adaptive acoustic echo canceller (AEC) is usually used to cancel echo. The basic principle of the AEC is to adaptively estimate the echo and subtract the estimate from the signal picked up by the microphone. In telephone circuits, the AEC protects talkers from echo regardless of distance; in hands-free telephones, the AEC minimizes echo. When there is no near-end sound, the AEC's echo cancellation meets current needs. When there is significant near-end sound, however, the performance of AECs based on existing adaptive filtering algorithms degrades, and convergence of the adaptive filtering algorithm cannot even be guaranteed. This is the key problem that echo cancellation must solve in practice, usually called the double-talk (DT) problem. To reduce or avoid the impact of double-talk on AEC performance, a double-talk detector (DTD) can be used. A typical application of the DTD is to freeze AEC adaptation during double-talk periods to prevent the adaptive filtering algorithm from diverging.
A DTD operates on the basis of a double-talk detection algorithm. Double-talk detection algorithms include energy-based algorithms, algorithms based on signal correlation properties, and algorithms based on spectral features. All of these depend on choosing a fixed threshold and judge the utterance state by comparing a computed statistic with that threshold. However, because real channels and call conditions vary, the fixed-threshold approach cannot detect the double-talk state accurately. This not only harms the robustness of echo cancellation but also causes severe speech clipping in subsequent processing, that is, the sound transmitted to the far-end user becomes choppy.
In hands-free call devices the main influencing factor is the signal-to-echo ratio (SER) of the signal received by the microphone, that is, the amplitude (power) ratio of the near-end speech received by the microphone to the echo signal it receives from the loudspeaker. Compared with handheld calls, the microphone SER in hands-free calls is usually lower, and the distance between the microphone and the near-end talker, the near-end talker's volume, the echo level, and so on all cause the SER to vary. As a result, traditional double-talk detection algorithms based on a fixed threshold often fail, and it is difficult to balance duplex performance and echo removal in hands-free calls.
In summary, prior-art echo cancellation techniques cannot accurately filter out echo interference in the presence of double-talk, especially in hands-free calls and teleconferences, and call quality is easily affected.
Summary
The technical problem solved by this application is how to cancel echo better and improve the duplex call experience of hands-free voice communication terminals.
To solve the above technical problem, embodiments of this application provide an echo cancellation method, apparatus, and system based on double-talk detection. The echo cancellation method based on double-talk detection may include: acquiring an input sound signal from a sound collection device; performing adaptive filtering on the input sound signal to obtain a near-end speech estimation signal; determining the current utterance state according to the near-end speech estimation signal; acquiring a preset mapping between utterance states and processing modes, and obtaining, according to the mapping, the processing mode corresponding to the current utterance state; processing the near-end speech estimation signal according to the processing mode; and outputting the processed near-end speech estimation signal to obtain an output signal.
Optionally, determining the current utterance state according to the near-end speech estimation signal includes: calculating, from the input sound signal and the near-end speech estimation signal, the average value of the double-talk state statistic of the current frame in the input sound signal; obtaining the double-talk decision threshold corresponding to the current frame, the double-talk decision threshold being obtained from the signal-to-echo ratio (SER) of the input sound signal and the near-end interference signal; and determining the current utterance state according to the magnitude relationship between the average value of the double-talk state statistic of the current frame and the double-talk decision threshold.
Optionally, calculating, from the input sound signal and the near-end speech estimation signal, the average value of the double-talk state statistic of the current frame in the input sound signal includes: calculating the average value of the double-talk state statistic of the current frame according to a formula whose terms are: the average value of the double-talk state statistic of the current frame; the power of the near-end speech estimation signal at the n-th sample point of the k-th frame; P_m(k, n), the power of the input sound signal at the n-th sample point of the k-th frame; and mean(), which denotes the average of the values in the brackets.
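
The formula itself is not reproduced in this text, so the following Python sketch is only one plausible reading of the terms listed above, not the patent's definition. It assumes the statistic is the frame average of the per-sample ratio of near-end-estimate power to input power, and that a small value indicates far-end-only utterance (little signal survives filtering). The names `double_talk_statistic`, `is_far_end_only`, and `EPS` are hypothetical.

```python
from typing import Sequence

EPS = 1e-12  # assumed guard against division by zero; not from the source


def double_talk_statistic(p_est: Sequence[float], p_in: Sequence[float]) -> float:
    """Frame average of P_est(k, n) / P_m(k, n) over sample index n (assumed form)."""
    if len(p_est) != len(p_in) or not p_est:
        raise ValueError("frames must be non-empty and of equal length")
    ratios = [e / (m + EPS) for e, m in zip(p_est, p_in)]
    return sum(ratios) / len(ratios)


def is_far_end_only(statistic: float, threshold: float) -> bool:
    """Assumed decision rule: a statistic below the threshold means far-end-only."""
    return statistic < threshold
```

The direction of the comparison is an assumption; the text only says the state is decided from the magnitude relationship between the statistic and the threshold.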
Optionally, obtaining the double-talk decision threshold corresponding to the current frame, the double-talk decision threshold being obtained from the signal-to-echo ratio (SER) of the input sound signal and the near-end interference signal, includes: estimating the SER of the input sound signal in real time to obtain the average SER of the current frame in the input sound signal; acquiring multiple preset SER thresholds and constructing multiple SER intervals from them; and determining the SER interval to which the average SER of the current frame belongs, and taking the double-talk decision threshold corresponding to that interval as the double-talk decision threshold of the current frame.
Optionally, estimating the SER of the input sound signal in real time to obtain the average SER of the current frame in the input sound signal includes: acquiring a near-end interference signal, the near-end interference signal being a sound signal generated by a sounding device at the same end as the sound collection device; and calculating the average SER of the current frame in the input sound signal according to a formula whose terms are: the estimated average SER of the k-th frame, in dB; P_m(k, n), the power of the input sound signal at the n-th sample point of the k-th frame; P_x(k, n), the power of the near-end interference signal at the n-th sample point of the k-th frame; and mean(), which denotes the average of the values in the brackets.
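
Because the formula is not reproduced in this text, the Python sketch below shows only one form consistent with the listed terms and the dB unit: ten times the base-10 logarithm of the ratio of the frame-averaged powers. Whether the source averages before or after taking the ratio is not recoverable, so this ordering is an assumption, as are the names `average_ser_db` and `EPS`.

```python
import math
from typing import Sequence

EPS = 1e-12  # assumed guard against log(0) and division by zero; not from the source


def average_ser_db(p_mic: Sequence[float], p_ref: Sequence[float]) -> float:
    """10 * log10(mean(P_m(k, n)) / mean(P_x(k, n))) for one frame (assumed form)."""
    if len(p_mic) != len(p_ref) or not p_mic:
        raise ValueError("frames must be non-empty and of equal length")
    mean_mic = sum(p_mic) / len(p_mic)  # mean input-signal power over the frame
    mean_ref = sum(p_ref) / len(p_ref)  # mean interference-signal power over the frame
    return 10.0 * math.log10((mean_mic + EPS) / (mean_ref + EPS))
```

For instance, an input frame with ten times the mean power of the interference frame gives approximately 10 dB under this reading.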
Optionally, acquiring multiple preset signal-to-echo ratio (SER) thresholds and constructing multiple SER intervals from them includes: using each pair of adjacent SER thresholds among those acquired as the boundary values of an SER interval, to obtain multiple SER intervals.
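
The interval construction and lookup can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation: the function name `select_dt_threshold`, the half-open interval convention, and the clamping of out-of-range values to the first or last interval are assumptions. It relies on `bisect_right` from the Python standard library.

```python
from bisect import bisect_right
from typing import Sequence


def select_dt_threshold(ser_db: float,
                        ser_bounds: Sequence[float],
                        dt_thresholds: Sequence[float]) -> float:
    """Pick the double-talk decision threshold of the SER interval containing ser_db.

    ser_bounds holds K+1 ascending SER thresholds; adjacent pairs bound K intervals,
    and dt_thresholds holds one decision threshold per interval.
    """
    if len(ser_bounds) != len(dt_thresholds) + 1:
        raise ValueError("need exactly one more SER bound than decision thresholds")
    if list(ser_bounds) != sorted(ser_bounds):
        raise ValueError("SER bounds must be in ascending order")
    # Interval i is treated as [ser_bounds[i], ser_bounds[i + 1]) (assumed convention).
    idx = bisect_right(ser_bounds, ser_db) - 1
    # Values outside all intervals fall back to the nearest one (assumption).
    idx = min(max(idx, 0), len(dt_thresholds) - 1)
    return dt_thresholds[idx]
```

With bounds `[-20, -10, 0, 10]` and decision thresholds `[0.1, 0.2, 0.3]`, an average SER of -15 dB falls in the first interval and selects 0.1.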
Optionally, the utterance state includes two states, far-end-only utterance and not far-end-only utterance, and the preset mapping between utterance states and processing modes includes: when the utterance state is far-end-only utterance, zeroing the near-end speech estimation signal or suppressing it to inaudibility; and when the utterance state is determined to be not far-end-only utterance, retaining the near-end speech estimation signal.
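
The mapping between utterance states and processing modes amounts to a lookup table. The Python sketch below illustrates that idea only; the names `TalkState`, `PROCESSING`, `mute`, `keep`, and `process_frame` are hypothetical, and zeroing is chosen as the far-end-only action (the text also allows suppression to inaudibility).

```python
from enum import Enum
from typing import Callable, Dict, List

Frame = List[float]


class TalkState(Enum):
    FAR_END_ONLY = "far_end_only"
    NOT_FAR_END_ONLY = "not_far_end_only"


def mute(frame: Frame) -> Frame:
    """Zero the near-end speech estimate for the whole frame."""
    return [0.0] * len(frame)


def keep(frame: Frame) -> Frame:
    """Retain the near-end speech estimate unchanged (returned as a copy)."""
    return list(frame)


# Preset mapping from utterance state to processing mode.
PROCESSING: Dict[TalkState, Callable[[Frame], Frame]] = {
    TalkState.FAR_END_ONLY: mute,
    TalkState.NOT_FAR_END_ONLY: keep,
}


def process_frame(state: TalkState, frame: Frame) -> Frame:
    """Apply the processing mode that the mapping assigns to the current state."""
    return PROCESSING[state](frame)
```

A table keeps the state decision separate from the action, so further states such as near-end-only utterance could be given their own entries.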
Optionally, the not far-end-only utterance state includes two states: near-end-only utterance and double-talk.
Optionally, performing adaptive filtering on the input sound signal to obtain the near-end speech estimation signal includes: performing linear filtering and non-linear filtering on the input sound signal to obtain the near-end speech estimation signal.
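
The text does not name the filtering algorithms, so the Python sketch below swaps in two common choices purely for illustration: a normalized least-mean-squares (NLMS) adaptive filter for the linear stage, and a simple center-clipping-style rule for the non-linear stage. The function names and the values of `taps`, `mu`, `eps`, and `floor` are assumptions.

```python
from typing import List, Sequence


def nlms_echo_cancel(mic: Sequence[float], ref: Sequence[float],
                     taps: int = 8, mu: float = 0.5, eps: float = 1e-8) -> List[float]:
    """Linear stage: subtract an NLMS estimate of the echo from the microphone signal."""
    w = [0.0] * taps      # adaptive filter weights
    x_buf = [0.0] * taps  # most recent reference samples, newest first
    out: List[float] = []
    for d, x in zip(mic, ref):
        x_buf = [x] + x_buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, x_buf))  # echo estimate
        e = d - y                                      # residual after cancellation
        norm = sum(xi * xi for xi in x_buf) + eps      # reference energy in the buffer
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x_buf)]
        out.append(e)
    return out


def suppress_residual(residual: Sequence[float], floor: float = 1e-3) -> List[float]:
    """Non-linear stage: zero samples whose magnitude is at or below an assumed floor."""
    return [s if abs(s) > floor else 0.0 for s in residual]
```

Chaining the two, as in `suppress_residual(nlms_echo_cancel(mic, ref))`, yields a rough counterpart of the near-end speech estimation signal.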
本申请实施例还提供一种基于双端发声检测的回声消除装置,所述装置包括:输入声音信号获取模块,用于从声音采集设备获取输入声音信号;滤波模块,用于对所述输入声音信号进行自适应滤波,得到近端语音估计信号;当前的发声状态判定模块,用于根据所述近端语音估计信号判定当前的发声状态;处理方式获取模块,用于获取预设的发声状态与处理方式之间的映射关系,根据所述映射关系获取所述当前的发声状态对应的处理方式;近端处理模块,用于根据所述处理方式对所述近端语音估计信号进行处理;输出模块,用于将处理后的近端语音估计信号输出得到输出信号。An embodiment of the present application also provides an echo cancellation device based on double-ended vocalization detection. The device includes: an input sound signal acquisition module for acquiring an input sound signal from a sound collection device; a filtering module for evaluating the input sound The signal is adaptively filtered to obtain the near-end speech estimation signal; the current utterance state determination module is used to determine the current utterance state according to the near-end speech estimation signal; the processing method acquisition module is used to obtain the preset utterance state and The mapping relationship between the processing modes, the processing mode corresponding to the current utterance state is obtained according to the mapping relationship; a near-end processing module, configured to process the near-end speech estimation signal according to the processing mode; an output module , Used to output the processed near-end speech estimation signal to obtain an output signal.
本申请实施例还提供一种基于双端发声检测的回声消除系统,包括声音采集设备、同端发声设备和回声消除设备,所述回声消除设备执行上述任一项所述方法的步骤。The embodiment of the present application also provides an echo cancellation system based on double-ended voice detection, including a sound collection device, a same-end voice device, and an echo cancellation device, and the echo cancellation device executes the steps of any one of the above-mentioned methods.
Compared with the prior art, the technical solutions of the embodiments of the present application have the following beneficial effects:
An embodiment of the present application provides an echo cancellation method based on double-talk detection. The method includes: acquiring an input sound signal from a sound collection device; adaptively filtering the input sound signal to obtain a near-end speech estimation signal; determining the current utterance state from the near-end speech estimation signal; obtaining a preset mapping between utterance states and processing modes, and obtaining from the mapping the processing mode corresponding to the current utterance state; processing the near-end speech estimation signal according to the processing mode; and outputting the processed near-end speech estimation signal as an output signal. Existing communication schemes either transmit the input sound signal of a voice call (such as a telephone call) directly to the peer device, or transmit it after adaptive echo cancellation alone. In contrast, the technical solution of this method applies a different processing mode for each utterance state of the input sound signal, so that the echo in the input sound signal is filtered out precisely in light of the characteristics of double-talk. Call quality can be improved significantly, especially in call systems that are strongly affected by double-talk interference, such as hands-free calls and voice conferences.
Further, the utterance state is determined in real time for each frame of the input sound signal, so that the processing mode applied to the near-end speech estimation signal is updated in real time. Echo cancellation can therefore be performed completely and accurately on the input sound signal, keeping the call stable.
Further, the signal-to-echo ratio of the input sound signal, with the near-end interference signal as the echo source, is computed in real time by sampling. Different double-talk decision thresholds are set according to how strongly the near-end interference signal affects the input sound signal, so that the current utterance state is determined more accurately and the accuracy of echo cancellation on the input sound signal is improved.
Further, two utterance states are defined, together with the processing mode for each, which basically meets the requirements of real-time echo cancellation in common voice calls.
Further, the adaptive filtering of the input sound signal comprises two operations, linear filtering and non-linear filtering, which further suppresses the echo in the input sound signal.
The echo cancellation system based on double-talk detection provided by the embodiments of the present application detects the acoustic echo generated during communication in real time and cancels it according to the detection result. This improves echo cancellation when the voice communication terminal is in hands-free mode and thus improves call quality. In particular, for hands-free voice communication terminals, the echo cancellation method, apparatus, and system based on double-talk detection provided in the embodiments can distinguish in real time between far-end only talk on the one hand and near-end only talk or double-talk on the other. When far-end only talk is determined, the time-domain output is set to zero or suppressed until inaudible, so that echo is cancelled to the greatest extent while duplex call performance is preserved. Echo cancellation and duplex performance are thereby improved together, which improves the duplex call experience on hands-free voice communication terminals.
Brief description of the drawings
FIG. 1 is a schematic flowchart of an echo cancellation method based on double-talk detection according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an application of an echo cancellation method based on double-talk detection according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of step S103 of FIG. 1 in an embodiment of the present application;
FIG. 4 is a schematic flowchart of step S302 of FIG. 3 in an embodiment of the present application;
FIG. 5 is a schematic diagram of signal-to-echo ratio intervals according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an echo cancellation apparatus based on double-talk detection according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an echo cancellation system based on double-talk detection according to an embodiment of the present application.
Detailed description
As stated in the background, echo cancellation techniques in the prior art cannot accurately filter out echo interference in double-talk situations, especially in hands-free calls and conference calls, so call quality is easily degraded.
To solve the above technical problem, an embodiment of the present application provides an echo cancellation method based on double-talk detection. The method includes: acquiring an input sound signal from a sound collection device; adaptively filtering the input sound signal to obtain a near-end speech estimation signal; determining the current utterance state from the near-end speech estimation signal; obtaining a preset mapping between utterance states and processing modes, and obtaining from the mapping the processing mode corresponding to the current utterance state; processing the near-end speech estimation signal according to the processing mode; and outputting the processed near-end speech estimation signal as an output signal.
The solution of this embodiment can filter out the interference signal during double-talk and significantly improve call quality.
Referring to FIG. 1, which is a schematic flowchart of an echo cancellation method based on double-talk detection, the method may include the following steps:
S101: Acquire an input sound signal from a sound collection device.
The input sound signal is the sound signal captured by the sound collection device. The sound collection device may be a microphone or a similar device; in a telephone or telephone-like call, it is the sound collection device built into a terminal such as a mobile phone, a landline telephone, or a computer.
During telephone communication, a terminal such as a telephone captures the local sound in real time through the sound collection device and transmits it over the communication line to the other end of the call. In this method, after the local sound collection device captures the input sound signal, the signal is not transmitted directly to the other end; instead, echo cancellation is performed on it through steps S102 to S106 below to improve the quality of the voice call.
S102: Perform adaptive filtering on the input sound signal to obtain a near-end speech estimation signal.
After the input sound signal is acquired from the sound collection device, it is filtered to remove the echo signal that is generated at the local end and interferes with the normal call, yielding the near-end speech estimation signal. For the adaptive filtering, an adaptive echo canceller (AEC) may be used to filter the input sound signal and extract the near-end speech estimation signal.
S103: Determine the current utterance state from the near-end speech estimation signal.
The utterance states may include far-end only talk, double-talk, near-end only talk, and so on. Each utterance state corresponds to a different way of processing the near-end speech estimation signal, and further utterance states may be defined as needed; the states are not limited to these examples. Determining the current utterance state means determining, in real time, the utterance state of the near-end speech estimation signal obtained from the current filtering pass.
After the near-end speech estimation signal is obtained, its utterance state can be determined from attributes of the signal such as its waveform and channel.
S104: Obtain a preset mapping between utterance states and processing modes, and obtain from the mapping the processing mode corresponding to the current utterance state.
A processing mode is the processing applied to the near-end speech estimation signal in a given utterance state, for example setting the signal to zero, retaining it completely, or retaining part of it. The mapping between utterance states and processing modes can be set in advance, so that once the current utterance state is determined, the corresponding processing mode is obtained automatically from the mapping.
S105: Process the near-end speech estimation signal according to the processing mode.
After the processing mode corresponding to the current utterance state is obtained in step S104, the near-end speech estimation signal is processed according to that processing mode.
S106: Output the processed near-end speech estimation signal to obtain an output signal.
The processed near-end speech estimation signal correctly reflects the call information of the local end; this output signal can be transmitted to the other end of the call over the communication link.
Existing communication schemes either transmit the input sound signal of a voice call (such as a telephone call) directly to the peer device, or transmit it after adaptive echo cancellation alone. With the method of the above embodiment, a different processing mode is applied for each utterance state of the input sound signal, so that interference or echo in the input sound signal is filtered out precisely in light of the characteristics of double-talk. Call quality can be improved significantly, especially in call systems that are strongly affected by double-talk interference, such as hands-free calls and voice conferences.
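Steps S101 to S106 can be sketched as a per-frame pipeline in which the filtering stage, the state classifier, and the state-to-processing mapping are supplied by the caller. This is a minimal illustration only: the names `TalkState`, `cancel_echo_frame`, `toy_filter`, and `toy_classifier` are hypothetical, and the toy stages merely stand in for a real adaptive echo canceller and double-talk detector, which the text does not specify at this point.

```python
from enum import Enum
from typing import Callable, Dict, List, Sequence

Frame = List[float]


class TalkState(Enum):
    """Utterance states named in the text (identifiers are illustrative)."""
    FAR_END_ONLY = "far_end_only"
    NEAR_END_ONLY = "near_end_only"
    DOUBLE_TALK = "double_talk"


def cancel_echo_frame(
    mic_frame: Sequence[float],
    adaptive_filter: Callable[[Sequence[float]], Frame],
    classify: Callable[[Sequence[float], Sequence[float]], TalkState],
    handlers: Dict[TalkState, Callable[[Frame], Frame]],
) -> Frame:
    """Run one captured frame (S101) through steps S102 to S106."""
    near_est = adaptive_filter(mic_frame)    # S102: near-end speech estimate
    state = classify(mic_frame, near_est)    # S103: current utterance state
    handler = handlers[state]                # S104: look up the processing mode
    processed = handler(near_est)            # S105: apply it
    return processed                         # S106: output signal


# Toy stand-ins (assumptions, not the patent's filter or detector).
def toy_filter(frame: Sequence[float]) -> Frame:
    return [0.5 * v for v in frame]


def toy_classifier(mic: Sequence[float], est: Sequence[float]) -> TalkState:
    quiet = max(abs(v) for v in est) < 0.1
    return TalkState.FAR_END_ONLY if quiet else TalkState.DOUBLE_TALK


handlers = {
    TalkState.FAR_END_ONLY: lambda f: [0.0] * len(f),  # set to zero
    TalkState.NEAR_END_ONLY: list,                     # retain
    TalkState.DOUBLE_TALK: list,                       # retain
}

quiet_out = cancel_echo_frame([0.1, -0.1], toy_filter, toy_classifier, handlers)
loud_out = cancel_echo_frame([1.0, -1.0], toy_filter, toy_classifier, handlers)
```

Passing the mapping as a dictionary keeps S104 a plain lookup, so further utterance states can be added without touching the pipeline itself.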
Referring to FIG. 2, which is a schematic diagram of an application of an echo cancellation method based on double-talk detection: in the application scenario shown in FIG. 2, the parties to the call comprise a far-end device 200 and a near-end device 210. The far-end device 200 includes a far-end microphone 201 and a far-end loudspeaker 202, and the near-end device 210 includes a near-end loudspeaker 203 and a near-end microphone 204.
During communication, the far-end microphone 201 sends a downlink signal S1 to the near-end loudspeaker 203. The direct echo S2 is the sound emitted by the near-end loudspeaker 203 and picked up directly by the near-end microphone 204; the indirect echo S3 is the sound emitted by the near-end loudspeaker 203 and picked up by the near-end microphone 204 after being reflected by the environment. While the echoes (direct echo S2 and indirect echo S3) are being picked up, a person (not shown) speaks into the near-end microphone 204 (labelled "speech" in the figure). The near-end microphone 204 picks up the speech and produces an uplink signal S4, which is sent to the far-end loudspeaker 202 for playback.
The echo cancellation method of FIG. 1 can be applied on the near-end microphone 204 side of FIG. 2: the input sound signal picked up by the near-end microphone 204 (that is, the sound signal obtained from the speech in FIG. 2) is first processed by the echo cancellation method of FIG. 1 before it is sent to the far-end device 200.
In an embodiment, still referring to FIG. 1, step S103 of FIG. 1 (determining the current utterance state from the near-end speech estimation signal) may include steps S301 to S303 of FIG. 3:
S301: Calculate, from the input sound signal and the near-end speech estimation signal, the average value of the double-talk state statistic of the current frame of the input sound signal.
The double-talk state statistic of the current frame is a value that reflects the current utterance state in the input sound signal. It is obtained by taking the current frame of the input sound signal as a reference point, sampling the input sound signal and the near-end speech estimation signal before the reference point, and comparing the two. Its average value is the average of the statistic over several sample points. The average value for the current frame may be obtained by feeding the input sound signal and the near-end speech estimation signal into a double-talk detector.
S302: Obtain the double-talk decision threshold corresponding to the current frame, the threshold being derived from the signal-to-echo ratio of the input sound signal and the near-end interference signal.
The signal-to-echo ratio of the input sound signal is the energy ratio of signal to echo in the input sound signal; it can be obtained by performing a signal-to-echo ratio calculation on the input sound signal.
The near-end interference signal is the interference that sound produced by the same-end sound-emitting device (the one paired with the sound collection device) causes at the microphone; it can be obtained from that sound-emitting device. The sound-emitting device may be, for example, the loudspeaker paired with the local microphone in telephone communication.
The double-talk decision threshold is the threshold used to decide which utterance state the average double-talk state statistic of the current frame corresponds to. Thresholds delimiting the utterance states are set for the average statistic; these are the double-talk decision thresholds. They are set on the basis of two factors: the signal-to-echo ratio of the input sound signal and the near-end interference signal.
S303: Determine the current utterance state from the relative magnitude of the average double-talk state statistic of the current frame and the double-talk decision threshold.
The average double-talk state statistic of the current frame of the input sound signal is compared with the double-talk decision threshold obtained in step S302 to determine which utterance state's threshold interval the average falls in, and hence the current utterance state.
In this embodiment, the utterance state is determined in real time for each frame of the input sound signal, so that the processing mode applied to the near-end speech estimation signal is updated in real time. Echo cancellation can therefore be performed completely and accurately on the input sound signal, keeping the call stable.
In an embodiment, still referring to FIG. 3, the average value of the double-talk state statistic of the current frame of the input sound signal in step S301 may be calculated according to the following formula:
Figure PCTCN2020114168-appb-000001
where the symbol shown in Figure PCTCN2020114168-appb-000002 is the average value of the double-talk state statistic of the current frame, $P_S(k,n)$ denotes the power of the near-end speech estimation signal at the n-th sample point of the k-th frame, $P_m(k,n)$ denotes the power of the input sound signal at the n-th sample point of the k-th frame, and mean() denotes the average of the values in parentheses.
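The formula itself survives only as an image reference here, so its exact form cannot be read from the text. As a rough sketch under a loudly stated assumption, the code below takes the statistic to be the mean, over the sample points of frame k, of the ratio $P_S(k,n) / P_m(k,n)$, and takes the per-sample power to be the squared sample value. Both choices, the function name `double_talk_statistic`, and the `eps` guard are illustrative assumptions, not the patent's definition.

```python
from typing import Sequence


def double_talk_statistic(
    near_est_frame: Sequence[float],
    mic_frame: Sequence[float],
    eps: float = 1e-12,
) -> float:
    """Mean of P_S(k,n) / P_m(k,n) over one frame (ASSUMED form of the statistic).

    near_est_frame: near-end speech estimate samples of frame k
    mic_frame:      input sound signal samples of frame k
    eps:            small constant that avoids division by zero (assumption)
    """
    if len(near_est_frame) != len(mic_frame) or not mic_frame:
        raise ValueError("frames must be non-empty and of equal length")
    ratios = [
        (s * s) / (m * m + eps)  # assumed per-sample power: squared amplitude
        for s, m in zip(near_est_frame, mic_frame)
    ]
    return sum(ratios) / len(ratios)
```

Under this assumed form, a value near 1 means the filter removed almost nothing (the microphone signal is mostly near-end speech), while a value near 0 means most of the microphone signal was echo.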
In an embodiment, still referring to FIG. 3, step S302 of FIG. 3 (obtaining the double-talk decision threshold corresponding to the current frame, the threshold being derived from the signal-to-echo ratio of the input sound signal and the near-end interference signal) may include steps S401 to S403 of FIG. 4:
S401: Estimate the signal-to-echo ratio of the input sound signal in real time to obtain the average signal-to-echo ratio of the current frame of the input sound signal.
The input sound signal is passed to a signal-to-echo ratio calculator to obtain its signal-to-echo ratio in real time. With the current frame as the reference point, values quantifying the influence of the near-end interference signal on the input sound signal before the reference point are calculated and averaged, giving the average signal-to-echo ratio of the current frame.
S402: Obtain multiple preset signal-to-echo ratio thresholds, and construct multiple signal-to-echo ratio intervals from them.
The preset signal-to-echo ratio thresholds are values obtained from experience or determined by technicians. They serve as boundary values that delimit the signal-to-echo ratio intervals. A corresponding double-talk decision threshold is set for each interval.
S403: Determine the signal-to-echo ratio interval to which the average signal-to-echo ratio of the current frame belongs, and take the double-talk decision threshold corresponding to that interval as the double-talk decision threshold of the current frame.
The interval to which the average signal-to-echo ratio of the current frame belongs is determined, and the double-talk decision threshold set for that interval is used as the threshold of the current frame, which completes step S302 of FIG. 3.
In this embodiment, the signal-to-echo ratio of the input sound signal, with the near-end interference signal as the echo source, is computed in real time by sampling. Different double-talk decision thresholds are set according to how strongly the near-end interference signal affects the input sound signal, so that the current utterance state is determined more accurately and the accuracy of echo cancellation on the input sound signal is improved.
Optionally, the average signal-to-echo ratio of the current frame in step S401 of FIG. 4 is calculated as follows:
Obtain the near-end interference signal, which is the sound signal produced by the sound-emitting device at the same end as the sound collection device.
Calculate the average signal-to-echo ratio of the current frame of the input sound signal according to the following formula:
Figure PCTCN2020114168-appb-000003
where the symbol shown in Figure PCTCN2020114168-appb-000004 denotes the estimated average signal-to-echo ratio of the k-th frame, in dB, $P_m(k,n)$ denotes the power of the input sound signal at the n-th sample point of the k-th frame, $P_x(k,n)$ denotes the power of the near-end interference signal at the n-th sample point of the k-th frame, and mean() denotes the average of the values in parentheses.
$P_m(k,n)$ and $P_x(k,n)$ are the power values at the sample points obtained by sampling the input sound signal and the near-end interference signal frame by frame. In the sampling process, n sample points are taken from each of the input sound signal and the near-end interference signal, and the signal frame containing each sample point is frame k. Here n and k are variable counters.
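Because the formula is only available as an image reference, the following is a sketch under an explicit assumption: the average signal-to-echo ratio of frame k is taken as 10·log10 of mean($P_m$) over mean($P_x$), with per-sample power taken as the squared sample value. The function name `average_ser_db`, the placement of mean() inside the logarithm, and the `eps` guard are assumptions for illustration; the published formula may arrange these terms differently.

```python
import math
from typing import Sequence


def average_ser_db(
    mic_frame: Sequence[float],
    interference_frame: Sequence[float],
    eps: float = 1e-12,
) -> float:
    """Average signal-to-echo ratio of one frame in dB (ASSUMED form).

    mic_frame:          input sound signal samples of frame k  -> P_m(k, n)
    interference_frame: near-end interference (loudspeaker) samples -> P_x(k, n)
    """
    if not mic_frame or not interference_frame:
        raise ValueError("frames must be non-empty")
    mean_pm = sum(v * v for v in mic_frame) / len(mic_frame)
    mean_px = sum(v * v for v in interference_frame) / len(interference_frame)
    # eps keeps the logarithm finite when either frame is silent (assumption).
    return 10.0 * math.log10((mean_pm + eps) / (mean_px + eps))
```

With this form, a microphone frame ten times larger in amplitude than the interference frame gives roughly 20 dB, and equal powers give roughly 0 dB.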
In an embodiment, still referring to FIG. 4, step S402 of FIG. 4 (obtaining multiple preset signal-to-echo ratio thresholds and constructing multiple signal-to-echo ratio intervals from them) may include: using each pair of adjacent thresholds among the obtained signal-to-echo ratio thresholds as the boundary values of an interval, thereby obtaining multiple signal-to-echo ratio intervals.
The preset signal-to-echo ratio thresholds may be stored. When the intervals are to be constructed, the stored thresholds are read from the storage area and sorted by value, and each pair of adjacent thresholds after sorting forms the boundaries of one interval, giving multiple signal-to-echo ratio intervals.
For example, let the preset signal-to-echo ratio thresholds be SER_thr_1, SER_thr_2, SER_thr_3, …, SER_thr_k. The intervals formed by these thresholds are shown in FIG. 5, which is a schematic diagram of signal-to-echo ratio intervals in an embodiment.
The intervals can be denoted signal-to-echo ratio interval 501, signal-to-echo ratio interval 502, …, signal-to-echo ratio interval 50k, where k is a variable index denoting the k-th interval 50k; k+1 signal-to-echo ratio thresholds yield k intervals.
Next, a corresponding double-talk decision threshold is set for each interval, namely the double-talk decision thresholds m1, m2, …, mk in FIG. 5. Once the interval to which the average signal-to-echo ratio of the current frame belongs is determined, the corresponding double-talk decision threshold is obtained; this is step S403.
In this embodiment, the signal-to-echo ratio intervals are constructed automatically, with the preset signal-to-echo ratio thresholds as the interval boundaries.
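Steps S402 and S403 amount to sorting the thresholds, pairing adjacent values into intervals, and looking up the interval that contains the current frame's average signal-to-echo ratio. A minimal sketch follows; the function names, the left-closed interval convention, and the clamping of out-of-range values to the first or last interval are assumptions, since the text does not state how boundary or out-of-range values are handled. The numeric values in the usage lines are made up for illustration.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def build_ser_intervals(ser_thresholds: Sequence[float]) -> List[Tuple[float, float]]:
    """S402: sort the thresholds and pair adjacent ones into intervals."""
    bounds = sorted(ser_thresholds)
    return list(zip(bounds[:-1], bounds[1:]))


def lookup_double_talk_threshold(
    avg_ser_db: float,
    ser_thresholds: Sequence[float],
    dt_thresholds: Sequence[float],
) -> float:
    """S403: return the double-talk threshold of the interval holding avg_ser_db.

    k+1 SER thresholds define k intervals, so dt_thresholds needs k entries
    (m1 ... mk in FIG. 5).
    """
    bounds = sorted(ser_thresholds)
    if len(dt_thresholds) != len(bounds) - 1:
        raise ValueError("need exactly one double-talk threshold per interval")
    idx = bisect_right(bounds, avg_ser_db) - 1        # interval [bounds[i], bounds[i+1])
    idx = min(max(idx, 0), len(dt_thresholds) - 1)    # clamp out-of-range (assumption)
    return dt_thresholds[idx]


# Hypothetical values, for illustration only.
ser_thr = [10.0, -10.0, 20.0, 0.0]   # unsorted on purpose; dB
dt_thr = [0.2, 0.4, 0.6]             # m1, m2, m3
intervals = build_ser_intervals(ser_thr)
chosen = lookup_double_talk_threshold(5.0, ser_thr, dt_thr)
```

Binary search keeps the lookup cheap enough to run on every frame, which matters because the threshold is refreshed in real time.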
In an embodiment, the utterance states comprise two states, far-end only talk and not far-end only talk, and the preset mapping between utterance states and processing modes includes: when the utterance state is far-end only talk, setting the near-end speech estimation signal to zero or suppressing it until inaudible; when the utterance state is not far-end only talk, retaining the near-end speech estimation signal.
The two utterance states, far-end only talk and not far-end only talk, can be set according to the effect of the near-end speech estimation signal on the actual call. When the current utterance state is determined from the near-end speech estimation signal to be far-end only talk, the near-end speech estimation signal is set to zero or suppressed until inaudible; that is, it is filtered out and a silent signal is transmitted to the peer device as the local transmission signal. When the current utterance state is determined to be not far-end only talk, the near-end speech estimation signal is retained and transmitted to the peer device as the local transmission signal.
In this embodiment, two utterance states are defined, together with the processing mode for each, which basically meets the requirements of real-time echo cancellation in common voice calls.
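The two-state mapping described above can be written as a small lookup table from state to handler. In this sketch the state keys, the function names, and the default of -60 dB for "suppressed until inaudible" are assumptions chosen for illustration; the text gives no numeric attenuation.

```python
from typing import Callable, Dict, List, Sequence

Frame = List[float]


def zero_out(frame: Sequence[float]) -> Frame:
    """Set the near-end speech estimate to zero (transmit silence)."""
    return [0.0] * len(frame)


def attenuate(frame: Sequence[float], gain_db: float = -60.0) -> Frame:
    """Alternative to zeroing: scale the frame down by gain_db (assumed value)."""
    gain = 10.0 ** (gain_db / 20.0)
    return [gain * v for v in frame]


def retain(frame: Sequence[float]) -> Frame:
    """Keep the near-end speech estimate unchanged."""
    return list(frame)


# Preset mapping between utterance state and processing mode.
PROCESSING_BY_STATE: Dict[str, Callable[[Sequence[float]], Frame]] = {
    "far_end_only": zero_out,      # swap in attenuate to suppress instead of zero
    "not_far_end_only": retain,
}


def apply_processing(state: str, near_est_frame: Sequence[float]) -> Frame:
    """Look up the handler for the state and apply it to the frame."""
    return PROCESSING_BY_STATE[state](near_est_frame)
```

Attenuating rather than hard-zeroing may sound less abrupt at state transitions, but which of the two to use is a tuning decision the text leaves open.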
Optionally, the not far-end only talk state comprises two states: near-end only talk and double-talk.
The not far-end only talk state can be analysed further and divided into near-end only talk and double-talk. Near-end only talk means that the sound collection device captures only the local transmission signal and no near-end interference signal; double-talk means that the sound collection device captures both the local transmission signal and the near-end interference signal. Processing modes may be further specified for these two states; for example, for near-end only talk, the sound signal is transmitted to the other end directly without processing.
In an embodiment, still referring to FIG. 1, step S102 of FIG. 1 (performing adaptive filtering on the input sound signal to obtain the near-end speech estimation signal) may include two filtering operations: linear filtering and non-linear filtering.
The input sound signal is first processed by the linear filtering stage of a filter such as an AEC to cancel part of the echo. After linear filtering, however, the input sound signal still contains residual linear echo and non-linear echo, and, when the near end is talking, near-end speech as well. Applying non-linear processing to the sound signal that contains residual echo achieves further echo suppression.
In this embodiment, the adaptive filtering of the input sound signal comprises two operations, linear filtering and non-linear filtering, which further suppresses the echo in the input sound signal.
An embodiment of this application further provides an echo cancellation apparatus based on double-talk detection. Referring to FIG. 6, the apparatus may include an input sound signal acquisition module 601, a filtering module 602, an utterance state determination module 603, a processing mode acquisition module 604, a near-end processing module 605 and an output module 606, wherein:
The input sound signal acquisition module 601 is configured to acquire an input sound signal from a sound collection device.
The filtering module 602 is configured to perform adaptive filtering on the input sound signal to obtain a near-end speech estimation signal.
The utterance state determination module 603 is configured to determine the current utterance state according to the near-end speech estimation signal.
The processing mode acquisition module 604 is configured to acquire a preset mapping between utterance states and processing modes, and to obtain from that mapping the processing mode corresponding to the current utterance state.
The near-end processing module 605 is configured to process the near-end speech estimation signal according to the processing mode.
The output module 606 is configured to output the processed near-end speech estimation signal to obtain an output signal.
In one embodiment, the utterance state determination module 603 may include:
a real-time utterance state acquisition unit, configured to calculate, from the input sound signal and the near-end speech estimation signal, the average value of the double-talk state statistic of the current frame of the input sound signal;
a threshold acquisition unit, configured to acquire the double-talk decision threshold corresponding to the current frame, the double-talk decision threshold being obtained from the signal-to-echo ratio of the input sound signal and the near-end interference signal;
an utterance state determination unit, configured to determine the current utterance state by comparing the average value of the double-talk state statistic of the current frame with the double-talk decision threshold.
In one embodiment, the threshold acquisition unit includes:
a current signal-to-echo ratio acquisition subunit, configured to estimate the signal-to-echo ratio of the input sound signal in real time to obtain the average signal-to-echo ratio of the current frame of the input sound signal;
a signal-to-echo ratio interval construction subunit, configured to acquire a plurality of preset signal-to-echo ratio thresholds and to construct a plurality of signal-to-echo ratio intervals from them;
a threshold judging subunit, configured to determine the signal-to-echo ratio interval to which the average signal-to-echo ratio of the current frame belongs, and to take the double-talk decision threshold associated with that interval as the double-talk decision threshold of the current frame.
In one embodiment, the signal-to-echo ratio interval construction subunit is further configured to use each pair of adjacent thresholds among the acquired signal-to-echo ratio thresholds as the boundary values of one interval, thereby obtaining the plurality of signal-to-echo ratio intervals.
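The interval construction and threshold lookup described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the example threshold values, and the two open-ended outer intervals (below the lowest and above the highest threshold) are assumptions added so that every signal-to-echo ratio value maps to some interval.

```python
from bisect import bisect_right


def build_intervals(ser_thresholds_db):
    """Return (low, high) signal-to-echo ratio intervals from preset thresholds.

    Adjacent thresholds bound the inner intervals, as in the text. The two
    open-ended outer intervals are an assumption of this sketch.
    """
    edges = sorted(ser_thresholds_db)
    bounds = [float("-inf")] + edges + [float("inf")]
    return list(zip(bounds[:-1], bounds[1:]))


def lookup_dt_threshold(avg_ser_db, ser_thresholds_db, dt_thresholds):
    """Pick the double-talk decision threshold for the current frame.

    dt_thresholds holds one (hypothetical) decision threshold per interval,
    ordered from the lowest interval to the highest.
    """
    edges = sorted(ser_thresholds_db)
    if len(dt_thresholds) != len(edges) + 1:
        raise ValueError("need exactly one decision threshold per interval")
    # bisect_right gives the index of the half-open interval [low, high)
    # that contains avg_ser_db.
    return dt_thresholds[bisect_right(edges, avg_ser_db)]
```

For example, with hypothetical thresholds of 0 dB and 10 dB and decision thresholds `[0.1, 0.2, 0.3]`, a frame whose average signal-to-echo ratio is 5 dB falls in the middle interval and receives the threshold 0.2.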
In one embodiment, still referring to FIG. 6, the filtering module 602 in FIG. 6 is further configured to perform linear filtering and nonlinear filtering on the input sound signal, respectively, to obtain the near-end speech estimation signal.
For more on the working principle and operation of the echo cancellation apparatus based on double-talk detection, refer to the related descriptions of FIGS. 1 to 5 above, which are not repeated here.
An embodiment of this application further provides an echo cancellation system based on double-talk detection, comprising a sound collection device, a same-end sound-emitting device and an echo cancellation device, the echo cancellation device performing the steps of the echo cancellation method based on double-talk detection described with reference to FIGS. 1 to 5.
Referring to FIG. 7, FIG. 7 is a schematic diagram of an echo cancellation system based on double-talk detection. The system includes a sound collection device 701, an echo cancellation device 702 and a same-end sound-emitting device 703. The sound collection device 701 may be the microphone used in a telephone call and collects the input sound signal A1. The same-end sound-emitting device 703 may be the loudspeaker connected at the same end as the microphone; it produces a sound signal that may interfere with the input sound signal A1, and this signal is therefore treated as the interference sound signal A6. The echo cancellation device 702 is the device that implements the echo cancellation method based on double-talk detection of FIGS. 1 to 5; its functions may be implemented by physical or logic circuits, by software, or the like.
Optionally, as shown in FIG. 7, the echo cancellation device 702 may include a linear AEC filter 7021, an NLP filter 7022, a double-talk detector 7023, a signal-to-echo ratio estimator 7024, a threshold determiner 7025 and a processor 7026.
During a call, the echo cancellation device 702 processes the sound signals received from the sound collection device 701 and the same-end sound-emitting device 703 as follows:
The input sound signal A1 is obtained from the sound collection device 701 and subjected to echo cancellation: A1 is linearly filtered by the linear AEC filter 7021 to give the linearly filtered sound signal A2, and A2 is then nonlinearly filtered by the NLP filter 7022 to give the near-end speech estimation signal A3, which serves as one input of the double-talk detector 7023. The input sound signal A1 itself serves as the other input of the double-talk detector. The linear AEC filter 7021 uses the interference sound signal A6 as its filtering reference when linearly filtering A1.
In addition, the input sound signal A1 is fed to the signal-to-echo ratio estimator 7024, which computes the average signal-to-echo ratio A4 of the current frame in real time and passes it to the threshold determiner 7025. Using the signal-to-echo ratio intervals constructed from the preset signal-to-echo ratio thresholds, the threshold determiner 7025 determines the double-talk decision threshold A5 corresponding to A4 and sends it to the double-talk detector 7023 as the basis for determining the current utterance state. The signal-to-echo ratio estimator 7024 samples the input sound signal A1 and the interference sound signal A6 and computes the average signal-to-echo ratio A4 of the current frame according to the following formula:
Figure PCTCN2020114168-appb-000005
where the quantity shown in Figure PCTCN2020114168-appb-000006 is the estimated average signal-to-echo ratio of the k-th frame, in dB; $P_m(k,n)$ is the power of the input sound signal A1 at the n-th sample point of the k-th frame; $P_x(k,n)$ is the power of the near-end interference signal A6 at the n-th sample point of the k-th frame; and mean() takes the average of the values in parentheses.
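The published formula is available here only as an image, so its exact form cannot be confirmed from this text. As a rough sketch under that caveat, the following assumes the average signal-to-echo ratio is the ratio of the frame-mean powers expressed in dB, that is, 10·log10(mean(P_m)/mean(P_x)). The function name and the small `eps` guard against division by zero are additions of this sketch.

```python
import math


def average_ser_db(mic_power, ref_power, eps=1e-12):
    """Average signal-to-echo ratio of one frame, in dB.

    mic_power: per-sample powers P_m(k, n) of the input sound signal.
    ref_power: per-sample powers P_x(k, n) of the near-end interference signal.
    ASSUMED form: 10 * log10(mean(P_m) / mean(P_x)); the patent's own formula
    is an image and may differ (for example, by averaging the ratio instead).
    """
    if not mic_power or len(mic_power) != len(ref_power):
        raise ValueError("frames must be non-empty and of equal length")
    mean_pm = sum(mic_power) / len(mic_power)
    mean_px = sum(ref_power) / len(ref_power)
    # eps keeps the ratio finite when the loudspeaker signal is silent.
    return 10.0 * math.log10((mean_pm + eps) / (mean_px + eps))
```

Under this assumed form, a frame in which the microphone power is four times the loudspeaker power gives roughly 6 dB.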
The double-talk detector 7023 receives the first input signal (the near-end speech estimation signal A3), the second input signal (the input sound signal A1) and the double-talk decision threshold A5, and uses them to determine the current utterance state A7 in real time. The current utterance state is derived from the average value of the double-talk state statistic of the current frame. The double-talk detector 7023 samples the near-end speech estimation signal A3 and the input sound signal A1 and computes the average value of the double-talk state statistic of the current frame according to the following formula:
Figure PCTCN2020114168-appb-000007
where the quantity shown in Figure PCTCN2020114168-appb-000008 is the average value of the double-talk state statistic of the current frame; $P_S(k,n)$ is the power of the near-end speech estimation signal A3 at the n-th sample point of the k-th frame; $P_m(k,n)$ is the power of the input sound signal A1 at the n-th sample point of the k-th frame; and mean() takes the average of the values in parentheses.
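The detector's decision can be illustrated as below. Two points are assumptions of this sketch rather than statements of the patent, whose formula appears here only as an image: the statistic is taken to be the ratio of the frame-mean powers, mean(P_S)/mean(P_m), and a value below the decision threshold is taken to indicate far-end-only utterance (little near-end energy survives filtering). All names are hypothetical.

```python
from enum import Enum


class UtteranceState(Enum):
    FAR_END_ONLY = "far_end_only"
    NOT_FAR_END_ONLY = "not_far_end_only"


def dt_statistic(est_power, mic_power, eps=1e-12):
    """Average double-talk state statistic of one frame.

    est_power: per-sample powers P_S(k, n) of the near-end speech estimate.
    mic_power: per-sample powers P_m(k, n) of the input sound signal.
    ASSUMED form: mean(P_S) / mean(P_m).
    """
    if not est_power or len(est_power) != len(mic_power):
        raise ValueError("frames must be non-empty and of equal length")
    mean_ps = sum(est_power) / len(est_power)
    mean_pm = sum(mic_power) / len(mic_power)
    return mean_ps / (mean_pm + eps)


def decide_state(statistic, dt_threshold):
    """Compare the statistic with the threshold (direction is an assumption)."""
    if statistic < dt_threshold:
        return UtteranceState.FAR_END_ONLY
    return UtteranceState.NOT_FAR_END_ONLY
```

The threshold passed to `decide_state` would be the per-frame double-talk decision threshold A5 selected from the signal-to-echo ratio intervals.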
The double-talk detector 7023 sends the resulting current utterance state A7 to the processor 7026, which processes the near-end speech estimation signal A3 accordingly. When the utterance state is far-end-only utterance, the near-end speech estimation signal A3 is set to zero or suppressed until inaudible; when the utterance state is not-far-end-only utterance, the near-end speech estimation signal A3 is retained. The processed near-end speech estimation signal is output as the output signal A8, which can be transmitted over the communication link to the device at the opposite end.
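The state-dependent processing can be sketched as a lookup in a preset mapping from utterance state to handler. The state labels, function names and the list-of-floats frame representation are illustrative assumptions; the sketch uses zeroing for the far-end-only case, although the text equally allows suppression to an inaudible level.

```python
def mute(frame):
    """Far-end-only utterance: replace the frame with zeros."""
    return [0.0] * len(frame)


def keep(frame):
    """Not-far-end-only utterance: pass the frame through unchanged."""
    return list(frame)


# Preset mapping between utterance state and processing mode
# (the string labels are hypothetical).
PROCESSING_BY_STATE = {
    "far_end_only": mute,
    "not_far_end_only": keep,
}


def process_frame(state, near_end_estimate):
    """Apply the processing mode mapped to the current utterance state."""
    try:
        handler = PROCESSING_BY_STATE[state]
    except KeyError:
        raise ValueError(f"unknown utterance state: {state!r}") from None
    return handler(near_end_estimate)
```

Keeping the mapping in a table makes it straightforward to add separate handlers later, for instance if near-end-only utterance and double-talk are to be treated differently.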
Referring to the application scenario of the echo cancellation method based on double-talk detection shown in FIG. 2, the system of FIG. 7 can be applied on the near-end device side of that scenario to cancel the direct echo S2 and the indirect echo S3 in FIG. 2.
The echo cancellation system based on double-talk detection described above detects the acoustic echo produced during a call in real time and cancels it according to the detection result, which improves echo cancellation when the voice communication terminal is in hands-free mode and thereby improves call quality.
For hands-free voice communication terminals in particular, the echo cancellation method, apparatus and system based on double-talk detection provided in the embodiments of this application can distinguish in real time between far-end-only utterance on the one hand and near-end-only utterance or double-talk on the other. When far-end-only utterance is detected, the time-domain output is set to zero or suppressed until inaudible, so that echo is removed as far as possible while duplex call performance is preserved. Echo cancellation and duplex performance are thus improved together, which improves the duplex call experience of hands-free voice communication terminals.
Although this application is disclosed as above, it is not limited thereto. Any person skilled in the art may make various changes and modifications without departing from the spirit and scope of this application, and the scope of protection of this application shall therefore be that defined by the claims.

Claims (11)

  1. An echo cancellation method based on double-talk detection, characterized in that the method comprises:
    acquiring an input sound signal from a sound collection device;
    performing adaptive filtering on the input sound signal to obtain a near-end speech estimation signal;
    determining the current utterance state according to the near-end speech estimation signal;
    acquiring a preset mapping between utterance states and processing modes, and obtaining from the mapping the processing mode corresponding to the current utterance state;
    processing the near-end speech estimation signal according to the processing mode;
    outputting the processed near-end speech estimation signal to obtain an output signal.
  2. The method according to claim 1, characterized in that determining the current utterance state according to the near-end speech estimation signal comprises:
    calculating, from the input sound signal and the near-end speech estimation signal, the average value of the double-talk state statistic of the current frame of the input sound signal;
    acquiring the double-talk decision threshold corresponding to the current frame, the double-talk decision threshold being obtained from the signal-to-echo ratio of the input sound signal and the near-end interference signal;
    determining the current utterance state by comparing the average value of the double-talk state statistic of the current frame with the double-talk decision threshold.
  3. The method according to claim 2, characterized in that calculating, from the input sound signal and the near-end speech estimation signal, the average value of the double-talk state statistic of the current frame of the input sound signal comprises:
    calculating the average value of the double-talk state statistic of the current frame according to the following formula:
    Figure PCTCN2020114168-appb-100001
    where the quantity shown in Figure PCTCN2020114168-appb-100002 is the average value of the double-talk state statistic of the current frame; $P_S(k,n)$ is the power of the near-end speech estimation signal at the n-th sample point of the k-th frame; $P_m(k,n)$ is the power of the input sound signal at the n-th sample point of the k-th frame; and mean() takes the average of the values in parentheses.
  4. The method according to claim 2, characterized in that acquiring the double-talk decision threshold corresponding to the current frame, the double-talk decision threshold being obtained from the signal-to-echo ratio of the input sound signal and the near-end interference signal, comprises:
    estimating the signal-to-echo ratio of the input sound signal in real time to obtain the average signal-to-echo ratio of the current frame of the input sound signal;
    acquiring a plurality of preset signal-to-echo ratio thresholds, and constructing a plurality of signal-to-echo ratio intervals from the plurality of signal-to-echo ratio thresholds;
    determining the signal-to-echo ratio interval to which the average signal-to-echo ratio of the current frame belongs, and taking the double-talk decision threshold associated with that interval as the double-talk decision threshold of the current frame.
  5. The method according to claim 4, characterized in that estimating the signal-to-echo ratio of the input sound signal in real time to obtain the average signal-to-echo ratio of the current frame of the input sound signal comprises:
    acquiring a near-end interference signal, the near-end interference signal being the sound signal produced by a sound-emitting device at the same end as the sound collection device;
    calculating the average signal-to-echo ratio of the current frame of the input sound signal according to the following formula:
    Figure PCTCN2020114168-appb-100003
    where the quantity shown in Figure PCTCN2020114168-appb-100004 is the estimated average signal-to-echo ratio of the k-th frame, in dB; $P_m(k,n)$ is the power of the input sound signal at the n-th sample point of the k-th frame; $P_x(k,n)$ is the power of the near-end interference signal at the n-th sample point of the k-th frame; and mean() takes the average of the values in parentheses.
  6. The method according to claim 4, characterized in that acquiring a plurality of preset signal-to-echo ratio thresholds and constructing a plurality of signal-to-echo ratio intervals from the plurality of signal-to-echo ratio thresholds comprises:
    using each pair of adjacent thresholds among the acquired signal-to-echo ratio thresholds as the boundary values of one signal-to-echo ratio interval, thereby obtaining the plurality of signal-to-echo ratio intervals.
  7. The method according to claim 1, characterized in that the utterance state is one of two states, far-end-only utterance and not-far-end-only utterance, and the preset mapping between utterance states and processing modes comprises:
    when the utterance state is far-end-only utterance, setting the near-end speech estimation signal to zero or suppressing it until inaudible;
    when the utterance state is not-far-end-only utterance, retaining the near-end speech estimation signal.
  8. The method according to claim 7, characterized in that the not-far-end-only utterance comprises two states: near-end-only utterance and double-talk.
  9. The method according to claim 1, characterized in that performing adaptive filtering on the input sound signal to obtain a near-end speech estimation signal comprises:
    performing linear filtering and nonlinear filtering on the input sound signal, respectively, to obtain the near-end speech estimation signal.
  10. An echo cancellation apparatus based on double-talk detection, characterized in that the apparatus comprises:
    an input sound signal acquisition module, configured to acquire an input sound signal from a sound collection device;
    a filtering module, configured to perform adaptive filtering on the input sound signal to obtain a near-end speech estimation signal;
    an utterance state determination module, configured to determine the current utterance state according to the near-end speech estimation signal;
    a processing mode acquisition module, configured to acquire a preset mapping between utterance states and processing modes, and to obtain from the mapping the processing mode corresponding to the current utterance state;
    a near-end processing module, configured to process the near-end speech estimation signal according to the processing mode;
    an output module, configured to output the processed near-end speech estimation signal to obtain an output signal.
  11. An echo cancellation system based on double-talk detection, comprising a sound collection device, a same-end sound-emitting device and an echo cancellation device, characterized in that the echo cancellation device performs the steps of the method according to any one of claims 1 to 9.
PCT/CN2020/114168 2019-12-13 2020-09-09 Echo cancellation method, apparatus, and system employing double-talk detection WO2021114779A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911284296.3A CN110995951B (en) 2019-12-13 2019-12-13 Echo cancellation method, device and system based on double-end sounding detection
CN201911284296.3 2019-12-13

Publications (1)

Publication Number Publication Date
WO2021114779A1 true WO2021114779A1 (en) 2021-06-17

Family

ID=70093348

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/114168 WO2021114779A1 (en) 2019-12-13 2020-09-09 Echo cancellation method, apparatus, and system employing double-talk detection

Country Status (2)

Country Link
CN (1) CN110995951B (en)
WO (1) WO2021114779A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110995951B (en) * 2019-12-13 2021-09-03 展讯通信(上海)有限公司 Echo cancellation method, device and system based on double-end sounding detection
CN111556210B (en) * 2020-04-23 2021-10-22 深圳市未艾智能有限公司 Call voice processing method and device, terminal equipment and storage medium
CN113225442B (en) * 2021-04-16 2022-09-02 杭州网易智企科技有限公司 Method and device for eliminating echo
CN113241085B (en) * 2021-04-29 2022-07-22 北京梧桐车联科技有限责任公司 Echo cancellation method, device, equipment and readable storage medium
CN113808609A (en) * 2021-09-18 2021-12-17 展讯通信(上海)有限公司 Echo detection method and device, computer readable storage medium and terminal equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101179294A (en) * 2006-11-09 2008-05-14 爱普拉斯通信技术(北京)有限公司 Self-adaptive echo eliminator and echo eliminating method thereof
CN102655558A (en) * 2012-05-21 2012-09-05 宁波工程学院 Double-end pronouncing robust structure and acoustic echo cancellation method
US20120281603A1 (en) * 2011-05-06 2012-11-08 Futurewei Technologies, Inc. Transmit Phase Control for the Echo Cancel Based Full Duplex Transmission System
CN107635082A (en) * 2016-07-18 2018-01-26 深圳市有信网络技术有限公司 A kind of both-end sounding end detecting system
CN110995951A (en) * 2019-12-13 2020-04-10 展讯通信(上海)有限公司 Echo cancellation method, device and system based on double-end sounding detection

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3922997B2 (en) * 2002-10-30 2007-05-30 沖電気工業株式会社 Echo canceller
CN1925346A (en) * 2006-09-05 2007-03-07 华为技术有限公司 Detecting method for double speaking state in echo wave counteract
JP4411309B2 (en) * 2006-09-21 2010-02-10 Okiセミコンダクタ株式会社 Double talk detection method
US8345860B1 (en) * 2008-05-09 2013-01-01 Hellosoft India PVT. Ltd Method and system for detection of onset of near-end signal in an echo cancellation system
CN102160296B (en) * 2009-01-20 2014-01-22 华为技术有限公司 Method and apparatus for detecting double talk
CN101719969B (en) * 2009-11-26 2013-10-02 美商威睿电通公司 Method and system for judging double-end conversation and method and system for eliminating echo
CN102377453B (en) * 2010-08-06 2014-02-26 联芯科技有限公司 Method and device for controlling updating of self-adaptive filter and echo canceller
CN101917527B (en) * 2010-09-02 2013-07-03 杭州华三通信技术有限公司 Method and device of echo elimination
CN102065190B (en) * 2010-12-31 2013-08-28 杭州华三通信技术有限公司 Method and device for eliminating echo
CN103179296B (en) * 2011-12-26 2017-02-15 中兴通讯股份有限公司 Echo canceller and echo cancellation method
US9100466B2 (en) * 2013-05-13 2015-08-04 Intel IP Corporation Method for processing an audio signal and audio receiving circuit
CN104519212B (en) * 2013-09-27 2017-06-20 华为技术有限公司 A kind of method and device for eliminating echo
US20160171988A1 (en) * 2014-12-15 2016-06-16 Wire Swiss Gmbh Delay estimation for echo cancellation using ultrasonic markers
CN106533500B (en) * 2016-11-25 2019-11-12 上海伟世通汽车电子系统有限公司 A method of optimization Echo Canceller convergence property
CN108134863B (en) * 2017-12-26 2020-06-19 中山大学花都产业科技研究院 Improved double-end detection device and detection method based on double statistics
CN108540680B (en) * 2018-02-02 2021-03-02 广州视源电子科技股份有限公司 Switching method and device of speaking state and conversation system
CN108696648B (en) * 2018-05-16 2021-08-24 上海小度技术有限公司 Method, device, equipment and storage medium for processing short-time voice signal
CN109547655A (en) * 2018-12-30 2019-03-29 广东大仓机器人科技有限公司 A kind of method of the echo cancellation process of voice-over-net call
CN110138990A (en) * 2019-05-14 2019-08-16 浙江工业大学 A method of eliminating mobile device voip phone echo
CN110335618B (en) * 2019-06-06 2021-07-30 福建星网智慧软件有限公司 Method for improving nonlinear echo suppression and computer equipment

Also Published As

Publication number Publication date
CN110995951B (en) 2021-09-03
CN110995951A (en) 2020-04-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20899723

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20899723

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16.01.2023)
