WO2019169272A1 - Enhanced barge-in detector - Google Patents

Info

Publication number
WO2019169272A1
Authority
WO
WIPO (PCT)
Prior art keywords
barge
asr
prompt
vad
signal
Prior art date
Application number
PCT/US2019/020305
Other languages
French (fr)
Inventor
Matthew R. KIRSCH
Guillaume Lamy
Original Assignee
Continental Automotive Systems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Continental Automotive Systems, Inc.
Publication of WO2019169272A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/222: Barge in, i.e. overridable guidance for interrupting prompts

Abstract

In an automatic speech recognition (ASR) system, an enhanced-voice-barge-in-detection system that allows a user to interrupt an ASR audio prompt by speaking a next command, the enhanced-voice-barge-in-detection system including: an ASR-prompt voice activity detector (VAD) configured to perform voice activity detection on the ASR-prompt; an acoustic echo canceller (AEC) configured to attenuate the ASR prompt played through the vehicle speakers and captured by the one or more microphones in the vehicle cabin; an uplink voice activity detector (UL VAD) configured to perform voice activity detection on the uplink signal; and a barge-in detector configured to receive data inputs from the ASR-prompt VAD, AEC, and UL VAD to generate a barge-in decision indicating either that barge-in is active or that barge-in is not active such that when the barge-in decision indicates that barge-in is active, the ASR prompt can be stopped and echo processing of the ASR prompt can be bypassed.

Description

ENHANCED BARGE-IN DETECTOR
BACKGROUND
[0001] Many motor vehicles are provided with “hands-free” systems, which are typically speaker phones that enable a driver or passenger to use his or her cell phone without having to hold the device against the user’s ear. Hands-free systems may also provide automatic speech recognition of voice commands, such as entering a destination into a navigation system, placing a telephone call to someone in a contacts list, and the like.
[0002] Hands-free systems typically comprise one or more microphones and one or more speakers. The microphone or microphones are typically non-directional and designed to pick up sounds from almost anywhere inside a vehicle, including a user’s voice. The speaker or speakers are typically designed to provide audio at power levels that can be heard anywhere inside the vehicle.
[0003] In automatic speech recognition (ASR) systems, voice barge-in allows a user to interrupt an ASR audio prompt by speaking the next command (i.e., barging in and interrupting the prompt). With this functionality, the user does not have to wait for the end of the prompt in order to proceed with the ASR session. For example, if the audio prompt specifies several options, the user does not have to wait for all options to be listed before making a selection.
[0004] To detect barge-in, acoustic echo cancellation (AEC) and a voice activity detector (VAD) are typically used. The AEC significantly attenuates the echo signal (the prompt that is played through the vehicle speakers and captured at the microphone signal). If the VAD triggers after the echo has been cancelled, then this indicates that there is speech in the vehicle cabin (i.e., that the user in the vehicle is attempting to barge-in to interrupt the prompt).
[0005] Additional components, such as microphone gain (Mic Gain), high-pass filtering (HPF), noise suppression (NS), echo suppression, comfort noise (CN), and uplink gain (UL Gain) may also be enabled in order to improve the barge-in detection and/or ASR performance. A block diagram of a typical barge-in detection system is shown in FIG. 2.
[0006] Once the barge-in event is detected, the audio prompt is stopped, and the AEC and additional processing may be bypassed since there is no longer an echo signal to be removed.
[0007] The barge-in detector has two main performance metrics: false negative rate and false positive rate. A false negative condition occurs when the user in the vehicle cabin speaks, but the barge-in event is not detected. In this case, the ASR audio prompt is never stopped, and the signal is not passed to the ASR engine. A false positive condition occurs when the user in the vehicle cabin does not speak, but the barge-in condition is detected. In this case, the ASR audio prompt is stopped prematurely, and the resulting signal is sent to the ASR engine for recognition.
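As an illustration of the two metrics, both rates can be tallied from a set of labeled prompt playbacks. The following is a minimal sketch, assuming each trial is recorded as a hypothetical (user_spoke, barge_in_detected) pair; this labeling scheme and the function name are assumptions for illustration and are not described in this document.

```python
def barge_in_error_rates(trials):
    """Return (false_negative_rate, false_positive_rate).

    trials: iterable of (user_spoke, barge_in_detected) booleans, one per
    prompt playback. The labeling scheme is an assumption for illustration.
    """
    trials = list(trials)
    spoke = [detected for user_spoke, detected in trials if user_spoke]
    silent = [detected for user_spoke, detected in trials if not user_spoke]
    # False negative: the user spoke, but no barge-in event was detected.
    fn_rate = sum(1 for d in spoke if not d) / len(spoke) if spoke else 0.0
    # False positive: the user did not speak, but barge-in was detected.
    fp_rate = sum(1 for d in silent if d) / len(silent) if silent else 0.0
    return fn_rate, fp_rate
```

Each rate is normalized by its own population (trials with speech, trials without speech), so the two metrics can be tuned against each other independently.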
[0008] False negative conditions are undesirable since the user may be required to repeat a command multiple times before the barge-in event is triggered. But false positive conditions are generally considered to be much worse since this disrupts the natural flow of the ASR session. The user did not intend to issue a command, but the voice user interface (VUI) may take an undesirable action or return an error message.
[0009] The recognition rate of the ASR engine is another important indirect metric. If the echo removal is too lax, then there will still be some residual echo in the AEC/ES output signal; this echo could trigger the barge-in detector (generating a false positive), and it could also interfere with the ASR performance. But if the echo removal is too aggressive, then the speech from the user in the vehicle may be attenuated, which may lead to the VAD never triggering (generating a false negative) and/or reduced ASR performance.
[0010] As such, there is a need in the prior art for more reliable detection of voice barge-in for ASR applications.
BRIEF SUMMARY
[0011] In accordance with embodiments of the invention, in an automatic speech recognition (ASR) system, an enhanced-voice-barge-in-detection system that allows a user to interrupt an ASR audio prompt by speaking a next command, the enhanced-voice-barge-in-detection system including: an ASR-prompt voice activity detector (VAD) configured to perform voice activity detection on the ASR-prompt; an acoustic echo canceller (AEC) configured to attenuate the ASR prompt played through the vehicle speakers and captured by the one or more microphones in the vehicle cabin; an uplink voice activity detector (UL VAD) configured to perform voice activity detection on the uplink signal; and a barge-in detector configured to receive data inputs from the ASR-prompt VAD, AEC, and UL VAD to generate a barge-in decision indicating either that barge-in is active or that barge-in is not active such that when the barge-in decision indicates that barge-in is active, the ASR prompt can be stopped and echo processing of the ASR prompt can be bypassed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] FIG. 1 depicts a hands-free audio system, which may be used in an automotive vehicle.
[0013] FIG. 2 is a schematic block diagram of a typical ASR barge-in detection system.
[0014] FIG. 3 depicts a schematic block diagram for an enhanced barge-in detection system in accordance with embodiments of the invention.
[0015] FIG. 4 depicts steps of a method for generating a barge-in decision in accordance with embodiments of the invention.
DETAILED DESCRIPTION
[0016] FIG. 1 depicts a hands-free audio system, which may be used in an automotive vehicle 104. In the vehicle 104, the hands-free audio system 105 comprises a microphone 112 or multiple microphones (only one shown) and a loudspeaker 114 or multiple loudspeakers (one shown). The hands-free audio system may be a vehicle telematics system (optionally including a network access device), a vehicle radio or head unit, or any other similar type of system that contains hands-free speech recognition capability.
[0017] The microphone 112 transduces or “picks up” audio-frequency signals from within the passenger compartment or interior 103 of the vehicle 104 and provides electrical signals representing those audio signals to a controller 130 for the hands-free audio system 105. The microphone 112 thus picks up road noise, wind noise, and engine noise caused by the vehicle being driven about as well as audio signals output from loudspeakers 114 in the cabin 103.
[0018] The loudspeaker 114 portion of the hands-free system 105 receives electrical signals in the audio-frequency range corresponding to ASR prompts from the controller 130 for the hands-free audio system 105. The loudspeaker 114 transduces those electrical signals into sound waves or audio signals 113 that can be heard throughout the passenger compartment 103 of the vehicle 104.
[0019] Audio signals 113 picked up by the microphone 112 are converted to electrical signals that are provided to the controller 130.
[0020] FIG. 2 is a schematic block diagram of a typical ASR barge-in detection system. In the system of FIG. 2, the echo signal is removed (AEC, echo suppression) and then the resulting signal is passed to the voice activity detector (VAD). The VAD can only make a decision based on the input signal. If there is residual echo in this signal, it may trigger the VAD, even though there is no speech from the user in the cabin.
[0021] US Patent 6785365B2, entitled Method and apparatus for facilitating speech barge-in in connection with voice recognition systems, is an example of such a barge-in detector. The method disclosed in the ‘365 patent calculates the energy in the prompt signal (Sp) and the line input signal (Si). The line input signal is the microphone signal after local echo cancellation has been applied but a “prompt residue” (i.e., residual echo) remains; this is equivalent to the “error” signal in the terminology used in this document for describing embodiments of the present invention.
[0022] The goal of the method of the ‘365 patent is to use Sp and Si to compute an attenuation parameter that is used to sufficiently attenuate (or remove) the “prompt residue”. Once the prompt residue has been removed, the resulting signal is compared to a threshold. At this point, any remaining energy above the threshold is considered to be speech (barge-in event is triggered). Acoustic echo cancellation (AEC) and residual echo suppression (ES) are applied to the input signal and then a voice activity detector (VAD) is used to detect barge-in.
[0023] US Published Application 20170178628 A1, entitled Voice activation system, describes having one voice activity detector (VAD) based on microphone input and another voice activity detector (VAD) based on speaker input. Based on these two VAD decisions, an activation decision is made when at least one of the VADs is active. They have different power modes in which one or both VADs may be used. They are not trying to detect speech from the user in the presence of a prompt (barge-in), but rather any speech input, either from the user or being played from the speaker.
[0024] US Published Patent Application 20090036170 A1, entitled Voice activity detector and method, describes a method for near-end voice activity detection in the presence of echo. The detector is used to distinguish between the “far end echo only” and “double talk” segments based on estimated near-end input power. They present a method to estimate the near-end speech level as follows. The error signal e(k) is the sum of the near-end speech [v(k)], the residual echo that the AEC does not remove [r(k)], and the near-end noise [n(k)]. The error signal is available, and the near-end noise level can be estimated using available techniques. They present a novel method of estimating the residual echo level. This method is based on the expected AEC gain (also termed “echo return loss enhancement” or ERLE). The ERLE is the ratio of power in the microphone input signal (prior to AEC) to the power in the error signal (post AEC; prior to ES). (The ERLE of the ‘170 published application is the same as the Pdiff in the detector disclosed in accordance with embodiments of the present invention.)
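The relationship between the ERLE and Pdiff can be written out explicitly. Expressed in decibels, a power ratio becomes a difference of levels, which is why the two quantities correspond. The following is a sketch in the notation of this document; the label ERLE_dB is introduced here for illustration only.

```latex
e(k) = v(k) + r(k) + n(k)

\mathrm{ERLE}_{dB} = 10 \log_{10}\!\left( \frac{P_{mic}}{P_{err}} \right)
                   = P_{mic,dB} - P_{err,dB},
\qquad
P_{diff} = \left| P_{mic,dB} - P_{err,dB} \right|
```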
[0025] Once the power of this r(k) term is estimated, then they estimate the power in the near-end speech signal. This is an estimate of the level of the speech in the microphone signal with the power of the noise and echo removed. They then use a simple energy-based VAD to detect if there is near-end speech. This VAD decision and a similar VAD decision based on the far-end input signal allows them to discriminate between “far end echo only”, “near-end speech only”, and “double talk” conditions.
[0026] The main novelty of their method is the way in which they estimate the near-end speech power. In accordance with embodiments of the present invention, we make use of the microphone power directly (which contains near-end speech, echo, and noise). Both methods make use of the power of the error signal and the ERLE (difference in power in microphone and power in error signal), but in different ways. The method of the ‘170 published application is also more computationally complex than methods in accordance with embodiments of the present invention.
[0027] Environmentally Adaptive Acoustic Echo Suppression for Barge-in Speech Recognition, by Jong Han Joo et al. (World Academy of Science, Engineering and Technology International Journal of Computer and Information Engineering; Vol. 9, No. 1, 2015), proposes a novel technique for acoustic echo suppression (AES) during speech recognition under barge-in conditions. Conventional AES methods based on spectral subtraction apply fixed weights to the estimated echo path transfer function (EPTF) at the current signal segment and to the EPTF estimated until the previous time interval. However, the effects of echo path changes should be considered for eliminating the undesired echoes. They describe an approach that adaptively updates weight parameters in response to abrupt changes in the acoustic environment due to background noises or double-talk. Furthermore, they devised a voice activity detector and an initial time-delay estimator for barge-in speech recognition in communication networks. The initial time delay is estimated using log-spectral distance measure, as well as cross-correlation coefficients. The experimental results show that the developed techniques can be successfully applied in barge-in speech recognition systems.
[0028] During high noise conditions, the double talk detector disclosed by Jong Han Joo et al. will trigger falsely. They compute the “normalized error energy” as
[Equation image imgf000008_0001 not reproduced: normalized error energy E_err(n)]
[0029] where e_s(n) is the error signal (output from AEC) and d_s(n) is the microphone signal. The double talk condition is met if E_err(n) is greater than a threshold (Th_g).
[0030] E_err(n) is Energy{UL speech + noise} / Energy{UL speech + noise + echo}. Even if there is no UL speech in the cabin, the E_err(n) > Th_g condition can easily be met during high noise conditions.
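A short numeric illustration shows why this ratio approaches 1 as noise increases, even with no speech present. The sketch below makes the simplifying assumptions that the component energies add (uncorrelated components) and that the AEC removes the echo entirely; the energy values and the fixed threshold of 0.5 are hypothetical and are not taken from Joo et al.

```python
def normalized_error_energy(speech, noise, echo):
    """Energy{speech + noise} / Energy{speech + noise + echo}.

    Assumes the component energies simply add, which is a simplification
    made for this illustration.
    """
    return (speech + noise) / (speech + noise + echo)

TH_G = 0.5  # hypothetical fixed threshold, for illustration only

# No in-cabin speech in either case; only the noise energy changes.
quiet_cabin = normalized_error_energy(speech=0.0, noise=0.1, echo=1.0)
noisy_cabin = normalized_error_energy(speech=0.0, noise=10.0, echo=1.0)

quiet_triggers = quiet_cabin > TH_G  # ratio is small, no double talk declared
noisy_triggers = noisy_cabin > TH_G  # ratio is near 1, double talk declared falsely
```

In the noisy case the noise dominates both numerator and denominator, so the ratio exceeds any fixed threshold below 1 without any user speech.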
[0031] In the double talk detector in accordance with embodiments of the present invention, the threshold is dynamic and based on the long-term average of the cross power between the error and microphone signal, so it moves with the noise floor. During high noise conditions, it becomes harder to enter the barge-in detected state, which results in fewer false positives.
[0032] The goal of any barge-in detector is to detect in-cabin speech in the presence of the ASR prompt echo. Since this is a specialized use case, a generic VAD may not provide the best performance. There is other information that can be used in order to more accurately detect the barge-in condition.
[0033] FIG. 3 depicts a schematic block diagram for an enhanced barge-in detection system in accordance with embodiments of the invention. A barge-in detector has been added. The barge-in detector receives data inputs from the prompt VAD, AEC, and UL VAD. In FIG. 3, solid lines indicate audio signal input/output, and dashed lines indicate data input/output. The prompt VAD and UL VAD are standard voice activity detectors that are run on the ASR prompt signal and the uplink signal, respectively. The data input from the AEC is the power in the microphone input signal to the AEC and the power in the output signal from the AEC (known as the “error” signal). Both powers are computed as follows for input signal x:
[Equation image imgf000009_0001 not reproduced: frame power of input signal x]
[0034] where N is the frame size (in number of samples). Therefore Pmic refers to the power in the microphone signal, and Perr refers to the power in the error signal.
[0035] Additionally, the cross-power between the microphone and error signals is computed as follows:
[Equation image imgf000010_0001 not reproduced: cross-power between the microphone and error signals]
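The equation images for the frame power and the cross-power are not reproduced in this text, so the exact expressions are not available here. The following is a minimal sketch under stated assumptions: frame power is taken as the mean of the squared samples over a frame of N samples, and cross-power as the mean of the sample-wise products of the microphone and error frames. Taking the absolute value of the cross-power (so that it can later be expressed in decibels) is an assumption of this sketch, as are the function names.

```python
def frame_power(x):
    """Mean-square power of one frame of N samples (assumed form)."""
    return sum(s * s for s in x) / len(x)


def cross_power(mic, err):
    """Mean sample-wise product of two frames of equal length (assumed form).

    The absolute value keeps the result non-negative so that it can be
    converted to dB; this choice is an assumption of the sketch.
    """
    if len(mic) != len(err):
        raise ValueError("frames must have the same length")
    return abs(sum(m * e for m, e in zip(mic, err))) / len(mic)
```

With these definitions, the cross-power of a frame with itself equals its frame power, which is the limiting case in which the AEC removes nothing from the microphone signal.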
[0036] It is often more convenient to work with these values in the logarithmic domain. These power values can be converted to the decibel (dB) equivalent as follows:
$$P_{x,dB} = 10 \log_{10}(P_x)$$
[0037] Therefore, Pmic,dB refers to the power in the microphone signal represented in decibels, Perr,dB refers to the power in the error signal represented in decibels, and Pcross,dB refers to the cross-power between the error and microphone signals represented in decibels.
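The decibel conversion can be sketched as follows. The small floor value that guards against taking the logarithm of zero on an all-zero frame is an assumption added for this illustration and is not part of the description; the function name is likewise hypothetical.

```python
import math


def to_db(power, floor=1e-12):
    """Convert a linear power value to decibels: 10 * log10(power).

    floor is a hypothetical guard so that a silent (all-zero) frame does
    not raise a math domain error.
    """
    return 10.0 * math.log10(max(power, floor))
```

For example, a tenfold increase in linear power corresponds to an increase of 10 dB, and a hundredfold increase to 20 dB.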
[0038] The absolute value of the difference between these power values is computed as follows:
$$P_{diff} = \left| P_{mic,dB} - P_{err,dB} \right|$$
[0039] Additionally, the long-term running average of the cross-power between the error and microphone signals is computed as follows:
[Equation image imgf000010_0003 not reproduced: long-term running average Pcross,dB,LT of the cross-power, with smoothing factor α]
[0040] A preferred value for α (alpha) is 0.05. Other suitable values may also be used.
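The equation image for the long-term running average is not reproduced in this text. One common way to form such an average with a single smoothing factor is first-order exponential smoothing, sketched below. The recursive form, in which alpha weights the newest frame, is an assumption of this sketch and may differ from the expression in the original figure; only the value 0.05 is taken from the description.

```python
def update_long_term(prev_lt_db, cross_db, alpha=0.05):
    """One update of an exponentially smoothed long-term average (assumed form).

    prev_lt_db: previous long-term average of the cross-power, in dB
    cross_db:   cross-power of the current frame, in dB
    alpha:      smoothing factor; 0.05 is the preferred value in the description
    """
    return (1.0 - alpha) * prev_lt_db + alpha * cross_db
```

With a small alpha the average moves slowly, so it tracks the noise floor rather than short bursts of speech or echo.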
[0041] FIG. 4 depicts steps of a method for generating a barge-in decision in accordance with embodiments of the invention. First, the prompt VAD is checked: if the prompt VAD is not active, then this is not really a true barge-in condition (there is no prompt being played). Therefore, the standard UL VAD can be used to generate a decision. If the prompt VAD is active, then the power in the microphone signal (Pmic,dB) is checked: if it is not above Pcross,dB,LT plus a tunable threshold (“thresh1”), then there is not enough energy in the microphone signal to indicate that there is speech from an in-cabin user and/or prompt echo; therefore, there is not barge-in. Otherwise, there is some activity detected, but it is not yet known if it is in-cabin speech or echo. Finally, this is checked by comparing Pdiff to another tunable threshold (“thresh2”). If the difference is smaller than the threshold, then this means that the microphone and (post AEC) error signals are very similar. The error signal is just the microphone signal with the echo removed. Therefore, if they are very similar, then this indicates that there is in-cabin speech and the barge-in detector is set to active. Otherwise, the barge-in detector is set to not active.
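The per-frame decision steps just described can be sketched as a single function. This is an illustrative sketch, not a definitive implementation: the function and parameter names are hypothetical, the inputs are assumed to be already expressed in decibels, and the default thresholds follow the preferred 6 dB value given in this description.

```python
def barge_in_decision(prompt_vad_active, ul_vad_active,
                      p_mic_db, p_err_db, p_cross_db_lt,
                      thresh1=6.0, thresh2=6.0):
    """Return True if barge-in is active for the current frame."""
    # No prompt is being played: not a true barge-in condition, so the
    # standard UL VAD decision is used directly.
    if not prompt_vad_active:
        return ul_vad_active

    # Not enough microphone energy above the long-term cross-power level
    # to indicate either in-cabin speech or prompt echo.
    if not p_mic_db > p_cross_db_lt + thresh1:
        return False

    # Activity is present. If the microphone and post-AEC error signals
    # have similar power, the AEC removed little, which indicates
    # in-cabin speech rather than echo.
    p_diff = abs(p_mic_db - p_err_db)
    return p_diff < thresh2
```

For example, with the prompt playing, a microphone level of -20 dB, an error level of -22 dB, and a long-term cross-power of -40 dB, the function reports barge-in as active; with an error level of -45 dB instead, the large Pdiff indicates that most of the microphone energy was echo, and barge-in is not active.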
[0042] A preferred value is 6 dB for both of the thresholds, “thresh1” and “thresh2”. Other suitable threshold values may be used for either or both of these thresholds.
[0043] It will typically be useful to require multiple frames of barge-in active decisions in a row before the barge-in event is triggered and the audio prompt is stopped. A preferred value is five consecutive 10 ms frames (or an equivalent number of frames corresponding to 50 ms if using a different frame size).
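The consecutive-frame requirement can be sketched as a small counter that resets whenever a frame decision is not active. The class and method names are hypothetical; the default of five frames follows the preferred value stated above for 10 ms frames.

```python
class BargeInTrigger:
    """Fires only after a run of consecutive active frame decisions."""

    def __init__(self, required_frames=5):
        self.required_frames = required_frames
        self.count = 0

    def update(self, frame_decision):
        """Feed one per-frame decision; return True once the run is long enough."""
        # Any inactive frame breaks the run and restarts the count.
        self.count = self.count + 1 if frame_decision else 0
        return self.count >= self.required_frames
```

A single spurious active frame therefore cannot stop the prompt; only a sustained run of active decisions does.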
[0044] The foregoing description is for purposes of illustration only. The true scope of the invention is set forth in the following claims.

Claims

1. In an automatic speech recognition (ASR) system, an enhanced-voice-barge-in-detection system that allows a user to interrupt an ASR audio prompt by speaking a next command, the enhanced-voice-barge-in-detection system comprising:
an ASR-prompt voice activity detector (VAD) configured to perform voice activity detection on the ASR-prompt;
an acoustic echo canceller (AEC) configured to attenuate the ASR prompt played through the vehicle speakers and captured by the one or more microphones in the vehicle cabin;
an uplink voice activity detector (UL VAD) configured to perform voice activity detection on the uplink signal; and
a barge-in detector configured to receive data inputs from the ASR-prompt VAD, AEC, and UL VAD to generate a barge-in decision indicating either that barge-in is active or that barge-in is not active such that when the barge-in decision indicates that barge-in is active, the ASR prompt can be stopped and echo processing of the ASR prompt can be bypassed.
2. The enhanced-voice-barge-in-detection system of claim 1, wherein the AEC provides to the barge-in detector the power in the microphone input signal to the AEC and an error signal that represents the power in the output signal from the AEC.
3. The enhanced-voice-barge-in-detection system of claim 2, wherein the barge-in detector calculates a cross-power between the error signal and the microphone signal.
4. The enhanced-voice-barge-in-detection system of claim 3, wherein the barge-in detector calculates a long-term running average of the cross-power between the error signal and the microphone signal.
5. A method for generating a barge-in decision during an automatic speech recognition (ASR) prompt, the method comprising:
determining whether an ASR-prompt voice activity detector (VAD) is active;
if the ASR-prompt VAD is not active, then using a standard uplink (UL) VAD to generate a decision because this is not a true barge-in condition as there is no ASR-prompt being played;
otherwise, if the ASR-prompt VAD is active, then determining: the power in the microphone signal (Pmic.dB), the power in the error signal (Perr.dB), the cross power between the error signal and the microphone signal (Pcross.dB) and an absolute value of the difference between the power in the microphone signal and the power in the error signal (Pdiff), and a long-term running average of the cross-power between the error and microphone signals (Pcross.dB.LT);
if Pmic.dB is not above Pcross.dB.LT plus a first tunable threshold, then setting the barge-in decision to inactive because there is not enough energy in the microphone signal to indicate that there is speech from at least one of an in-cabin user and an ASR-prompt echo;
otherwise, if Pmic.dB is above Pcross.dB.LT plus the first tunable threshold, comparing Pdiff to a second tunable threshold, and if Pdiff is smaller than the second tunable threshold, then setting the barge-in decision to active because the microphone signal and the post-AEC error signal being very similar to each other, and the error signal being the microphone signal with the echo removed, together mean that there is in-cabin speech; and
otherwise, if Pdiff is not smaller than the second tunable threshold, then the barge-in detector is set to not active.
6. The method of claim 5, further comprising: requiring a plurality of consecutive frames of barge-in active decisions before a barge-in event is triggered and the ASR prompt is stopped.
7. The method of claim 6, wherein the plurality of consecutive frames corresponds to a duration of at least 50 milliseconds.
8. The method of claim 5, wherein the first tunable threshold is 6 dB.
9. The method of claim 5, wherein the second tunable threshold is 6 dB.
PCT/US2019/020305 2018-03-02 2019-03-01 Enhanced barge-in detector WO2019169272A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862637779P 2018-03-02 2018-03-02
US62/637,779 2018-03-02

Publications (1)

Publication Number Publication Date
WO2019169272A1 (en) 2019-09-06

Family

ID=65763896

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/020305 WO2019169272A1 (en) 2018-03-02 2019-03-01 Enhanced barge-in detector

Country Status (1)

Country Link
WO (1) WO2019169272A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11817091B1 (en) 2020-09-30 2023-11-14 Amazon Technologies, Inc. Fault-tolerance techniques for dialog-driven applications
US11948019B1 (en) * 2020-09-30 2024-04-02 Amazon Technologies, Inc. Customized configuration of multimodal interactions for dialog-driven applications

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6785365B2 (en) 1996-05-21 2004-08-31 Speechworks International, Inc. Method and apparatus for facilitating speech barge-in in connection with voice recognition systems
US20090036170A1 (en) 2007-07-30 2009-02-05 Texas Instruments Incorporated Voice activity detector and method
US20090254342A1 (en) * 2008-03-31 2009-10-08 Harman Becker Automotive Systems Gmbh Detecting barge-in in a speech dialogue system
US20110238417A1 (en) * 2010-03-26 2011-09-29 Kabushiki Kaisha Toshiba Speech detection apparatus
US20170178628A1 (en) 2015-12-22 2017-06-22 Nxp B.V. Voice activation system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JONG HAN JOO ET AL.: "Environmentally Adaptive Acoustic Echo Suppression for Barge-in Speech Recognition", WORLD ACADEMY OF SCIENCE, ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF COMPUTER AND INFORMATION ENGINEERING, vol. 9, no. 1, 2015

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19710964

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19710964

Country of ref document: EP

Kind code of ref document: A1