WO2019169272A1

WO2019169272A1 - Enhanced barge-in detector

Info

Publication number: WO2019169272A1
Application number: PCT/US2019/020305
Authority: WO
Inventors: Matthew R. KIRSCH; Guillaume Lamy
Original assignee: Continental Automotive Systems, Inc.
Priority date: 2018-03-02
Filing date: 2019-03-01
Publication date: 2019-09-06

Abstract

In an automatic speech recognition (ASR) system, an enhanced-voice-barge-in-detection system that allows a user to interrupt an ASR audio prompt by speaking a next command, the enhanced-voice-barge-in-detection system including: an ASR-prompt voice activity detector (VAD) configured to perform voice activity detection on the ASR-prompt; an acoustic echo canceller (AEC) configured to attenuate the ASR prompt played through the vehicle speakers and captured by the one or more microphones in the vehicle cabin; an uplink voice activity detector (UL VAD) configured to perform voice activity detection on the uplink signal; and a barge-in detector configured to receive data inputs from the ASR-prompt VAD, AEC, and UL VAD to generate a barge-in decision indicating either that barge-in is active or that barge-in is not active such that when the barge-in decision indicates that barge-in is active, the ASR prompt can be stopped and echo processing of the ASR prompt can be bypassed.

Description

ENHANCED BARGE-IN DETECTOR

BACKGROUND

[0001] Many motor vehicles are provided with “hands-free” systems, which are typically speaker phones that enable a driver or passenger to use his or her cell phone without having to hold the device against the user’s ear. Hands-free systems may also provide automatic speech recognition of voice commands, such as entering a destination into a navigation system, placing a telephone call to someone in a contacts list, and the like.

[0002] Hands-free systems typically comprise one or more microphones and one or more speakers. The microphone or microphones are typically non-directional and designed to pick up sounds from almost anywhere inside a vehicle, including a user’s voice. The speaker or speakers are typically designed to provide audio at power levels that can be heard anywhere inside the vehicle.

[0003] In automatic speech recognition (ASR) systems, voice barge-in allows a user to interrupt an ASR audio prompt by speaking the next command (i.e., barging in and interrupting the prompt). With this functionality, the user does not have to wait for the end of the prompt in order to proceed with the ASR session. For example, if the audio prompt specifies several options, the user does not have to wait for all options to be listed before making a selection.

[0004] To detect barge-in, acoustic echo cancellation (AEC) and a voice activity detector (VAD) are typically used. The AEC significantly attenuates the echo signal (the prompt that is played through the vehicle speakers and captured at the microphone signal). If the VAD triggers after the echo has been cancelled, then this indicates that there is speech in the vehicle cabin (i.e., that the user in the vehicle is attempting to barge in to interrupt the prompt).

[0005] Additional components, such as microphone gain (Mic Gain), high-pass filtering (HPF), noise suppression (NS), echo suppression, comfort noise (CN), and uplink gain (UL Gain) may also be enabled in order to improve the barge-in detection and/or ASR performance. A block diagram of a typical barge-in detection system is shown in FIG. 2.

[0006] Once the barge-in event is detected, the audio prompt is stopped, and the AEC and additional processing may be bypassed since there is no longer an echo signal to be removed.

[0007] The barge-in detector has two main performance metrics: false negative rate and false positive rate. A false negative condition occurs when the user in the vehicle cabin speaks, but the barge-in event is not detected. In this case, the ASR audio prompt is never stopped, and the signal is not passed to the ASR engine. A false positive condition occurs when the user in the vehicle cabin does not speak, but the barge-in condition is detected. In this case, the ASR audio prompt is stopped prematurely, and the resulting signal is sent to the ASR engine for recognition.

[0008] False negative conditions are undesirable since the user may be required to repeat a command multiple times before the barge-in event is triggered. But false positive conditions are generally considered to be much worse since this disrupts the natural flow of the ASR session. The user did not intend to issue a command, but the voice user interface (VUI) may take an undesirable action or return an error message.

[0009] The recognition rate of the ASR engine is another important indirect metric. If the echo removal is too lax, then there will still be some residual echo in the AEC/ES output signal; this echo could trigger the barge-in detector (generating a false positive), and it could also interfere with the ASR performance. But if the echo removal is too aggressive, then the speech from the user in the vehicle may be attenuated, which may lead to the VAD never triggering (generating a false negative) and/or reduced ASR performance.

[0010] As such, there is a need in the prior art for more reliable detection of voice barge-in for ASR applications.

BRIEF SUMMARY

[0011] In accordance with embodiments of the invention, in an automatic speech recognition (ASR) system, an enhanced-voice-barge-in-detection system that allows a user to interrupt an ASR audio prompt by speaking a next command, the enhanced-voice-barge-in-detection system including: an ASR-prompt voice activity detector (VAD) configured to perform voice activity detection on the ASR-prompt; an acoustic echo canceller (AEC) configured to attenuate the ASR prompt played through the vehicle speakers and captured by the one or more microphones in the vehicle cabin; an uplink voice activity detector (UL VAD) configured to perform voice activity detection on the uplink signal; and a barge-in detector configured to receive data inputs from the ASR-prompt VAD, AEC, and UL VAD to generate a barge-in decision indicating either that barge-in is active or that barge-in is not active such that when the barge-in decision indicates that barge-in is active, the ASR prompt can be stopped and echo processing of the ASR prompt can be bypassed.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] FIG. 1 depicts a hands-free audio system, which may be used in an automotive vehicle.

[0013] FIG. 2 is a schematic block diagram of a typical ASR barge-in detection system.

[0014] FIG. 3 depicts a schematic block diagram for an enhanced barge-in detection system in accordance with embodiments of the invention.

[0015] FIG. 4 depicts steps of a method for generating a barge-in decision in accordance with embodiments of the invention.

DETAILED DESCRIPTION

[0016] FIG. 1 depicts a hands-free audio system, which may be used in an automotive vehicle 104. In the vehicle 104, the hands-free audio system 105 comprises a microphone 112 or multiple microphones (only one shown) and a loudspeaker 114 or multiple loudspeakers (one shown). The hands-free audio system may be a vehicle telematics system (optionally including a network access device), a vehicle radio or head unit, or any other similar type of system that contains hands-free speech recognition capability.

[0017] The microphone 112 transduces or “picks up” audio-frequency signals from within the passenger compartment or interior 103 of the vehicle 104 and provides electrical signals representing those audio signals to a controller 130 for the hands-free audio system 105. The microphone 112 thus picks up road noise, wind noise, and engine noise caused by the vehicle being driven about as well as audio signals output from loudspeakers 114 in the cabin 103.

[0018] The loudspeaker 114 portion of the hands-free system 105 receives electrical signals in the audio-frequency range corresponding to ASR prompts from the controller 130 for the hands-free audio system 105. The loudspeaker 114 transduces those electrical signals into sound waves or audio signals 113 that can be heard throughout the passenger compartment 103 of the vehicle 104.

[0019] Audio signals 113 picked up by the microphone 112 are converted to electrical signals that are provided to the controller 130.

[0020] FIG. 2 is a schematic block diagram of a typical ASR barge-in detection system. In the system of FIG. 2, the echo signal is removed (AEC, echo suppression) and then the resulting signal is passed to the voice activity detector (VAD). The VAD can only make a decision based on the input signal. If there is residual echo in this signal, it may trigger the VAD, even though there is no speech from the user in the cabin.

[0021] US Patent 6785365B2, entitled Method and apparatus for facilitating speech barge-in in connection with voice recognition systems, is an example of such a barge-in detector. The method disclosed in the ‘365 patent calculates the energy in the prompt signal (Sp) and the line input signal (Si). The line input signal is the microphone signal after local echo cancellation has been applied but a “prompt residue” (i.e., residual echo) remains; this is equivalent to the “error” signal in the terminology used in this document for describing embodiments of the present invention.

[0022] The goal of the method of the ‘365 patent is to use Sp and Si to compute an attenuation parameter that is used to sufficiently attenuate (or remove) the “prompt residue”. Once the prompt residue has been removed, the resulting signal is compared to a threshold. At this point, any remaining energy above the threshold is considered to be speech (barge-in event is triggered). Acoustic echo cancellation (AEC) and residual echo suppression (ES) are applied to the input signal and then a voice activity detector (VAD) is used to detect barge-in.

[0023] US Published Application 20170178628 A1, entitled Voice activation system, describes having one voice activity detector (VAD) based on microphone input and another voice activity detector (VAD) based on speaker input. Based on these two VAD decisions, an activation decision is made when at least one of the VADs is active. They have different power modes in which one or both VADs may be used. They are not trying to detect speech from the user in the presence of a prompt (barge-in), but rather any speech input, either from the user or being played from the speaker.

[0024] US Published Patent Application 20090036170 A1, entitled Voice activity detector and method, describes a method to detect near-end voice activity in the presence of echo. The detector is used to distinguish between the “far end echo only” and “double talk” segments based on estimated near-end input power. They present a method to estimate the near-end speech level as follows. The error signal e(k) is the sum of the near-end speech [v(k)], the residual echo that the AEC does not remove [r(k)], and the near-end noise [n(k)]. The error signal is available, and the near-end noise level can be estimated using available techniques. They present a novel method of estimating the residual echo level. This method is based on the expected AEC gain (also termed “echo return loss enhancement” or ERLE). The ERLE is the ratio of power in the microphone input signal (prior to AEC) to the power in the error signal (post AEC; prior to ES). (The ERLE of the ‘170 published application is the same as the Pdiff in the detector disclosed in accordance with embodiments of the present invention.)

[0025] Once the power of this r(k) term is estimated, then they estimate the power in the near-end speech signal. This is an estimate of the level of the speech in the microphone signal with the power of the noise and echo removed. They then use a simple energy-based VAD to detect if there is near-end speech. This VAD decision and a similar VAD decision based on the far-end input signal allows them to discriminate between “far end echo only”, “near-end speech only”, and “double talk” conditions.

[0026] The main novelty of their method is the way in which they estimate the near-end speech power. In accordance with embodiments of the present invention, we make use of the microphone power directly (which contains near-end speech, echo, and noise). Both methods make use of the power of the error signal and the ERLE (difference in power in microphone and power in error signal), but in different ways. The method of the ‘170 published application is also more computationally complex than methods in accordance with embodiments of the present invention.

[0027] Environmentally Adaptive Acoustic Echo Suppression for Barge-in Speech Recognition, by Jong Han Joo et al., (World Academy of Science, Engineering and Technology International Journal of Computer and Information Engineering; Vol:9, No:1, 2015) proposes a novel technique for acoustic echo suppression (AES) during speech recognition under barge-in conditions. Conventional AES methods based on spectral subtraction apply fixed weights to the estimated echo path transfer function (EPTF) at the current signal segment and to the EPTF estimated until the previous time interval. However, the effects of echo path changes should be considered for eliminating the undesired echoes. They describe an approach that adaptively updates weight parameters in response to abrupt changes in the acoustic environment due to background noises or double-talk. Furthermore, they devised a voice activity detector and an initial time-delay estimator for barge-in speech recognition in communication networks. The initial time delay is estimated using log-spectral distance measure, as well as cross-correlation coefficients. The experimental results show that the developed techniques can be successfully applied in barge-in speech recognition systems.

[0028] During high noise conditions, the double talk detector disclosed by Jong Han Joo et al. will false trigger. They compute the “normalized error energy” as

E_err(n) = Energy{e_s(n)} / Energy{d_s(n)}

[0029] where e_s(n) is the error signal (output from AEC) and d_s(n) is the microphone signal. The double talk condition is met if E_err(n) is greater than a threshold (Th_g).

[0030] E_err(n) is Energy{UL speech + noise} / Energy{UL speech + noise + echo}. Even if there is no UL speech in the cabin, the E_err(n) > Th_g condition can easily be met during high noise conditions.
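The false-trigger behavior described above can be illustrated with a short Python sketch. It is not taken from the cited paper: the ideal AEC (echo removed completely, so the error signal equals the cabin noise), the Gaussian stand-ins for noise and echo, the fixed threshold `TH_G = 0.5`, and all function names are assumptions chosen for illustration only.

```python
import random


def energy(frame):
    """Sum of squared samples in one frame."""
    return sum(s * s for s in frame)


def normalized_error_energy(err, mic):
    """Ratio of post-AEC (error) energy to microphone energy."""
    return energy(err) / energy(mic)


def simulate(noise_std, echo_std=0.1, n=160, seed=0):
    """One frame containing echo and noise but no in-cabin speech.

    Assumes an ideal AEC, so the error signal is just the cabin noise.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, noise_std) for _ in range(n)]
    echo = [rng.gauss(0.0, echo_std) for _ in range(n)]
    mic = [a + b for a, b in zip(noise, echo)]
    err = noise
    return normalized_error_energy(err, mic)


TH_G = 0.5  # hypothetical fixed threshold, not a value from the paper

quiet = simulate(noise_std=0.01)  # noise far below the echo level
loud = simulate(noise_std=0.3)    # noise well above the echo level

print("quiet cabin exceeds threshold:", quiet > TH_G)
print("noisy cabin exceeds threshold:", loud > TH_G)
```

With no speech present in either case, only the noisy frame pushes the ratio toward 1 and past the fixed threshold, which is the false-positive mechanism the paragraph describes.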

[0031] In the double talk detector in accordance with embodiments of the present invention, the threshold is dynamic and based on the long-term average of the cross power between the error and microphone signal, so it moves with the noise floor. During high noise conditions, it becomes harder to enter the barge-in detected state, which results in fewer false positives.

[0032] The goal of any barge-in detector is to detect in-cabin speech in the presence of the ASR prompt echo. Since this is a specialized use case, a generic VAD may not provide the best performance. There is other information that can be used in order to more accurately detect the barge-in condition.

[0033] FIG. 3 depicts a schematic block diagram for an enhanced barge-in detection system in accordance with embodiments of the invention. A barge-in detector has been added. The barge-in detector receives data inputs from the prompt VAD, AEC, and UL VAD. In FIG. 3, solid lines indicate audio signal input/output, and dashed lines indicate data input/output. The prompt VAD and UL VAD are standard voice activity detectors that are run on the ASR prompt signal and the uplink signal, respectively. The data input from the AEC is the power in the microphone input signal to the AEC and the power in the output signal from the AEC (known as the “error” signal). Both powers are computed as follows for input signal x

[0034] where N is the frame size (in number of samples). Therefore P_mic refers to the power in the microphone signal, and P_err refers to the power in the error signal.

[0035] Additionally, the cross-power between the microphone and error signals is computed as follows:

[0036] It is often more convenient to work with these values in the logarithmic domain. These power values can be converted to the decibel (dB) equivalent as follows:

P_x.dB = 10 log₁₀(P_x)

[0037] Therefore, P_mic.dB refers to the power in the microphone signal represented in decibels, P_err.dB refers to the power in the error signal represented in decibels, and P_cross.dB refers to the cross-power between the error and microphone signals represented in decibels.
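The per-frame quantities named above can be illustrated with a short Python sketch. The equations themselves are not reproduced in this text, so the mean-square form of the frame power, the absolute value applied to the cross-power, and the small floor `EPS` applied before taking the logarithm are assumptions made for illustration, and the function names are hypothetical.

```python
import math

EPS = 1e-12  # assumed floor so that log10 is never applied to zero


def frame_power(x):
    """Mean-square power of one frame of N samples (assumed form)."""
    return sum(s * s for s in x) / len(x)


def cross_power(mic, err):
    """Magnitude of the mean sample-wise product of two frames (assumed form)."""
    return abs(sum(m * e for m, e in zip(mic, err))) / len(mic)


def to_db(p):
    """Convert a linear power value to decibels: 10 * log10(P)."""
    return 10.0 * math.log10(max(p, EPS))


# Toy 4-sample frames; a real frame would hold N samples (e.g. 10 ms of audio).
mic_frame = [0.5, -0.5, 0.5, -0.5]
err_frame = [0.25, -0.25, 0.25, -0.25]

p_mic_db = to_db(frame_power(mic_frame))
p_err_db = to_db(frame_power(err_frame))
p_cross_db = to_db(cross_power(mic_frame, err_frame))
print(p_mic_db, p_err_db, p_cross_db)
```

When the error frame is a scaled copy of the microphone frame, as in this toy input, the cross-power in decibels falls midway between the two individual powers.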

[0038] The absolute value of the difference between these power values is computed as follows:

Pdiff = |P_mic.dB - P_err.dB|

[0039] Additionally, the long-term running average of the cross-power between the error and microphone signals is computed as follows:

[0040] A preferred value for α (alpha) is 0.05. Other suitable values may also be used.
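A long-term running average with a small smoothing factor is commonly implemented as a first-order recursive (exponential) average. The Python sketch below assumes that form, since the equation is not reproduced in this text; the update rule, the choice to average the decibel values, the initial value, and the function names are illustrative assumptions rather than the disclosed formula.

```python
ALPHA = 0.05  # preferred smoothing factor stated in the description


def update_long_term(prev_lt_db, p_cross_db, alpha=ALPHA):
    """One step of an assumed recursive average.

    LT[k] = (1 - alpha) * LT[k-1] + alpha * P_cross.dB[k]
    """
    return (1.0 - alpha) * prev_lt_db + alpha * p_cross_db


def track(p_cross_db_frames, initial_db):
    """Run the average over a sequence of per-frame cross-power values."""
    lt = initial_db
    history = []
    for value in p_cross_db_frames:
        lt = update_long_term(lt, value)
        history.append(lt)
    return history


# The average drifts slowly from a hypothetical -60 dB start toward a steady -40 dB input.
history = track([-40.0] * 100, initial_db=-60.0)
print(history[0], history[-1])
```

A small alpha makes the average respond over many frames, so it follows slow changes such as the noise floor rather than individual frames.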

[0041] FIG. 4 depicts steps of a method for generating a barge-in decision in accordance with embodiments of the invention. First, the prompt VAD is checked: if the prompt VAD is not active, then this is not really a true barge-in condition (there is no prompt being played). Therefore, the standard UL VAD can be used to generate a decision. If the prompt VAD is active, then the power in the microphone signal (P_mic.dB) is checked: if it is not above P_cross.dB.LT plus a tunable threshold (“thresh1”), then there is not enough energy in the microphone signal to indicate that there is speech from an in-cabin user and/or prompt echo; therefore, there is no barge-in. Otherwise, there is some activity detected, but it is not yet known if it is in-cabin speech or echo. Finally, this is checked by comparing Pdiff to another tunable threshold (“thresh2”). If the difference is smaller than the threshold, then this means that the microphone and (post AEC) error signals are very similar. The error signal is just the microphone signal with the echo removed. Therefore, if they are very similar, then this indicates that there is in-cabin speech and the barge-in detector is set to active. Otherwise, the barge-in detector is set to not active.
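The decision steps of FIG. 4 described above can be sketched in Python as follows. This is a minimal illustration rather than the disclosed implementation: the function and parameter names are hypothetical, the inputs are assumed to be precomputed per frame, and the 6 dB defaults follow the preferred threshold values given in this description.

```python
def barge_in_decision(prompt_vad, ul_vad, p_mic_db, p_err_db,
                      p_cross_db_lt, thresh1=6.0, thresh2=6.0):
    """Return True if barge-in is active for the current frame."""
    # No prompt is playing: not a true barge-in condition, use the UL VAD.
    if not prompt_vad:
        return ul_vad

    # Microphone power not above the long-term cross-power plus thresh1:
    # not enough energy for in-cabin speech and/or prompt echo.
    if p_mic_db <= p_cross_db_lt + thresh1:
        return False

    # Activity is present; decide whether it is speech or echo.
    # A small difference means the AEC removed little, so the microphone
    # signal was mostly in-cabin speech rather than prompt echo.
    p_diff = abs(p_mic_db - p_err_db)
    return p_diff < thresh2


# Prompt playing, AEC removes only 2 dB: treated as in-cabin speech.
speech_frame = barge_in_decision(True, False, -30.0, -32.0, -50.0)
# Prompt playing, AEC removes 20 dB: treated as prompt echo.
echo_frame = barge_in_decision(True, False, -30.0, -50.0, -50.0)
print(speech_frame, echo_frame)
```

Passing the UL VAD result as an input keeps the function free of any particular VAD implementation, which is a design choice of this sketch only.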

[0042] A preferred value is 6 dB for both of the thresholds, “thresh1” and “thresh2”. Other suitable threshold values may be used for either or both of these thresholds.

[0043] It will typically be useful to require multiple frames of barge-in active decisions in a row before the barge-in event is triggered and the audio prompt is stopped. A preferred value is five consecutive 10ms frames (or an equivalent number of frames corresponding to 50ms if using a different frame size).
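The consecutive-frame requirement can be sketched as a simple counter that resets whenever a frame decision is not active. The class and attribute names below are hypothetical, and the default of five frames assumes the 10 ms frame size mentioned above.

```python
class BargeInTrigger:
    """Fires only after a run of consecutive barge-in active frame decisions."""

    def __init__(self, required_frames=5):
        self.required_frames = required_frames
        self.count = 0

    def update(self, frame_decision):
        """Feed one per-frame decision; return True once the run is long enough."""
        if frame_decision:
            self.count += 1
        else:
            self.count = 0  # any inactive frame restarts the run
        return self.count >= self.required_frames


trigger = BargeInTrigger()
decisions = [True, True, True, False, True, True, True, True, True]
events = [trigger.update(d) for d in decisions]
print(events)
```

In this input the early run of three active frames is interrupted, so the event fires only on the last frame, when five active decisions in a row have been seen.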

[0044] The foregoing description is for purposes of illustration only. The true scope of the invention is set forth in the following claims.

Claims

1. In an automatic speech recognition (ASR) system, an enhanced-voice-barge-in-detection system that allows a user to interrupt an ASR audio prompt by speaking a next command, the enhanced-voice-barge-in-detection system comprising:

an ASR-prompt voice activity detector (VAD) configured to perform voice activity detection on the ASR-prompt;

an acoustic echo canceller (AEC) configured to attenuate the ASR prompt played through the vehicle speakers and captured by the one or more microphones in the vehicle cabin;

an uplink voice activity detector (UL VAD) configured to perform voice activity detection on the uplink signal; and

a barge-in detector configured to receive data inputs from the ASR-prompt VAD, AEC, and UL VAD to generate a barge-in decision indicating either that barge-in is active or that barge-in is not active such that when the barge-in decision indicates that barge-in is active, the ASR prompt can be stopped and echo processing of the ASR prompt can be bypassed.

2. The enhanced-voice-barge-in-detection system of claim 1, wherein the AEC provides to the barge-in detector the power in the microphone input signal to the AEC and an error signal that represents the power in the output signal from the AEC.

3. The enhanced-voice-barge-in-detection system of claim 2, wherein the barge-in detector calculates a cross-power between the error signal and the microphone signal.

4. The enhanced-voice-barge-in-detection system of claim 3, wherein the barge-in detector calculates a long-term running average of the cross-power between the error signal and the microphone signal.

5. A method for generating a barge-in decision during an automatic speech recognition (ASR) prompt, the method comprising:

determining whether an ASR-prompt voice activity detector (VAD) is active; if the ASR-prompt VAD is not active, then using a standard uplink (UL) VAD to generate a decision because this is not a true barge-in condition as there is no ASR-prompt being played;

otherwise, if the ASR-prompt VAD is active, then determining: the power in the microphone signal (P_mic.dB), the power in the error signal (P_err.dB), the cross power between the error signal and the microphone signal (P_cross.dB) and an absolute value of the difference between the power in the microphone signal and the power in the error signal (Pdiff), and a long-term running average of the cross-power between the error and microphone signals (P_cross.dB.LT);

if P_mic.dB is not above P_cross.dB.LT plus a first tunable threshold, then setting the barge-in decision to inactive because there is not enough energy in the microphone signal to indicate that there is speech from at least one of an in-cabin user and an ASR-prompt echo;

otherwise, if P_mic.dB is above P_cross.dB.LT plus the first tunable threshold, comparing Pdiff to a second tunable threshold, and if Pdiff is smaller than the second tunable threshold, then setting the barge-in decision to active because the microphone signal and the post-AEC error signal being very similar to each other, and the error signal being the microphone signal with the echo removed, together mean that there is in-cabin speech; and

otherwise, if Pdiff is not smaller than the second tunable threshold, then the barge-in detector is set to not active.

6. The method of claim 5, further comprising: requiring a plurality of consecutive frames of barge-in active decisions before a barge-in event is triggered and the ASR prompt is stopped.

7. The method of claim 6, wherein the plurality of consecutive frames corresponds to a duration of at least 50 milliseconds.

8. The method of claim 5, wherein the first tunable threshold is 6 dB.

9. The method of claim 5, wherein the second tunable threshold is 6 dB.