WO2014133525A1 - Server-side asr adaptation to speaker, device and noise condition via non-asr audio transmission - Google Patents

Server-side asr adaptation to speaker, device and noise condition via non-asr audio transmission

Info

Publication number
WO2014133525A1
WO2014133525A1, PCT/US2013/028288, US2013028288W
Authority
WO
WIPO (PCT)
Prior art keywords
asr
audio
speech
mobile device
input
Prior art date
Application number
PCT/US2013/028288
Other languages
French (fr)
Other versions
WO2014133525A8 (en)
Inventor
Daniel Willett
Jean-guy E. DAHAN
William F. Ganong
Jianxiong Wu
Original Assignee
Nuance Communications, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nuance Communications, Inc.
Priority to PCT/US2013/028288
Priority to US14/770,371 (now US9679560B2)
Publication of WO2014133525A1
Publication of WO2014133525A8
Priority to US15/619,877 (now US10229701B2)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 - Adaptation
    • G10L 15/07 - Adaptation to the speaker
    • G10L 15/075 - Adaptation to the speaker supervised, i.e. under machine guidance
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 - Sound input; Sound output
    • G06F 3/167 - Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/1815 - Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 - Detection of presence or absence of voice signals
    • G10L 25/84 - Detection of presence or absence of voice signals for discriminating voice from noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 - Adaptation
    • G10L 15/07 - Adaptation to the speaker
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/28 - Constructional details of speech recognition systems
    • G10L 15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/226 - Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

A mobile device is adapted for automatic speech recognition (ASR). A user interface for interaction with a user includes an input microphone for obtaining speech inputs from the user for automatic speech recognition, and an output interface for system output to the user based on ASR results that correspond to the speech input. A local controller obtains a sample of non-ASR audio from the input microphone for ASR-adaptation to channel-specific ASR characteristics, and then provides a representation of the non-ASR audio to a remote ASR server for server-side adaptation to the channel-specific ASR characteristics, and then provides a representation of an unknown ASR speech input from the input microphone to the remote ASR server for determining ASR results corresponding to the unknown ASR speech input, and then provides the system output to the output interface.

Description

TITLE
Server-Side ASR Adaptation to Speaker, Device and Noise Condition Via Non-ASR
Audio Transmission
TECHNICAL FIELD
[0001] The invention generally relates to automatic speech recognition (ASR), and more specifically, to client-server ASR on mobile devices.
BACKGROUND ART
[0002] An automatic speech recognition (ASR) system determines a semantic meaning of a speech input. Typically, the input speech is processed into a sequence of digital speech feature frames. Each speech feature frame can be thought of as a multi-dimensional vector that represents various characteristics of the speech signal present during a short time window of the speech. For example, the multi-dimensional vector of each speech frame can be derived from cepstral features of the short time Fourier transform spectrum of the speech signal (MFCCs)— the short time power or component of a given frequency band— as well as the corresponding first- and second-order derivatives ("deltas" and "delta-deltas"). In a continuous recognition system, variable numbers of speech frames are organized as "utterances" representing a period of speech followed by a pause, which in real life loosely corresponds to a spoken sentence or phrase.
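As a concrete illustration of the feature frames described above, the following minimal sketch computes 13-dimensional MFCCs plus first- and second-order deltas over 25 ms windows with a 10 ms hop. The librosa toolkit, the 16 kHz mono input file, and the feature dimensionality are assumptions for illustration; the patent does not prescribe any particular toolkit or parameters.

```python
# Minimal sketch of the per-frame feature extraction described in [0002].
# Assumes librosa and a 16 kHz mono waveform; frame size and dimensionality
# are illustrative choices, not taken from the patent.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)        # hypothetical input file

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms windows, 10 ms hop
delta = librosa.feature.delta(mfcc)                     # first-order derivatives
delta2 = librosa.feature.delta(mfcc, order=2)           # second-order derivatives

# Stack into one multi-dimensional vector per 10 ms speech frame (39 dims here).
frames = np.vstack([mfcc, delta, delta2]).T
print(frames.shape)   # (num_frames, 39)
```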
[0003] The ASR system compares the input utterances to find statistical acoustic models that best match the vector sequence characteristics and determines corresponding representative text associated with the acoustic models. More formally, given some input observations A, the probability that some string of words W was spoken is represented as P(W|A), where the ASR system attempts to determine the most likely word string:

W = arg max_W P(W|A)

Given a system of statistical acoustic models, this formula can be re-expressed as:

W = arg max_W P(W) P(A|W)
where P(A|W) corresponds to the acoustic models and P(W) reflects the prior probability of the word sequence as provided by a statistical language model.

[0004] The acoustic models are typically probabilistic state sequence models such as hidden Markov models (HMMs) that model speech sounds using mixtures of probability distribution functions (Gaussians). Acoustic models often represent phonemes in specific contexts, referred to as PELs (Phonetic Elements), e.g. triphones or phonemes with known left and/or right contexts. State sequence models can be scaled up to represent words as connected sequences of acoustically modeled phonemes, and phrases or sentences as connected sequences of words. When the models are organized together as words, phrases, and sentences, additional language-related information is also typically incorporated into the models in the form of a statistical language model.
[0005] The words or phrases associated with the best matching model structures are referred to as recognition candidates or hypotheses. A system may produce a single best recognition candidate - the recognition result - or multiple recognition hypotheses in various forms such as an N-best list, a recognition lattice, or a confusion network. Further details regarding continuous speech recognition are provided in U.S. Patent No. 5,794,189, entitled "Continuous Speech Recognition," and U.S. Patent No. 6,167,377, entitled "Speech Recognition Language Models," the contents of which are incorporated herein by reference.
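To make the roles of P(W) and P(A|W) concrete, the sketch below rescores a hypothetical N-best list by combining acoustic and language model log-probabilities in the log domain. The hypotheses, scores, and language-model weight are invented for illustration and are not taken from the patent or from any particular recognizer.

```python
# Hypothetical N-best list: (word string, acoustic log-prob, language-model log-prob).
n_best = [
    ("recognize speech",   -120.3, -4.1),
    ("wreck a nice beach", -118.9, -9.7),
    ("recognized speech",  -121.0, -5.2),
]

LM_WEIGHT = 10.0   # decoders typically scale the LM score; this value is an assumption

def total_score(hyp):
    words, log_p_a_given_w, log_p_w = hyp
    # arg max_W P(A|W) * P(W), computed as a sum of log-probabilities
    return log_p_a_given_w + LM_WEIGHT * log_p_w

best = max(n_best, key=total_score)
print(best[0])   # the single best recognition candidate
```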
[0006] Recently, ASR technology has advanced enough to have applications that are implemented on the limited footprint of a mobile device. This can involve a somewhat limited stand-alone ASR arrangement on the mobile device, or more extensive capability can be provided in a client-server arrangement in which the local mobile device does initial processing of speech inputs, and possibly some local ASR recognition processing, but the main ASR processing is performed at a remote server with greater resources, after which the recognition results are returned for use at the mobile device.
[0007] U.S. Patent Publication 20110054899 describes a hybrid client-server ASR arrangement for a mobile device in which speech recognition may be performed locally by the device and/or remotely by a remote ASR server depending on one or more criteria such as time, policy, confidence score, network availability, and the like. Fig. 1A shows an example screen shot of the initial prompt interface from one such mobile device ASR application, Dragon Dictation for iPhone, which processes unprompted speech inputs and produces representative text output. Fig. 1B shows a screen shot of the recording interface for Dragon Dictation for iPhone. Fig. 1C shows an example screen shot of the results interface produced for the ASR results by Dragon Dictation for iPhone.
[0008] One of the challenges of server-side ASR is the requirement of low-latency response, even at first use when the ASR engine has no prior knowledge of the speaker, the audio channel and the noise environment. The ASR engine can derive noise, channel and speaker characteristics from the incoming signal only once the user starts speaking to the mobile device and the audio is transmitted to the server. Due to latency constraints, this derivation can only be implemented as an incremental, online process, which strongly limits the options.
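One generic way to respect such a latency constraint is to update channel statistics incrementally as feature frames arrive, rather than waiting for the whole utterance. The sketch below maintains a running per-dimension cepstral mean and variance using Welford's algorithm; it is a generic online-normalization sketch under assumed frame dimensions, not the adaptation algorithm of any particular ASR engine.

```python
import numpy as np

class OnlineChannelStats:
    """Running per-dimension mean/variance over streamed feature frames (Welford)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)

    def update(self, frame):
        self.n += 1
        delta = frame - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (frame - self.mean)

    def normalize(self, frame):
        var = self.m2 / max(self.n - 1, 1)
        return (frame - self.mean) / np.sqrt(var + 1e-8)

# Each frame is normalized with whatever statistics have accumulated so far,
# so no look-ahead (and hence no added latency) is required.
stats = OnlineChannelStats(dim=39)
for frame in np.random.randn(200, 39):    # stand-in for streamed MFCC frames
    normalized = stats.normalize(frame)
    stats.update(frame)
```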
[0009] This is a particular challenge both at first use and at later stages of use. Once audio has been collected from the specific device and speaker, only limited information can be drawn from past utterances, since noise and channel conditions are subject to change over time and previous usage might have been in a different acoustic environment. In addition, the speaker using the mobile device cannot necessarily be assumed to be the same as in a previous utterance.
SUMMARY
[0010] Embodiments of the present invention are directed to a mobile device which is adapted for automatic speech recognition (ASR). A user interface for interaction with a user includes an input microphone for obtaining speech inputs from the user for automatic speech recognition, and an output interface for system output to the user based on ASR results that correspond to the speech input. A local controller obtains a sample of non-ASR audio from the input microphone (i.e., audio fetched at the mobile device outside the ASR interaction process) for ASR-adaptation to channel-specific ASR characteristics, and then provides a representation of the non-ASR audio to a remote ASR server for server-side adaptation to the channel-specific ASR characteristics, and then provides a representation of an unknown speech input from the input microphone to the remote ASR server for determining ASR results corresponding to the unknown speech input, and then provides the system output to the output interface.
[0011] The non-ASR audio may include audio sampled by the input microphone during a rolling sample window before the unknown speech input. In addition or alternatively, the non-ASR audio may include non-ASR speech audio sampled by the input microphone before the unknown speech input, and the non-ASR speech audio may be limited to speech data sampled from short data windows.
[0012] The representation of the non-ASR audio may include pre-processed ASR adaptation data produced by the mobile device from the non-ASR audio, for example, at least one of background noise model data and ASR acoustic model adaptation data. The representation of the non-ASR audio may be limited to speech feature data.
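A hedged sketch of what a "representation of the non-ASR audio" might look like when it is limited to pre-processed adaptation data (background noise model data, acoustic adaptation statistics, or speech feature summaries) rather than raw waveforms. The field names, the JSON encoding, and the choice of statistics are all assumptions for illustration; the patent does not define a payload format.

```python
import json
import numpy as np

def build_adaptation_payload(noise_frames, speech_features):
    """Summarize non-ASR audio as adaptation data instead of raw waveforms.

    noise_frames:    power-spectrum frames captured while nobody is speaking
    speech_features: mean-normalized MFCC frames from earlier non-ASR speech
    Both inputs and all field names are hypothetical.
    """
    payload = {
        "background_noise_model": {
            "mean_power_spectrum": np.mean(noise_frames, axis=0).tolist(),
        },
        "acoustic_adaptation_stats": {
            "feature_mean": np.mean(speech_features, axis=0).tolist(),
            "feature_variance": np.var(speech_features, axis=0).tolist(),
            "frame_count": int(speech_features.shape[0]),
        },
    }
    return json.dumps(payload)
```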
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] Figures 1A-1C show various example screen shots from a hybrid ASR application for a mobile device.
[0014] Figure 2 shows various elements in a hybrid ASR arrangement according to an embodiment of the present invention.
[0015] Figure 3 shows various functional steps in a hybrid ASR arrangement according to an embodiment of the present invention.
DETAILED DESCRIPTION
[0016] Embodiments of the present invention are directed to a mobile device using a client-server ASR arrangement. There is adaptation to channel-specific characteristics by capturing, transmitting and exploiting non-ASR audio (i.e., audio fetched at the mobile device outside the ASR interaction process). This enables the server ASR engine to establish speaker, channel and environment information even before first ASR use, and it allows keeping this information up-to-date. In addition, there are various options for handling potential privacy concerns.

[0017] It is common practice in server-side ASR to exploit speech data received from a given specific device for recognition performance improvement for future utterances received from the same device. The underlying assumption is that speech data coming from a given specific device is correlated in terms of channel, acoustic environment and speaker. It also is common practice to exploit text data sources available on the mobile device for ASR accuracy improvement. For example, the ASR arrangement has a reasonable chance of getting orthography variants such as Jon or John correct only when making use of personal contact names as stored on the device. Embodiments of the present invention exploit other channel-specific audio sources by transmitting non-ASR audio data captured by the mobile device outside the pure ASR event, and using this audio data for adaptation of the ASR engine to the specific device, environment and speaker.
[0018] Figure 2 shows various elements in an ASR arrangement according to an embodiment of the present invention. A user interface 201 on mobile device 200 for interaction with a user includes an input microphone for obtaining speech inputs from the user for automatic speech recognition. A local controller 202 passes representations of unknown speech inputs over a communications network 204 (such as the Internet) to a remote ASR server 205 that uses server data sources 206 to determine ASR results corresponding to the unknown ASR speech input. The ASR results are then sent back over the communications network 204 to the local controller 202 at the mobile device 200 which sends a system output based on the ASR results to the user interface 201 for output to the user. Specifically, the system output at the user interface 201 may be recognition text or audio that corresponds to the speech input and/or a next dialog prompt in an automated dialog process based on the ASR results.
[0019] Figure 3 shows various functional steps in an ASR arrangement according to an embodiment of the present invention. The local controller 202 obtains a sample of non-ASR audio, step 301, either directly from the input microphone of the user interface 201 or from non-ASR audio memory 203 on the mobile device 200. The local controller 202 provides a representation of the non-ASR audio to the remote ASR server 205, step 302, for server-side adaptation to the channel-specific ASR characteristics. This allows the system to then operate in conventional ASR fashion, providing representations of unknown ASR speech inputs to the remote ASR server 205, step 303, and then providing the ASR results received from the remote ASR server 205 to the user interface 201 for display to the user, step 304.
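The controller steps 301-304 could be orchestrated on the device roughly as below. The HTTP transport, the server endpoint names, and the response fields are assumptions made only for illustration, since the patent does not define a wire protocol between the mobile device and the remote ASR server.

```python
import requests   # assumed HTTP transport; the patent does not specify one

ASR_SERVER = "https://asr.example.com"   # hypothetical remote ASR server

def run_asr_interaction(non_asr_audio_repr, asr_audio, device_id):
    # Step 301 (not shown): the non-ASR audio representation has already been
    # obtained from the microphone or from non-ASR audio memory on the device.

    # Step 302: send the non-ASR representation for server-side adaptation.
    requests.post(f"{ASR_SERVER}/adapt",
                  json={"device_id": device_id, "adaptation": non_asr_audio_repr})

    # Step 303: send the unknown ASR speech input and obtain recognition results.
    resp = requests.post(f"{ASR_SERVER}/recognize",
                         params={"device_id": device_id},
                         data=asr_audio,
                         headers={"Content-Type": "application/octet-stream"})
    results = resp.json()

    # Step 304: hand the ASR results back to the user interface layer.
    return results.get("text", "")
```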
[0020] In further regard to the non-ASR audio, two general approaches come to mind. In a first approach, the mobile device can capture a few seconds of audio that is representative of the surrounding noise environment as sensed by the input microphone shortly before the user is expected to use the server-side ASR; for example, by recording a rolling window of audio before the user activates the ASR input. Such pre-ASR audio can be captured when starting up the speech application, and it can be assumed representative of the speaker and channel in anticipation of the foreground ASR audio to come. A pre-ASR sample of the audio environment can be used to derive a background noise model for de-noising, estimates for device gain settings and signal-to-noise ratio considerations, ASR acoustic model adaptation on the server side, and decoder parameterization selection. To save on transmission cost, the pre-ASR audio sample need not be transmitted to the server in its entirety. It could be sent to the ASR server as-is (encoded, with an indication of when the button was actually pressed in order to bias any end-pointer and ASR acoustic model toward silence in that section), or it can be pre-processed by the mobile device to derive sufficient audio statistics (e.g., channel mean, channel variance, etc.) so as to reduce processing load and latency on the server side.
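A minimal sketch of the rolling pre-ASR window itself: the device continuously retains only the most recent few seconds of microphone audio, and a snapshot of that buffer is taken at the moment the user activates the ASR input. The buffer length, block size, and sample rate are illustrative assumptions, not values given in the patent.

```python
from collections import deque
import numpy as np

SAMPLE_RATE = 16000       # assumed microphone sample rate
WINDOW_SECONDS = 3        # keep roughly the last 3 s of ambient audio
BLOCK = 160               # 10 ms blocks at 16 kHz

class RollingPreAsrBuffer:
    def __init__(self):
        self.blocks = deque(maxlen=(WINDOW_SECONDS * SAMPLE_RATE) // BLOCK)

    def feed(self, audio_block):
        """Called continuously with each 10 ms block from the input microphone."""
        self.blocks.append(np.asarray(audio_block, dtype=np.float32))

    def snapshot(self):
        """Taken when the user activates the ASR input (e.g., presses the button)."""
        return np.concatenate(self.blocks) if self.blocks else np.zeros(0)
```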
[0021] In a different approach, the non-ASR audio can be obtained from sampling audio transmissions by the mobile device in non-ASR speech use cases, most prominently when using the mobile device as a telephone for human-to-human conversation. Audio captured along these lines also can be assumed representative of the speaker and channel, and its exploitation in channel and speaker adaptation is straightforward. This could trigger ASR speaker adaptation even before a specific user has addressed a first utterance to the server-based ASR service. This would be especially useful for handling the long tail of goat speakers (e.g., strongly accented speech) for whom speaker-adapted ASR is often necessary to be usable in any meaningful way.

[0022] Capturing audio data for ASR improvement outside the ASR cycle and transmitting it to the ASR server could potentially raise some privacy concerns. This should be a lesser concern for the first sampling approach above, which captures a short audio sample (presumably non-speech). But the concern is more legitimate for the second approach above, which captures true speech audio to prepare the ASR server for the specific speaker, device and channel as described above. Such concerns can be alleviated by one or more of:
• Sending only adaptation statistics, adapted models, or estimated speaker characteristics from the mobile device, rather than audio waveforms as such.
• Sending only speech features from the mobile device (e.g., mean-normalized MFCCs, bottle-neck features, etc.).
• Sample shredding: sampling speech data from short data windows, such as one second out of every ten (see the sketch following this list).
• Very short data retention on the ASR server: establishing adaptation statistics on the ASR server immediately and then deleting the audio without delay.
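The sample-shredding option could be as simple as keeping only one short slice out of every longer interval before any audio leaves the device. The one-in-ten ratio below follows the example in the text; the function name, the sample rate, and the numpy-array input are assumptions for illustration.

```python
import numpy as np

def shred_samples(audio, sample_rate=16000, keep_seconds=1, every_seconds=10):
    """Keep only `keep_seconds` of audio out of every `every_seconds` interval."""
    interval = every_seconds * sample_rate
    keep = keep_seconds * sample_rate
    kept = []
    for start in range(0, len(audio), interval):
        kept.append(audio[start:start + keep])
    return np.concatenate(kept) if kept else audio[:0]
```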
[0023] Embodiments of the invention may be implemented in part in any conventional computer programming language such as VHDL, SystemC, Verilog, ASM, etc.
Alternative embodiments of the invention may be implemented as pre-programmed hardware elements, other related components, or as a combination of hardware and software components.
[0024] Embodiments can be implemented in part as a computer program product for use with a computer system. Such implementation may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a computer system, via a modem or other interface device, such as a communications adapter connected to a network over a medium. The medium may be either a tangible medium (e.g., optical or analog communications lines) or a medium implemented with wireless techniques (e.g., microwave, infrared or other transmission techniques). The series of computer instructions embodies all or part of the functionality previously described herein with respect to the system. Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Furthermore, such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies. It is expected that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the network (e.g., the Internet or World Wide Web). Of course, some embodiments of the invention may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the invention are implemented as entirely hardware, or entirely software (e.g., a computer program product).
[0025] Although various exemplary embodiments of the invention have been disclosed, it should be apparent to those skilled in the art that various changes and modifications can be made which will achieve some of the advantages of the invention without departing from the true scope of the invention.

Claims

CLAIMS
We claim:
1. A mobile device adapted for automatic speech recognition (ASR) and employing at least one hardware implemented computer processor, the device comprising:
a user interface for interaction with a user and including:
a. an input microphone for obtaining speech inputs from the user for automatic speech recognition, and
b. an output interface for providing a system output to the user based on ASR results that correspond to the speech input; and
a local controller for:
a. obtaining a sample of non-ASR audio from the input microphone for ASR-adaptation to channel-specific ASR characteristics, and then
b. providing a representation of the non-ASR audio to a remote ASR server for server-side adaptation to the channel-specific ASR characteristics, and then
c. providing a representation of an unknown ASR speech input from the input microphone to the remote ASR server for determining ASR results corresponding to the unknown ASR speech input, and then
d. providing the system output to the output interface.
2. The mobile device according to claim 1, wherein the non-ASR audio includes audio sampled by the input microphone during a rolling sample window before the unknown ASR speech input.
3. The mobile device according to claim 1, wherein the non-ASR audio includes non-ASR speech audio sampled by the input microphone before the unknown ASR speech input.
4. The mobile device according to claim 3, wherein the non-ASR speech audio is limited to speech data sampled from short data windows.
5. The mobile device according to claim 1, wherein the representation of the non-ASR audio includes pre-processed ASR adaptation data produced by the mobile device from the non-ASR audio.
6. The mobile device according to claim 5, wherein the ASR adaptation data includes at least one of background noise model data and ASR acoustic model adaptation data.
7. The mobile device according to claim 1, wherein the representation of the non-ASR audio is limited to speech feature data.
8. A computer-implemented method employing at least one hardware implemented computer processor for automatic speech recognition (ASR) on a mobile device, the method comprising:
obtaining a sample of non-ASR audio from an input microphone on the mobile device for ASR-adaptation to channel-specific ASR characteristics, and then
providing a representation of the non-ASR audio to a remote ASR server for server-side adaptation to the channel-specific ASR characteristics, and then
providing a representation of an unknown ASR speech input from the input microphone to the remote ASR server for determining ASR results corresponding to the unknown ASR speech input, and then
providing a system output to the user at the mobile device based on the ASR results.
9. The method according to claim 8, wherein the non-ASR audio includes audio sampled by the input microphone during a rolling sample window before the unknown ASR speech input.
10. The method according to claim 8, wherein the non-ASR audio includes non-ASR speech audio sampled by the input microphone before the unknown ASR speech input.
11. The method according to claim 10, wherein the non-ASR speech audio is limited to speech data sampled from short data windows.
12. The method according to claim 8, wherein the representation of the non-ASR audio includes pre-processed ASR adaptation data produced by the mobile device from the non-ASR audio.
13. The method according to claim 12, wherein the ASR adaptation data includes at least one of background noise model data and ASR acoustic model adaptation data.
14. The method according to claim 8, wherein the representation of the non-ASR audio is limited to speech feature data.
15. A computer program product encoded in a non-transitory computer-readable medium for automatic speech recognition (ASR) on a mobile device, the product comprising:
program code for obtaining a sample of non-ASR audio from an input microphone on the mobile device for ASR-adaptation to channel-specific ASR characteristics, and then
program code for providing a representation of the non-ASR audio to a remote ASR server for server-side adaptation to the channel-specific ASR characteristics, and then
program code for providing a representation of an unknown ASR speech input from the input microphone to the remote ASR server for determining ASR results corresponding to the unknown ASR speech input, and then
program code for providing a system output to the user at the mobile device based on the ASR results.
16. The product according to claim 15, wherein the non-ASR audio includes audio sampled by the input microphone during a rolling sample window before the unknown ASR speech input.
17. The product according to claim 15, wherein the non-ASR audio includes non-ASR speech audio sampled by the input microphone before the unknown ASR speech input.
18. The product according to claim 17, wherein the non-ASR speech audio is limited to speech data sampled from short data windows.
19. The product according to claim 15, wherein the representation of the non-ASR audio includes pre-processed ASR adaptation data produced by the mobile device from the non-ASR audio.
20. The product according to claim 19, wherein the ASR adaptation data includes at least one of background noise model data and ASR acoustic model adaptation data.
21. The product according to claim 15, wherein the representation of the non-ASR audio is limited to speech feature data.
PCT/US2013/028288 2013-02-28 2013-02-28 Server-side asr adaptation to speaker, device and noise condition via non-asr audio transmission WO2014133525A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/US2013/028288 WO2014133525A1 (en) 2013-02-28 2013-02-28 Server-side asr adaptation to speaker, device and noise condition via non-asr audio transmission
US14/770,371 US9679560B2 (en) 2013-02-28 2013-02-28 Server-side ASR adaptation to speaker, device and noise condition via non-ASR audio transmission
US15/619,877 US10229701B2 (en) 2013-02-28 2017-06-12 Server-side ASR adaptation to speaker, device and noise condition via non-ASR audio transmission

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2013/028288 WO2014133525A1 (en) 2013-02-28 2013-02-28 Server-side asr adaptation to speaker, device and noise condition via non-asr audio transmission

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US14/770,371 A-371-Of-International US9679560B2 (en) 2013-02-28 2013-02-28 Server-side ASR adaptation to speaker, device and noise condition via non-ASR audio transmission
US15/619,877 Continuation US10229701B2 (en) 2013-02-28 2017-06-12 Server-side ASR adaptation to speaker, device and noise condition via non-ASR audio transmission

Publications (2)

Publication Number Publication Date
WO2014133525A1 true WO2014133525A1 (en) 2014-09-04
WO2014133525A8 WO2014133525A8 (en) 2015-09-17

Family

ID=51428642

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2013/028288 WO2014133525A1 (en) 2013-02-28 2013-02-28 Server-side asr adaptation to speaker, device and noise condition via non-asr audio transmission

Country Status (2)

Country Link
US (1) US9679560B2 (en)
WO (1) WO2014133525A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017003579A1 (en) * 2015-06-29 2017-01-05 Google Inc. Privacy-preserving training corpus selection
EP3309780A1 (en) * 2016-09-27 2018-04-18 Vocollect, Inc. Utilization of location and environment to improve recognition

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9536527B1 (en) * 2015-06-30 2017-01-03 Amazon Technologies, Inc. Reporting operational metrics in speech-based systems
US9761227B1 (en) * 2016-05-26 2017-09-12 Nuance Communications, Inc. Method and system for hybrid decoding for enhanced end-user privacy and low latency
US10885919B2 (en) * 2018-01-05 2021-01-05 Nuance Communications, Inc. Routing system and method
US10777203B1 (en) * 2018-03-23 2020-09-15 Amazon Technologies, Inc. Speech interface device with caching component
US11289086B2 (en) * 2019-11-01 2022-03-29 Microsoft Technology Licensing, Llc Selective response rendering for virtual assistants
US11676586B2 (en) * 2019-12-10 2023-06-13 Rovi Guides, Inc. Systems and methods for providing voice command recommendations
CN114374912B (en) * 2021-12-10 2023-01-06 北京百度网讯科技有限公司 Voice input method, device, electronic equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7024359B2 (en) * 2001-01-31 2006-04-04 Qualcomm Incorporated Distributed voice recognition system using acoustic feature vector modification
US20090323925A1 (en) * 2008-06-26 2009-12-31 Embarq Holdings Company, Llc System and Method for Telephone Based Noise Cancellation
US20110087491A1 (en) * 2009-10-14 2011-04-14 Andreas Wittenstein Method and system for efficient management of speech transcribers
US20110257974A1 (en) * 2010-04-14 2011-10-20 Google Inc. Geotagged environmental audio for enhanced speech recognition accuracy

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5794189A (en) 1995-11-13 1998-08-11 Dragon Systems, Inc. Continuous speech recognition
US6167377A (en) 1997-03-28 2000-12-26 Dragon Systems, Inc. Speech recognition language models
US20110054899A1 (en) 2007-03-07 2011-03-03 Phillips Michael S Command and control utilizing content information in a mobile voice-to-speech application

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7024359B2 (en) * 2001-01-31 2006-04-04 Qualcomm Incorporated Distributed voice recognition system using acoustic feature vector modification
US20090323925A1 (en) * 2008-06-26 2009-12-31 Embarq Holdings Company, Llc System and Method for Telephone Based Noise Cancellation
US20110087491A1 (en) * 2009-10-14 2011-04-14 Andreas Wittenstein Method and system for efficient management of speech transcribers
US20110257974A1 (en) * 2010-04-14 2011-10-20 Google Inc. Geotagged environmental audio for enhanced speech recognition accuracy

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9990925B2 (en) 2015-06-29 2018-06-05 Google Llc Privacy-preserving training corpus selection
KR20190071010A (en) * 2015-06-29 2019-06-21 구글 엘엘씨 Privacy-preserving training corpus selection
CN107209842A (en) * 2015-06-29 2017-09-26 谷歌公司 Secret protection training corpus is selected
GB2551917A (en) * 2015-06-29 2018-01-03 Google Inc Privacy-preserving training corpus selection
US9881613B2 (en) 2015-06-29 2018-01-30 Google Llc Privacy-preserving training corpus selection
CN111695146B (en) * 2015-06-29 2023-12-15 谷歌有限责任公司 Privacy preserving training corpus selection
KR20170094415A (en) * 2015-06-29 2017-08-17 구글 인코포레이티드 Privacy-preserving training corpus selection
DE112016000292B4 (en) 2015-06-29 2021-10-07 Google LLC (n.d.Ges.d. Staates Delaware) Method and device for privacy-preserving training corpus selection
WO2017003579A1 (en) * 2015-06-29 2017-01-05 Google Inc. Privacy-preserving training corpus selection
KR101991473B1 (en) * 2015-06-29 2019-09-30 구글 엘엘씨 Privacy-preserving training corpus selection
KR102109876B1 (en) * 2015-06-29 2020-05-28 구글 엘엘씨 Privacy-preserving training corpus selection
CN111695146A (en) * 2015-06-29 2020-09-22 谷歌有限责任公司 Privacy preserving training corpus selection
GB2551917B (en) * 2015-06-29 2021-10-06 Google Llc Privacy-preserving training corpus selection
US10181321B2 (en) 2016-09-27 2019-01-15 Vocollect, Inc. Utilization of location and environment to improve recognition
EP3309780A1 (en) * 2016-09-27 2018-04-18 Vocollect, Inc. Utilization of location and environment to improve recognition

Also Published As

Publication number Publication date
US9679560B2 (en) 2017-06-13
US20160012819A1 (en) 2016-01-14
WO2014133525A8 (en) 2015-09-17

Similar Documents

Publication Publication Date Title
US9679560B2 (en) Server-side ASR adaptation to speaker, device and noise condition via non-ASR audio transmission
US10229701B2 (en) Server-side ASR adaptation to speaker, device and noise condition via non-ASR audio transmission
US10755702B2 (en) Multiple parallel dialogs in smart phone applications
US10482904B1 (en) Context driven device arbitration
US20150279352A1 (en) Hybrid controller for asr
US7941313B2 (en) System and method for transmitting speech activity information ahead of speech features in a distributed voice recognition system
US9940926B2 (en) Rapid speech recognition adaptation using acoustic input
US20170256264A1 (en) System and Method for Performing Dual Mode Speech Recognition
CN108346425B (en) Voice activity detection method and device and voice recognition method and device
CN110047481B (en) Method and apparatus for speech recognition
US20150279351A1 (en) Keyword detection based on acoustic alignment
EP3092639B1 (en) A methodology for enhanced voice search experience
US20060122837A1 (en) Voice interface system and speech recognition method
US8645131B2 (en) Detecting segments of speech from an audio stream
WO2013169232A1 (en) Differential acoustic model representation and linear transform-based adaptation for efficient user profile update techniques in automatic speech recognition
KR20160005050A (en) Adaptive audio frame processing for keyword detection
US20090024390A1 (en) Multi-Class Constrained Maximum Likelihood Linear Regression
US11763801B2 (en) Method and system for outputting target audio, readable storage medium, and electronic device
WO2009104332A1 (en) Speech segmentation system, speech segmentation method, and speech segmentation program
US20180322863A1 (en) Cepstral variance normalization for audio feature extraction
US9159315B1 (en) Environmentally aware speech recognition
US20230386458A1 (en) Pre-wakeword speech processing
CN115881094A (en) Voice command recognition method, device, equipment and storage medium for intelligent elevator

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13876355

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 14770371

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13876355

Country of ref document: EP

Kind code of ref document: A1