US20150279351A1 - Keyword detection based on acoustic alignment - Google Patents

Keyword detection based on acoustic alignment

Info

Publication number
US20150279351A1
Authority
US
United States
Prior art keywords
acoustic
vectors
audio frame
keyword
alignment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/861,020
Inventor
Patrick An Phu Nguyen
Maria Carolina Parada San Martin
Johan Schalkwyk
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC filed Critical Google LLC
Priority to US13/861,020
Assigned to GOOGLE INC. reassignment GOOGLE INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: NGUYEN, PATRICK AN PHU, SAN MARTIN, Maria Carolina Parada, SCHALKWYK, JOHAN
Publication of US20150279351A1
Assigned to GOOGLE LLC reassignment GOOGLE LLC CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: GOOGLE INC.
Status: Abandoned

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 - Speech classification or search
    • G10L2015/088 - Word spotting

Definitions

  • Processors 712 of computing device 700 (FIG. 7) may also be communicatively coupled to a series of modules that implement the method presented in FIG. 6. These modules include front-end feature extraction module 716, which performs as illustrated in FIG. 2, acoustic modeling module 718, which performs as illustrated in FIG. 3, high-level feature extraction module 720, which performs as illustrated in FIG. 4, and output classifier module 722, which performs as illustrated in FIG. 5.
  • Hotword or keyword detection is an important component of many speech recognition applications. For example, when the vocabulary size is limited, or when the task requires activating a device (for example, a phone) by saying a word, keyword detection is applied to classify whether an utterance contains that word or not.
  • The task performed by some embodiments is to detect a single word, for example, “Google,” that will activate a device in standby to perform a task. The device must therefore listen for that word at all times.
  • A common problem in portable devices is limited battery life and computation capability. Because of this, it is important to design a keyword detection system that is both accurate and computationally efficient.
  • This application begins by presenting embodiments that include approaches to recognizing when a mobile device should activate or take other actions in response to receiving a keyword as a voice input. The application then describes how these approaches operate and discusses the advantageous results they provide. These approaches offer the potential to obtain good results while using resources efficiently.
  • Embodiments of the invention and all of the functional operations described in this specification may be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Embodiments of the invention may be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer readable medium for execution by, or to control the operation of, data processing apparatus.
  • The computer readable medium may be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter affecting a machine-readable propagated signal, or a combination of one or more of them.
  • The term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
  • The apparatus may include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
  • A propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus.
  • A computer program (also known as a program, software, software application, script, or code) may be written in any form of programming language, including compiled or interpreted languages, and it may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • A computer program does not necessarily correspond to a file in a file system.
  • A program may be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code).
  • A computer program may be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • The processes and logic flows described in this specification may be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output.
  • The processes and logic flows may also be performed by, and apparatus may also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
  • Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
  • A processor will receive instructions and data from a read only memory or a random access memory or both.
  • The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data.
  • A computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks.
  • A computer need not have such devices.
  • A computer may be embedded in another device, e.g., a tablet computer, a mobile telephone, a personal digital assistant (PDA), a mobile audio player, a Global Positioning System (GPS) receiver, to name just a few.
  • Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.
  • The processor and the memory may be supplemented by, or incorporated in, special purpose logic circuitry.
  • Embodiments of the invention may be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user may provide input to the computer.
  • Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input.
  • Embodiments of the invention may be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the invention, or any combination of one or more such back end, middleware, or front end components.
  • The components of the system may be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
  • The computing system may include clients and servers.
  • A client and server are generally remote from each other and typically interact through a communication network.
  • The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephone Function (AREA)

Abstract

Embodiments pertain to automatic speech recognition in mobile devices to establish the presence of a keyword. An audio waveform is received at a mobile device. Front-end feature extraction is performed on the audio waveform, followed by acoustic modeling, high level feature extraction, and output classification to detect the keyword. Acoustic modeling may use a neural network or Gaussian mixture modeling, and high level feature extraction may be done by aligning the results of the acoustic modeling with expected event vectors that correspond to a keyword.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application claims the benefit of U.S. Provisional Application No. 61/788,749, filed Mar. 15, 2013, U.S. Provisional Application No. 61/786,251, filed Mar. 14, 2013 and U.S. Provisional Application No. 61/739,206, filed Dec. 19, 2012, which are incorporated herein by reference.
  • FIELD
  • This specification describes technologies related to voice recognition.
  • BACKGROUND
  • Automatic speech recognition is an important technology that is used in mobile devices. One task that is a common goal for this technology is to be able to use voice commands to wake up and have basic spoken interactions with the device. For example, it may be desirable to recognize a “hotword” that signals that the mobile device should activate when the mobile device is in a sleep state.
  • SUMMARY
  • The methods and systems described herein provide keyword recognition that is fast and low latency, power efficient, flexible, and optionally speaker adaptive. A designer or user can choose the keywords. Embodiments include various systems directed towards robust and efficient keyword detection.
  • In general, one innovative aspect of the subject matter described in this specification can be embodied in a process that is performed by a data processing apparatus. The process includes receiving a plurality of audio frame vectors that each model an audio waveform during a different period of time, selecting a non-empty subset of the audio frame vectors, obtaining a corresponding non-empty subset of detected acoustic event vectors that results from coding the subset of the audio frame vectors, aligning the detected acoustic event vectors and a set of expected event vectors that correspond to a keyword to generate an output feature vector that characterizes an acoustic match between the detected acoustic event vectors and the expected event vectors, and inputting the output feature vector into a keyword classifier.
  • Other embodiments include corresponding systems, apparatus, and computer programs, configured to perform the actions of the method, encoded on computer storage devices.
  • These and other embodiments may each optionally include one or more of the following features. For instance, the process may include determining, using the keyword classifier, that a keyword was present in the audio waveform during an overall period of time modeled by the audio frame vectors. Embodiments may include embodiments in which the audio frame vectors are coded using a neural network and in which the audio frame vectors are coded using a Gaussian mixture model.
  • After aligning, the system extracts features to characterize the acoustic match, the features comprising one or more of: length of alignment, number of phones aligned, frame distance across phone boundaries, probability of the duration of each phone with respect to average duration of a phone in training data, speaker speaking rate, average acoustic score, worst acoustic score, best acoustic score, standard deviation of acoustic scores, start frame of the alignment, stability of the alignment, binary features representing changes related to the difference between detected acoustic events and expected acoustic events, and binary features representing changes related to the difference between detected acoustic events and acoustic events in an alignment window. The process may also include producing a plurality of audio frame vectors by performing front-end feature extraction on an acoustic signal.
  • Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. Embodiments provide a way to recognize whether or not a keyword was uttered in a way that provides a simple design that can obtain good results while minimizing the need for processing and power resources.
  • The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram 100 that illustrates dataflow in an embodiment.
  • FIG. 2 is a block diagram 200 that illustrates dataflow in a front-end feature extraction process.
  • FIG. 3 is a block diagram 300 that illustrates dataflow in an acoustic modeling process.
  • FIG. 4 is a block diagram 400 that illustrates dataflow in a high-level feature extraction process.
  • FIG. 5 is a block diagram 500 that illustrates dataflow in an output classification process.
  • FIG. 6 is a flowchart 600 of the stages involved in an example process for detecting keyword utterances in an audio waveform.
  • FIG. 7 is a block diagram 700 of an example system that can detect keyword utterances in an audio waveform.
  • Like reference numbers and designations in the various drawings indicate like elements.
  • DETAILED DESCRIPTION
  • When using a mobile device, it is desirable to provide a way of turning the device on or performing another action based on the utterance of a keyword. For example, if a user says “Google,” it may cause a smartphone to activate. However, it requires power to constantly monitor and process the audio received by the mobile device, and hence it is important to provide an approach for recognizing whether or not the keyword has been uttered while minimizing the power consumption needed to “listen” for the keyword.
  • Embodiments may listen for keywords while minimizing resource usage through a variety of approaches. For example, a variety of acoustic modeling techniques may be used to obtain feature vectors that represent audio received at the mobile device. However, another aspect of embodiments is that certain embodiments may use a high-level feature extraction module based on acoustic match and alignment. The input features obtained from a front-end feature extraction module are converted into detected acoustic events in real-time. Embodiments operate by finding an alignment of the detected acoustic events with expected acoustic events that would signify the presence of the keyword. The expected acoustic events represent a standard dictionary pronunciation for the keyword of interest. After aligning the events, embodiments are able to extract features to characterize the acoustic match, which will be described in greater detail, below. However, some implementations only extract features when an initial alignment is found, thereby reducing high-level feature computation.
  • While much of this specification discusses an implementation that detects a single keyword, implementations are not necessarily limited to detecting one keyword. In fact, some implementations may be used to detect a plurality of keywords, and the keywords may also be short phrases. Such implementations allow a user to select one of a certain number of actions, such as actions presented in a menu, by saying one of the menu entries. For example, implementations may use different keywords to trigger different actions such as taking a photo, sending an email, recording a note, and so on. This technology may be used for a finite number of words and/or phrases to be detected, ordinarily not exceeding 20 or so, and other implementations may be adapted to handle more words and/or phrases if required.
  • At a high level, one system embodiment comprises four modules. Module 1 is a front-end feature extraction module, which performs: a) speech activity detection; b) windowing of the acoustic signal; c) short-term Fourier transform; d) spectral subtraction, optionally; e) filter bank extraction; and f) log-energy transform of the filtered output. Module 2 is an acoustic model, which can be either: a) a neural network, possibly truncated of its last layers; or b) a Gaussian mixture model (GMM). In module 2, the input features may be converted into acoustic events by forward-propagation through the neural network (NN). If a Gaussian mixture model is used, it may provide a probabilistic model for representing the presence of subpopulations within an overall population in order to code the acoustic events. Module 3 is a high-level feature extraction module based on acoustic match/alignment. As discussed above, Module 3 finds an alignment of detected acoustic events with expected acoustic events. Module 3 also extracts certain information once an alignment has been found to characterize the match. Module 4 is an output classifier, which takes as input the output feature vector from module 3 and possibly some side information to yield a binary decision about the presence of the keyword. The output classifier can be, for example: a) a support vector machine; or b) a logistic regression.
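  • For illustration only, the four-module structure described above can be summarized as a set of interfaces. The sketch below is a minimal outline under assumed names; none of the classes or method signatures come from the patent.
    # Illustrative interfaces for the four modules described above. All
    # class and method names are hypothetical, not taken from the patent.
    from typing import List, Optional
    import numpy as np

    class FrontEnd:                    # Module 1: front-end feature extraction
        def extract(self, waveform: np.ndarray) -> np.ndarray:
            """Return log filter-bank frames, shape (num_frames, dims)."""
            raise NotImplementedError

    class AcousticModel:               # Module 2: neural network or GMM coder
        def code(self, frame_stack: np.ndarray) -> np.ndarray:
            """Return one detected-acoustic-event vector for a frame stack."""
            raise NotImplementedError

    class HighLevelFeatures:           # Module 3: acoustic match / alignment
        def align_and_extract(self, events: List[np.ndarray],
                              expected_phones: List[str]) -> Optional[np.ndarray]:
            """Align detected events to the keyword's expected events and return
            an output feature vector, or None if no alignment is found."""
            raise NotImplementedError

    class KeywordClassifier:           # Module 4: SVM or logistic regression
        def decide(self, output_vector: np.ndarray) -> bool:
            """Binary decision: was the keyword present?"""
            raise NotImplementedError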
  • Various embodiments will now be discussed in connection with the drawings to explain their operation.
  • FIG. 1 is a block diagram 100 that illustrates dataflow in an embodiment. The data flow begins with an audio waveform 102. Audio waveform 102 represents audio received by an embodiment. For example, audio waveform 102 may be an analog or digital representation of sound in the environment of an embodiment that is captured by a microphone. Once audio waveform 102 is introduced into the embodiment, it is sent to front-end feature extraction module 104. Front-end feature extraction module 104 performs a series of stages, detailed in FIG. 2, that take audio waveform 102 and transform it into a series of vectors for further processing. Once front-end feature extraction module 104 has done the processing of audio waveform 102, its output is sent to acoustic modeling module 106. Acoustic modeling module 106 may use a variety of techniques, detailed in FIG. 3, to perform coding on the inputs to produce acoustic event vectors that are representative of features of audio waveform 102 over a period of time. The acoustic event vectors from acoustic modeling module are sent to a high-level feature extraction module 108 that finds an alignment for the acoustic event vectors, as detailed in FIG. 4, to further analyze characteristics of audio waveform 102 over a time interval to provide information that can be used to produce an output feature vector to detect if the keyword was uttered. After the acoustic event vectors are aligned, the output feature vector is sent to output classifier module 110 to make a determination about whether the keyword is present, as is discussed in FIG. 5.
  • Various system embodiments are similar in their overall structure. They include modules that use similar architectures to accomplish similar goals: 1) a front-end feature extraction module, 2) an acoustic model, 3) a higher-level feature extraction module, and 4) a classifier module. However, the embodiments differ in certain respects.
  • Embodiments approach the problem of keyword detection in advantageous ways. For example, one embodiment has the advantage that it only extracts features when a first-level alignment is found, reducing high-level feature computation. The approaches used in these systems are also advantageous because changing the keywords to be matched, or adapting to a given speaker's voice, only requires adapting a few parameters.
  • FIG. 2 is a block diagram 200 that illustrates dataflow in a front-end feature extraction process. Audio waveform 102, as illustrated in FIG. 2, includes analog and/or digital information about incoming sound that an embodiment can analyze to detect the presence of a keyword. One way to capture audio waveform 102 for analysis is to divide it up into a plurality of analysis windows. For example, FIG. 2 shows an analysis window 204 that uses a vector to represent audio waveform 102 over a time period that is chosen as the size of analysis window 204, for example a 25 ms time period. Multiple analysis windows are obtained in succession by performing an analysis window shift 206, for example a 10 ms time period. Analysis windows may be chosen to overlap. For example, one analysis window may represent audio waveform 102 from a start time of 0 ms to an end time of 25 ms, and a subsequent analysis window may represent audio waveform 102 from a start time of 10 ms to an end time of 35 ms.
  • The analysis windows 204 are obtained as part of speech activity detection 210, in which an embodiment obtains information about available sound in its environment. Speech activity detection 210 may be designed to occur regardless of whether there is sound in the surroundings of an embodiment, or it may, for example, occur only when a volume of sound greater than a threshold volume is received. Once speech activity detection 210 occurs, it is followed by windowing of the acoustic signal 220. As discussed, each window should be a fairly short time interval, such as 25 ms, that represents characteristics of audio waveform 102 over that time interval. After windowing, embodiments may perform a fast Fourier transform 230 on the windowed data so as to analyze the constituent frequencies present in the audio waveform. Additionally, embodiments may optionally perform spectral subtraction 240 to minimize the effects of noise on the information provided by the other steps. Next, filter bank extraction 250 can allow the decomposition of the information from the previous steps by using filters to separate individual components of the audio data from one another. Finally, performance of a log-energy transform 260 can help normalize the data in order to make it more meaningful.
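  • A minimal sketch of this front end follows, assuming a 16 kHz sample rate and the 25 ms window and 10 ms shift described above. The crude band-averaging filter bank stands in for mel-spaced filters, and speech activity detection and spectral subtraction are omitted; it is illustrative only, not the patent's implementation.
    # Minimal front-end sketch: windowing, short-term Fourier transform,
    # filter bank extraction, and log-energy transform. The band-averaging
    # "filter bank" is an assumption standing in for mel-spaced filters.
    import numpy as np

    def frame_signal(x, sample_rate=16000, win_ms=25, shift_ms=10):
        win = int(sample_rate * win_ms / 1000)
        hop = int(sample_rate * shift_ms / 1000)
        n = (len(x) - win) // hop + 1                        # number of full 25 ms windows
        frames = np.stack([x[i * hop:i * hop + win] for i in range(n)])
        return frames * np.hamming(win)                      # windowing of the acoustic signal

    def log_filterbank(frames, n_filters=40, n_fft=512):
        spectrum = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # short-term Fourier transform
        bands = np.array_split(np.arange(spectrum.shape[1]), n_filters)
        energies = np.stack([spectrum[:, b].mean(axis=1) for b in bands], axis=1)
        return np.log(energies + 1e-10)                      # filter bank extraction + log-energy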
  • The result of the processing performed in FIG. 2 is a moving window of a stack of frames 270. For example, stack of frames 270 may include 11 frames, each including information about 25 ms of audio waveform 102, with a shift of 10 ms between frames. However, it is not necessary to use a stack of 11 frames, and stack of frames 270 may include as few as 2 frames or any larger number of frames. The end output of front-end feature extraction 200 is thus a stack of a plurality of frames 280 that represents features of audio waveform 102 by performing the aforementioned analytical techniques to obtain information about characteristics of the audio waveform 102 for successive time intervals.
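  • The moving window of stacked frames can be sketched as follows, using the 11-frame stack size only as an example; the helper name is hypothetical.
    # Sketch of building the moving window of stacked frames: consecutive
    # front-end feature frames concatenated into one vector per stack.
    import numpy as np

    def stack_frames(features, stack_size=11):
        """features: (num_frames, dims) -> (num_stacks, stack_size * dims)."""
        num_frames, dims = features.shape
        stacks = [features[i:i + stack_size].reshape(-1)
                  for i in range(num_frames - stack_size + 1)]
        return np.stack(stacks) if stacks else np.empty((0, stack_size * dims))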
  • FIG. 3 is a block diagram 300 that illustrates dataflow in an acoustic modeling process. FIG. 3 begins with the stack of the plurality of frames 280 produced by the process depicted in FIG. 2. FIG. 3 includes coding 310 of the plurality of frames to produce coded acoustic events for each stack 320. Two ways in which this coding may occur are shown in FIG. 3: a neural network 310A or a Gaussian mixture model 310B. If a neural network 310A is used, the neural network may possibly be truncated of its last layers. The goal of this coding is to produce coded acoustic events 320: single vectors that represent pluralities of initial vectors in the stack of the plurality of frames 280, carrying salient information about features of the audio waveform 102 over the interval that they model. For example, the input features may be converted into acoustic events by coding through the neural network 310A or the Gaussian mixture model 310B.
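  • As an illustration of the coding step, the sketch below forward-propagates a stacked frame vector through a small feed-forward network whose final layers have been removed. The layer sizes and random weights are placeholders, and a GMM coder could fill the same role by returning per-component log-likelihoods instead.
    # Sketch of coding a stacked frame vector with a truncated feed-forward
    # network, so the output is an acoustic-event vector rather than full
    # posteriors. All sizes and weights are placeholders.
    import numpy as np

    class TruncatedNN:
        def __init__(self, input_dim, hidden_dim=128, event_dim=40, seed=0):
            rng = np.random.default_rng(seed)
            self.w1 = rng.normal(0.0, 0.05, (input_dim, hidden_dim))
            self.b1 = np.zeros(hidden_dim)
            self.w2 = rng.normal(0.0, 0.05, (hidden_dim, event_dim))
            self.b2 = np.zeros(event_dim)

        def code(self, frame_stack):
            """Forward-propagate one stacked frame vector to an event vector."""
            h = np.maximum(0.0, frame_stack @ self.w1 + self.b1)   # hidden layer
            return h @ self.w2 + self.b2                           # truncated: no final softmax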
  • FIG. 4 is a block diagram 400 that illustrates dataflow in a high-level feature extraction process. FIG. 4 begins with coded acoustic events for each stack 320, as produced in FIG. 3. Coded acoustic events for each stack 320 are aligned with expected event vectors 410. Expected event vectors 410 include phonemes 420A-420D, each of which is associated with a standardized pronunciation for that phoneme. The high-level feature extraction operates by aligning 420 keyword 430 with coded acoustic events for each stack 320 based on detecting a preliminary alignment match with aligned phonemes 422. The alignment produces output feature vector 440. Output vector 440 includes information about the audio waveform 102 that has been distilled and processed so it is easy to draw conclusions about the presence of the keyword in audio waveform 102 over time window 410 that aligning 420 represents.
  • Aligning 420 may be accomplished by decoding with a graph, which automatically force aligns the audio to a keyword, such as “computer” or “google.” Back-epsilon arcs may allow such a graph to restart at any point, avoiding misses when the keyword is spoken while in the middle of the decoding graph. For example, implementations may generate a confusion network of pronunciations for the keyword by running a phone loop decoder on positive examples for the keyword and extract the most frequent pronunciations.
  • Other ways to obtain an alignment are also possible. One way is to extract features in a fixed window: after reaching a stable partial result, force align a phonetic sequence in that window and extract features from that alignment. An alternative is to use an HMM hotword/garbage model, which may use a high bias and may only extract features if a hotword path is successfully decoded. Yet another way is, for positive examples, to force align or manually align phonetic sequences and, for negative examples, to find the alignment whose score satisfies a condition given the current model parameters.
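  • The following sketch illustrates forced alignment in general terms: dynamic programming finds the monotonic segmentation of frames into the keyword's expected phone sequence with the highest total score, where the per-frame phone scores could be derived from the detected acoustic event vectors. It is a simplified stand-in for the decoding-graph and HMM approaches above, not the patent's algorithm.
    # Simplified forced alignment by dynamic programming over per-frame
    # phone scores (higher is better). Assumes at least as many frames as
    # expected phones.
    import numpy as np

    def force_align(scores):
        """scores: (num_frames, num_phones) -> start frame of each phone."""
        T, P = scores.shape
        dp = np.full((T, P), -np.inf)
        back = np.zeros((T, P), dtype=int)      # 0 = stay in same phone, 1 = advance
        dp[0, 0] = scores[0, 0]
        for t in range(1, T):
            for p in range(P):
                stay = dp[t - 1, p]
                advance = dp[t - 1, p - 1] if p > 0 else -np.inf
                if advance > stay:
                    dp[t, p] = advance + scores[t, p]
                    back[t, p] = 1
                else:
                    dp[t, p] = stay + scores[t, p]
        starts, p = [0] * P, P - 1              # trace back the start frame of each phone
        for t in range(T - 1, 0, -1):
            if back[t, p] == 1:
                starts[p], p = t, p - 1
        return starts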
  • As part of the aligning 420, high-level feature extraction 400 extracts features to characterize the quality of the acoustic match. All of these features assume that there is an alignment for both positive and negative examples with respect to the true phonetic sequences p_k for keyword k. The extracted information may include length of alignment, number of phones aligned, frame distance across phone boundaries, probability of the duration of each phone with respect to average duration of a phone in training data, speaker speaking rate, average acoustic score, worst acoustic score, best acoustic score, standard deviation of acoustic scores, start frame of the alignment, stability of the alignment, binary features representing changes related to the difference between detected acoustic events and expected acoustic events, and/or binary features representing changes related to the difference between detected acoustic events and acoustic events in an alignment window.
  • Binary features representing changes related to the difference between detected acoustic events and expected acoustic events may include identity/insertions/deletions of detected acoustic events from a GMM coding process. Binary features representing changes related to the difference between detected acoustic events and acoustic events in an alignment window may include identity/insertions/deletions of detected acoustic events from a neural network coding process.
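  • A possible reading of these binary features is sketched below: per-phone match and deletion flags over the keyword's dictionary pronunciation, and insertion flags over the phone list. The function and argument names are illustrative assumptions.
    # Sketch of binary phone-expectation features: match (expected and
    # observed), deletion (expected but not observed), and insertion
    # (observed but not expected). Names are illustrative only.
    def phone_expectation_features(expected_phones, detected_phones, phone_list):
        detected = set(detected_phones)
        expected = set(expected_phones)
        match = [p in detected for p in expected_phones]        # expected and observed
        delete = [p not in detected for p in expected_phones]   # expected but *not* observed
        insert = [p in detected and p not in expected           # observed but *not* expected
                  for p in phone_list]
        return match, delete, insert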
  • Frame distance may be found, given an identified segmentation or phoneme alignment, by computing the Euclidean distance d between frames at sequential distances from each phoneme boundary. The assumption is that if the hotword was uttered, then the phoneme alignment will be correct, and hence the distance between neighboring frames across phoneme boundaries will be large. If the hotword was not uttered, the phoneme alignment will be incorrect, and hence distance between neighboring frames at phoneme boundaries will be small. Frame distance may be found using Equation 1:
  • $\varphi_j(\bar{x}, \bar{p}, \bar{s}) = \frac{1}{|\bar{p}|} \sum_{l=1}^{|\bar{p}|-1} d\left(x_{n_l - j},\, x_{n_l + j}\right), \quad j \in \{1, 2, 3, 4\}$   (Equation 1)
  • Another feature is phoneme duration score, which computes the probability of the current duration using a Gaussian distribution with a mean and standard deviation equal to that of the average phoneme duration for phonemes encountered in training. Phoneme duration score may be found using Equation 2:
  • $\varphi_6(\bar{x}, \bar{p}, \bar{s}) = \frac{1}{|\bar{p}|} \sum_{l=1}^{|\bar{p}|} \log \mathcal{N}\left(s_{l+1} - s_l;\, \hat{\mu}_{p_l}, \hat{\sigma}_{p_l}\right)$   (Equation 2)
  • Another feature is speaker rate change, which captures local changes in speaking rate, under the assumption that changes in speaking rate should be smooth. It may be found using Equation 3.
  • $\varphi_7(\bar{x}, \bar{p}, \bar{s}) = \frac{1}{|\bar{p}|} \sum_{l=2}^{|\bar{p}|} (r_l - r_{l-1})^2$   (Equation 3)
  • Speaking rate itself may be provided by Equation 4.

  • $r_l = (s_{l+1} - s_l) / \hat{\mu}_{p_l}$   (Equation 4)
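  • The sketch below computes the features of Equations 1 through 4 from an alignment, assuming the alignment is given as phone start frames (plus the end frame of the last phone) and that per-phone duration statistics from training data are available. It is illustrative code under those assumptions, not the patent's implementation.
    # Alignment-quality features from Equations 1-4. Variable names follow
    # the equations; mu and sigma are per-phone duration statistics.
    import numpy as np

    def log_gaussian(x, mu, sigma):
        return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

    def frame_distance(frames, boundaries, j, num_phones):
        """Equation 1: mean distance between frames j steps on either side
        of each internal phone boundary n_l."""
        d = [np.linalg.norm(frames[n + j] - frames[n - j])
             for n in boundaries if j <= n and n + j < len(frames)]
        return sum(d) / num_phones

    def duration_score(starts_with_end, mu, sigma):
        """Equation 2: mean log-probability of each phone duration under a
        Gaussian fitted to training durations."""
        durations = np.diff(starts_with_end)
        return np.mean([log_gaussian(durations[l], mu[l], sigma[l])
                        for l in range(len(durations))])

    def rate_features(starts_with_end, mu):
        """Equations 3 and 4: per-phone speaking rate and its smoothness."""
        r = np.diff(starts_with_end) / np.asarray(mu)       # Equation 4
        return np.sum(np.diff(r) ** 2) / len(r)             # Equation 3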
  • Sample Code 1, below, includes information that might be provided in a data structure that includes information about features of an alignment.
  • Sample Code 1
    message HotwordConfidenceFeature {
      // Description of the first four features in
      // speech/greco3/hotword/feature_extractor.h.
      optional float phone_duration_score = 1 [default = 0.0];
      optional float speaker_rate = 2 [default = 0.0];
      repeated float frame_distance = 3;
      optional float word_duration_frames = 4 [default = 0.0];
      // Baseline system detected hotword.
      optional bool baseline = 6 [default = false];
      // Features inherited from WordConfFeature.
      optional float num_phones = 8 [default = 0.0];
      // From WordConfFeature: word_duration. It corresponds to word duration
      // in frames divided by the number of phones.
      optional float normalized_word_duration = 9 [default = 0.0];
      optional float ascore_mean = 10 [default = 0.0];
      optional float ascore_stddev = 11 [default = 0.0];
      optional float ascore_worst = 12 [default = 0.0];
      optional float ascore_meandiff = 13 [default = 0.0];
      optional float ascore_best = 14 [default = 0.0];
      optional float lm_score = 15 [default = 0.0];
      optional float dur_score = 16 [default = 0.0];
      optional float am_score = 17 [default = 0.0];
      // Start frame of the keyword.
      optional float start_frame = 18 [default = 0.0];
      // Phone expectation match features: u is expected and observed.
      // One feature for each phone in the dictionary pronunciation of the hotword.
      repeated bool ph_expectation_align = 19;
      // Same as ph_expectation_align, but phones are detected from the nn stream.
      repeated bool ph_expectation_nn = 20;
      // Phone expectation delete features: u is expected but *not* observed.
      // One feature per phone in the dictionary pronunciation of the hotword.
      repeated bool ph_expectation_delete_align = 21;
      // Phone expectation insert features: u is *not* expected but observed.
      // One feature per phone in the phone list.
      repeated bool ph_expectation_insert_align = 22;
      // Same as ph_expectation_delete_align and ph_expectation_insert_align, resp.
      repeated bool ph_expectation_delete_nn = 24;
      repeated bool ph_expectation_insert_nn = 25;
      // Stability of the partial result.
      optional float stability = 23;
    }
  • FIG. 5 is a block diagram 500 that illustrates dataflow in an output classification process. FIG. 5 begins with output vector 440, which was produced as illustrated in FIG. 4. Based on output vector 440, FIG. 5 proceeds to classify output 510 using classification module 520. For example, classification module 520 may use support vector machine 520A or logistic regression 520B. The goal of classification module 520 is to make a binary decision about whether the keyword was uttered during time window 410 associated with output vector 440. Classification module 520 produces classification result 530. This may be an actual classification decision 550, i.e., a Boolean decision confirming whether the keyword was present. Alternatively, classification result 530 may be a score, for example one that represents the likelihood that the keyword is present. If classification result 530 is a score, there may be a step to process the result 540 to yield classification decision 550, for example by comparing the score to a threshold value.
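  • As a minimal illustration of classification module 520, the Python sketch below scores output vector 440 with a logistic-regression model and thresholds the score to obtain classification decision 550. The weight vector, bias, threshold, and function names are placeholders that would normally be learned and tuned offline; they are assumptions for illustration only.

    import numpy as np

    def keyword_score(output_vector, weights, bias):
        """Logistic-regression score in [0, 1]: likelihood that the keyword was uttered."""
        return 1.0 / (1.0 + np.exp(-(np.dot(weights, output_vector) + bias)))

    def keyword_decision(output_vector, weights, bias, threshold=0.5):
        """Boolean decision: True when the score meets the decision threshold."""
        return keyword_score(output_vector, weights, bias) >= threshold

    # Illustrative usage with random values standing in for output vector 440.
    rng = np.random.default_rng(0)
    output_vector = rng.normal(size=32)
    weights = rng.normal(size=32)
    print(keyword_score(output_vector, weights, bias=0.0),
          keyword_decision(output_vector, weights, bias=0.0))

  • A support vector machine 520A could be used in place of the logistic regression by scoring with a signed distance to a separating hyperplane rather than a sigmoid, followed by the same thresholding step.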
  • FIG. 6 is a flowchart 600 of the stages involved in an example process for detecting keyword utterances in an audio waveform.
  • In stage 610, audio frame vectors are received. For example, stage 610 may be performed as in FIG. 2, such that front-end feature extraction module 104 processes audio waveform 102 to yield the vectors, which are represented in FIG. 2 as stack of the plurality of frames 280.
  • In stage 620 subsets of vectors are selected. For example stage 620 may be performed as in FIG. 2, such that the processing of audio waveform 102 yields a stack of the plurality of frames 280 that constitutes the subset of vectors.
  • In stage 630, event vectors are obtained by coding. For example, this step is performed by acoustic modeling module 106 as in FIG. 3.
  • In stage 640, the vectors are aligned. For example, this step may occur as aligning 420 as in FIG. 4 by high-level feature extraction module 108.
  • In stage 650, the output vector is input to the classifier. For example, high-level feature extraction module 108 sends its output, output vector 440 to output classifier module 110 to make this determination as in FIG. 5.
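  • The sketch below strings the stages of FIG. 6 together as a simple Python pipeline. The coding and alignment steps are deliberately toy stand-ins (nearest-codebook coding and a match-fraction feature) used only to show how the modules hand data to one another; none of it is the patent's implementation.

    import numpy as np

    def front_end_features(waveform, frame_length=400, hop=160):
        """Stage 610: slice the waveform into frames and compute a toy
        per-frame feature vector (log energies of fixed spectral bands)."""
        frames = [waveform[i:i + frame_length]
                  for i in range(0, len(waveform) - frame_length, hop)]
        feats = []
        for frame in frames:
            bands = np.array_split(np.abs(np.fft.rfft(frame)), 13)
            feats.append(np.log([band.sum() + 1e-8 for band in bands]))
        return np.stack(feats)

    def select_window(frame_vectors, start, size):
        """Stage 620: select the subset of frame vectors in the analysis window."""
        return frame_vectors[start:start + size]

    def code_frames(window, codebook):
        """Stage 630: toy acoustic coding - map each frame to its nearest
        codebook entry (a stand-in for a GMM or neural-network coder)."""
        dists = np.linalg.norm(window[:, None, :] - codebook[None, :, :], axis=-1)
        return dists.argmin(axis=1)

    def align_and_extract(events, expected):
        """Stage 640: toy alignment feature - fraction of expected events found,
        packed into a one-element output feature vector."""
        found = sum(1 for e in expected if e in set(events.tolist()))
        return np.array([found / len(expected)])

    def classify(output_vector, threshold=0.6):
        """Stage 650: threshold the (here one-dimensional) output vector."""
        return bool(output_vector[0] >= threshold)

    # Illustrative run on random samples standing in for audio waveform 102.
    rng = np.random.default_rng(0)
    waveform = rng.normal(size=16000)
    frames = front_end_features(waveform)
    window = select_window(frames, start=10, size=40)
    codebook = rng.normal(size=(8, frames.shape[1]))
    events = code_frames(window, codebook)
    print(classify(align_and_extract(events, expected=[0, 3, 5])))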
  • FIG. 7 is a block diagram 700 of an example system that can detect keyword utterances in an audio waveform. The system contains a variety of constituent parts and modules that may be implemented through appropriate combinations of hardware, firmware, and software, allowing computing device 700 to implement the features described above.
  • Computing device 700 contains one or more processors 712 that may include various hardware devices designed to process data. Processors 712 are communicatively coupled to other parts of computing device 700. For example, processors 712 may be coupled to a speaker 702 and a microphone 704 that allow output and input of audio signals to and from the surroundings of computing device 700. Microphone 704 is of particular importance because it provides the raw signals that capture aspects of audio waveform 102 that are processed in other portions of computing device 700. Additionally, computing device 700 may include persistent memory 706. Persistent memory 706 may include a variety of memory storage devices that allow permanent retention and storage of information manipulated by processors 712. Furthermore, input device 708 allows the receipt of commands from a user, and interface 714 allows computing device 700 to interact with other devices to exchange information. Additionally, processors 712 may be communicatively coupled to a display 710 that provides a graphical representation of information processed by computing device 700 for the user to view.
  • Additionally, processors 712 may be communicatively coupled to a series of modules that perform the functionalities necessary to implement the method of embodiments that is presented in FIG. 6. These modules include front-end feature extraction module 716, which performs as illustrated in FIG. 2, acoustic modeling module 718, which performs as illustrated in FIG. 3, high-level feature extraction module 720, which performs as illustrated in FIG. 4, and output classifier module 722, which performs as illustrated in FIG. 5.
  • As discussed above, the task of hotword or keyword detection is an important component in many speech recognition applications. For example, when the vocabulary size is limited, or when the task requires activating a device, for example a phone, by saying a word, keyword detection is applied to classify whether or not an utterance contains that word.
  • For example, the task performed by some embodiments includes detecting a single word, for example, “Google,” that will activate a device in standby to perform a task. Such a device, therefore, should be listening for that word at all times. However, portable devices commonly have limited battery life and computation capabilities. Because of this, it is important to design a keyword detection system that is both accurate and computationally efficient.
  • This application begins by presenting embodiments, which include approaches to recognizing when a mobile device should activate or take other actions in response to receiving a keyword as a voice input. The application describes how these approaches operate and discusses the advantageous results they provide. These approaches offer the potential to obtain good results while using resources efficiently.
  • A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. For example, various forms of the flows shown above may be used, with steps re-ordered, added, or removed.
  • Embodiments of the invention and all of the functional operations described in this specification may be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the invention may be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer readable medium for execution by, or to control the operation of, data processing apparatus. The computer readable medium may be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter affecting a machine-readable propagated signal, or a combination of one or more of them. The term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus may include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus.
  • A computer program (also known as a program, software, software application, script, or code) may be written in any form of programming language, including compiled or interpreted languages, and it may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program may be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program may be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • The processes and logic flows described in this specification may be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows may also be performed by, and apparatus may also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
  • Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer may be embedded in another device, e.g., a tablet computer, a mobile telephone, a personal digital assistant (PDA), a mobile audio player, a Global Positioning System (GPS) receiver, to name just a few. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in, special purpose logic circuitry.
  • To provide for interaction with a user, embodiments of the invention may be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user may provide input to the computer. Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input.
  • Embodiments of the invention may be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the invention, or any combination of one or more such back end, middleware, or front end components. The components of the system may be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
  • The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • While this specification contains many specifics, these should not be construed as limitations on the scope of the invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of the invention. Certain features that are described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
  • Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems may generally be integrated together in a single software product or packaged into multiple software products.
  • Thus, particular embodiments of the invention have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims may be performed in a different order and still achieve desirable results.

Claims (18)

What is claimed is:
1. A computer-implemented method comprising:
receiving a plurality of audio frame vectors that each model an audio waveform during a different period of time;
selecting a non-empty subset of the audio frame vectors;
obtaining a corresponding non-empty subset of detected acoustic event vectors that results from coding the subset of the audio frame vectors;
aligning the detected acoustic event vectors and a set of expected event vectors that correspond to a keyword to generate an output feature vector that characterizes an acoustic match between the detected acoustic event vectors and the expected event vectors; and
inputting the output feature vector into a keyword classifier.
2. The method of claim 1, further comprising:
determining, using the keyword classifier, that a keyword was present in the audio waveform during an overall period of time modeled by the audio frame vectors.
3. The method of claim 1, wherein the audio frame vectors are coded using a neural network.
4. The method of claim 1, wherein the audio frame vectors are coded using a Gaussian mixture model.
5. The method of claim 1, wherein aligning comprises extracting features to characterize the acoustic match, the features comprising one or more of: length of alignment, number of phones aligned, frame distance across phone boundaries, probability of the duration of each phone with respect to average duration of a phone in training data, speaker speaking rate, average acoustic score, worst acoustic score, best acoustic score, standard deviation of acoustic scores, start frame of the alignment, stability of the alignment, binary features representing changes related to the difference between detected acoustic events and expected acoustic events, and binary features representing changes related to the difference between detected acoustic events and acoustic events in an alignment window.
6. The method of claim 1, further comprising:
producing a plurality of audio frame vectors by performing front-end feature extraction on an acoustic signal.
7. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving a plurality of audio frame vectors that each model an audio waveform during a different period of time;
selecting a non-empty subset of the audio frame vectors;
obtaining a corresponding non-empty subset of detected acoustic event vectors that results from coding the subset of the audio frame vectors;
aligning the detected acoustic event vectors and a set of expected event vectors that correspond to a keyword to generate an output feature vector that characterizes an acoustic match between the detected acoustic event vectors and the expected event vectors; and
inputting the output feature vector into a keyword classifier.
8. The system of claim 7, wherein the operations further comprise:
determining, using the keyword classifier, that a keyword was present in the audio waveform during an overall period of time modeled by the audio frame vectors.
9. The system of claim 7, wherein the audio frame vectors are coded using a neural network.
10. The system of claim 7, wherein the audio frame vectors are coded using a Gaussian mixture model.
11. The system of claim 7, wherein aligning comprises extracting features to characterize the acoustic match, the features comprising one or more of: length of alignment, number of phones aligned, frame distance across phone boundaries, probability of the duration of each phone with respect to average duration of a phone in training data, speaker speaking rate, average acoustic score, worst acoustic score, best acoustic score, standard deviation of acoustic scores, start frame of the alignment, stability of the alignment, binary features representing changes related to the difference between detected acoustic events and expected acoustic events, and binary features representing changes related to the difference between detected acoustic events and acoustic events in an alignment window.
12. The system of claim 7, the operations further comprising:
producing a plurality of audio frame vectors by performing front-end feature extraction on an acoustic signal.
13. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
receiving a plurality of audio frame vectors that each model an audio waveform during a different period of time;
selecting a non-empty subset of the audio frame vectors;
obtaining a corresponding non-empty subset of detected acoustic event vectors that results from coding the subset of the audio frame vectors;
aligning the detected acoustic event vectors and a set of expected event vectors that correspond to a keyword to generate an output feature vector that characterizes an acoustic match between the detected acoustic event vectors and the expected event vectors; and
inputting the output feature vector into a keyword classifier.
14. The medium of claim 13, wherein the operations further comprise:
determining, using the keyword classifier, that a keyword was present in the audio waveform during an overall period of time modeled by the audio frame vectors.
15. The medium of claim 13, wherein the audio frame vectors are coded using a neural network.
16. The medium of claim 13, wherein the audio frame vectors are coded using a Gaussian mixture model.
17. The medium of claim 13, wherein aligning comprises extracting features to characterize the acoustic match, the features comprising one or more of: length of alignment, number of phones aligned, frame distance across phone boundaries, probability of the duration of each phone with respect to average duration of a phone in training data, speaker speaking rate, average acoustic score, worst acoustic score, best acoustic score, standard deviation of acoustic scores, start frame of the alignment, stability of the alignment, binary features representing changes related to the difference between detected acoustic events and expected acoustic events, and binary features representing changes related to the difference between detected acoustic events and acoustic events in an alignment window.
18. The medium of claim 13, the operations further comprising:
producing a plurality of audio frame vectors by performing front-end feature extraction on an acoustic signal.
US13/861,020 2012-12-19 2013-04-11 Keyword detection based on acoustic alignment Abandoned US20150279351A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/861,020 US20150279351A1 (en) 2012-12-19 2013-04-11 Keyword detection based on acoustic alignment

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201261739206P 2012-12-19 2012-12-19
US201361786251P 2013-03-14 2013-03-14
US201361788749P 2013-03-15 2013-03-15
US13/861,020 US20150279351A1 (en) 2012-12-19 2013-04-11 Keyword detection based on acoustic alignment

Publications (1)

Publication Number Publication Date
US20150279351A1 true US20150279351A1 (en) 2015-10-01

Family

ID=54191269

Family Applications (2)

Application Number Title Priority Date Filing Date
US13/861,020 Abandoned US20150279351A1 (en) 2012-12-19 2013-04-11 Keyword detection based on acoustic alignment
US13/860,982 Active 2034-10-21 US9378733B1 (en) 2012-12-19 2013-04-11 Keyword detection without decoding

Family Applications After (1)

Application Number Title Priority Date Filing Date
US13/860,982 Active 2034-10-21 US9378733B1 (en) 2012-12-19 2013-04-11 Keyword detection without decoding

Country Status (1)

Country Link
US (2) US20150279351A1 (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3157005A1 (en) 2015-10-16 2017-04-19 Google, Inc. Hotword recognition
US9747926B2 (en) 2015-10-16 2017-08-29 Google Inc. Hotword recognition
US10141010B1 (en) * 2015-10-01 2018-11-27 Google Llc Automatic censoring of objectionable song lyrics in audio
US10186265B1 (en) * 2016-12-06 2019-01-22 Amazon Technologies, Inc. Multi-layer keyword detection to avoid detection of keywords in output audio
US10311876B2 (en) 2017-02-14 2019-06-04 Google Llc Server side hotwording
CN110120230A (en) * 2019-01-08 2019-08-13 国家计算机网络与信息安全管理中心 A kind of acoustic events detection method and device
WO2019166296A1 (en) * 2018-02-28 2019-09-06 Robert Bosch Gmbh System and method for audio event detection in surveillance systems
CN110334244A (en) * 2019-07-11 2019-10-15 出门问问信息科技有限公司 A kind of method, apparatus and electronic equipment of data processing
US10650828B2 (en) 2015-10-16 2020-05-12 Google Llc Hotword recognition
CN111149154A (en) * 2019-12-24 2020-05-12 广州国音智能科技有限公司 Voiceprint recognition method, device, equipment and storage medium
CN111489759A (en) * 2020-03-23 2020-08-04 天津大学 Noise evaluation method based on optical fiber voice time domain signal waveform alignment
CN111798840A (en) * 2020-07-16 2020-10-20 中移在线服务有限公司 Voice keyword recognition method and device
CN112652306A (en) * 2020-12-29 2021-04-13 珠海市杰理科技股份有限公司 Voice wake-up method and device, computer equipment and storage medium
WO2021076130A1 (en) * 2019-10-17 2021-04-22 Hewlett-Packard Development Company, L.P. Keyword detections based on events generated from audio signals
US11216724B2 (en) * 2017-12-07 2022-01-04 Intel Corporation Acoustic event detection based on modelling of sequence of event subparts
US20230019595A1 (en) * 2020-02-07 2023-01-19 Sonos, Inc. Localized Wakeword Verification

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9235799B2 (en) 2011-11-26 2016-01-12 Microsoft Technology Licensing, Llc Discriminative pretraining of deep neural networks
WO2013149123A1 (en) * 2012-03-30 2013-10-03 The Ohio State University Monaural speech filter
US9536528B2 (en) 2012-07-03 2017-01-03 Google Inc. Determining hotword suitability
US9842585B2 (en) * 2013-03-11 2017-12-12 Microsoft Technology Licensing, Llc Multilingual deep neural network
CN108256651B (en) * 2013-06-28 2022-09-06 D-波系统公司 Method for quantum processing of data
US10438593B2 (en) * 2015-07-22 2019-10-08 Google Llc Individualized hotword detection models
US10679643B2 (en) * 2016-08-31 2020-06-09 Gregory Frederick Diamos Automatic audio captioning
US11182665B2 (en) * 2016-09-21 2021-11-23 International Business Machines Corporation Recurrent neural network processing pooling operation
JP7134949B2 (en) 2016-09-26 2022-09-12 ディー-ウェイブ システムズ インコーポレイテッド Systems, methods, and apparatus for sampling from a sampling server
US11531852B2 (en) 2016-11-28 2022-12-20 D-Wave Systems Inc. Machine learning systems and methods for training with noisy labels
KR20180068475A (en) 2016-12-14 2018-06-22 삼성전자주식회사 Method and device to recognize based on recurrent model and to train recurrent model
CN106683680B (en) * 2017-03-10 2022-03-25 百度在线网络技术(北京)有限公司 Speaker recognition method and device, computer equipment and computer readable medium
US10789942B2 (en) * 2017-10-24 2020-09-29 Nec Corporation Word embedding system
WO2019118644A1 (en) 2017-12-14 2019-06-20 D-Wave Systems Inc. Systems and methods for collaborative filtering with variational autoencoders
KR102605736B1 (en) * 2018-03-15 2023-11-27 한국전자통신연구원 Method and apparatus of sound event detecting robust for frequency change
US11386346B2 (en) 2018-07-10 2022-07-12 D-Wave Systems Inc. Systems and methods for quantum bayesian networks
US11461644B2 (en) 2018-11-15 2022-10-04 D-Wave Systems Inc. Systems and methods for semantic segmentation
US11468293B2 (en) 2018-12-14 2022-10-11 D-Wave Systems Inc. Simulating and post-processing using a generative adversarial network
US11900264B2 (en) 2019-02-08 2024-02-13 D-Wave Systems Inc. Systems and methods for hybrid quantum-classical computing
US11625612B2 (en) 2019-02-12 2023-04-11 D-Wave Systems Inc. Systems and methods for domain adaptation
KR102635469B1 (en) 2019-03-18 2024-02-13 한국전자통신연구원 Method and apparatus for recognition of sound events based on convolutional neural network
CN110164443B (en) * 2019-06-28 2021-09-14 联想(北京)有限公司 Voice processing method and device for electronic equipment and electronic equipment
IT201900015506A1 (en) * 2019-09-03 2021-03-03 St Microelectronics Srl Process of processing an electrical signal transduced by a speech signal, electronic device, connected network of electronic devices and corresponding computer product
CN114207710A (en) * 2019-10-15 2022-03-18 谷歌有限责任公司 Detecting and/or registering a thermal command to trigger a response action by an automated assistant
CN113948085B (en) * 2021-12-22 2022-03-25 中国科学院自动化研究所 Speech recognition method, system, electronic device and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7212968B1 (en) * 1999-10-28 2007-05-01 Canon Kabushiki Kaisha Pattern matching method and apparatus
GB0027178D0 (en) * 2000-11-07 2000-12-27 Canon Kk Speech processing system
US20040210437A1 (en) * 2003-04-15 2004-10-21 Aurilab, Llc Semi-discrete utterance recognizer for carefully articulated speech
FR2940498B1 (en) * 2008-12-23 2011-04-15 Thales Sa METHOD AND SYSTEM FOR AUTHENTICATING A USER AND / OR CRYPTOGRAPHIC DATA
US20120155663A1 (en) * 2010-12-16 2012-06-21 Nice Systems Ltd. Fast speaker hunting in lawful interception systems
US9202465B2 (en) * 2011-03-25 2015-12-01 General Motors Llc Speech recognition dependent on text message content

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10141010B1 (en) * 2015-10-01 2018-11-27 Google Llc Automatic censoring of objectionable song lyrics in audio
US10262659B2 (en) 2015-10-16 2019-04-16 Google Llc Hotword recognition
US9934783B2 (en) 2015-10-16 2018-04-03 Google Llc Hotword recognition
EP3157005A1 (en) 2015-10-16 2017-04-19 Google, Inc. Hotword recognition
US9928840B2 (en) 2015-10-16 2018-03-27 Google Llc Hotword recognition
US10650828B2 (en) 2015-10-16 2020-05-12 Google Llc Hotword recognition
EP3157006A1 (en) 2015-10-16 2017-04-19 Google, Inc. Hotword recognition
EP3751561A2 (en) 2015-10-16 2020-12-16 Google LLC Hotword recognition
EP3157009A1 (en) 2015-10-16 2017-04-19 Google, Inc. Hotword recognition
US9747926B2 (en) 2015-10-16 2017-08-29 Google Inc. Hotword recognition
US10186265B1 (en) * 2016-12-06 2019-01-22 Amazon Technologies, Inc. Multi-layer keyword detection to avoid detection of keywords in output audio
US11699443B2 (en) 2017-02-14 2023-07-11 Google Llc Server side hotwording
US10706851B2 (en) 2017-02-14 2020-07-07 Google Llc Server side hotwording
US11049504B2 (en) 2017-02-14 2021-06-29 Google Llc Server side hotwording
US10311876B2 (en) 2017-02-14 2019-06-04 Google Llc Server side hotwording
US11216724B2 (en) * 2017-12-07 2022-01-04 Intel Corporation Acoustic event detection based on modelling of sequence of event subparts
WO2019166296A1 (en) * 2018-02-28 2019-09-06 Robert Bosch Gmbh System and method for audio event detection in surveillance systems
US11810435B2 (en) 2018-02-28 2023-11-07 Robert Bosch Gmbh System and method for audio event detection in surveillance systems
CN110120230B (en) * 2019-01-08 2021-06-01 国家计算机网络与信息安全管理中心 Acoustic event detection method and device
CN110120230A (en) * 2019-01-08 2019-08-13 国家计算机网络与信息安全管理中心 A kind of acoustic events detection method and device
CN110334244A (en) * 2019-07-11 2019-10-15 出门问问信息科技有限公司 A kind of method, apparatus and electronic equipment of data processing
WO2021076130A1 (en) * 2019-10-17 2021-04-22 Hewlett-Packard Development Company, L.P. Keyword detections based on events generated from audio signals
WO2021127994A1 (en) * 2019-12-24 2021-07-01 广州国音智能科技有限公司 Voiceprint recognition method, apparatus and device, and storage medium
CN111149154A (en) * 2019-12-24 2020-05-12 广州国音智能科技有限公司 Voiceprint recognition method, device, equipment and storage medium
US20230019595A1 (en) * 2020-02-07 2023-01-19 Sonos, Inc. Localized Wakeword Verification
US11961519B2 (en) * 2020-02-07 2024-04-16 Sonos, Inc. Localized wakeword verification
CN111489759A (en) * 2020-03-23 2020-08-04 天津大学 Noise evaluation method based on optical fiber voice time domain signal waveform alignment
CN111798840A (en) * 2020-07-16 2020-10-20 中移在线服务有限公司 Voice keyword recognition method and device
CN112652306A (en) * 2020-12-29 2021-04-13 珠海市杰理科技股份有限公司 Voice wake-up method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
US9378733B1 (en) 2016-06-28

Similar Documents

Publication Publication Date Title
US20150279351A1 (en) Keyword detection based on acoustic alignment
US9202462B2 (en) Key phrase detection
US11942083B2 (en) Recognizing speech in the presence of additional audio
US10930270B2 (en) Processing audio waveforms
US11900948B1 (en) Automatic speaker identification using speech recognition features
US20210117797A1 (en) Training multiple neural networks with different accuracy
US9754584B2 (en) User specified keyword spotting using neural network feature extractor
US9466289B2 (en) Keyword detection with international phonetic alphabet by foreground model and background model
US9715660B2 (en) Transfer learning for deep neural network based hotword detection
US11069352B1 (en) Media presence detection
US9799325B1 (en) Methods and systems for identifying keywords in speech signal
US20160012819A1 (en) Server-Side ASR Adaptation to Speaker, Device and Noise Condition via Non-ASR Audio Transmission
US9263033B2 (en) Utterance selection for automated speech recognizer training
CN109065026B (en) Recording control method and device
CN113327596B (en) Training method of voice recognition model, voice recognition method and device
KR102069693B1 (en) Apparatus and method for recognizing natural language dialogue speech

Legal Events

Date Code Title Description
AS Assignment

Owner name: GOOGLE INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:NGUYEN, PATRICK AN PHU;SAN MARTIN, MARIA CAROLINA PARADA;SCHALKWYK, JOHAN;SIGNING DATES FROM 20130327 TO 20130409;REEL/FRAME:030560/0200

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: GOOGLE LLC, CALIFORNIA

Free format text: CHANGE OF NAME;ASSIGNOR:GOOGLE INC.;REEL/FRAME:044144/0001

Effective date: 20170929