US11942091B2 - Alphanumeric sequence biasing for automatic speech recognition using a grammar and a speller finite state transducer - Google Patents
- Publication number
- US11942091B2 (Application US17/251,465)
- Authority
- US
- United States
- Prior art keywords
- contextual
- finite state
- generating
- alphanumeric
- grammar
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/193—Formal grammars, e.g. finite state automata, context free grammars or word networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
Definitions
- ASR: automatic speech recognition
- ASR systems can include an ASR model for use in generating a set of candidate recognitions.
- the ASR system can select generated text from the set of candidate recognitions.
- Humans may engage in human-to-computer dialog with interactive software applications referred to herein as “automated assistants” (also referred to as “digital agents”, “chatbots”, “interactive personal assistants”, “intelligent personal assistants”, “assistant applications”, “conversational agents”, etc.).
- humans (which when they interact with automated assistants may be referred to as “users”) may provide commands and/or requests to an automated assistant using spoken natural language input (i.e., utterances), which may in some cases be converted into text (e.g., converted into text using ASR techniques) and then processed.
- Implementations disclosed herein are directed towards determining a text representation of a spoken utterance, that includes an alphanumeric sequence, using contextual biasing finite state transducers (FSTs).
- Contextual biasing FSTs are also referred to herein as contextual FSTs, contextual written domain alphanumeric grammar FSTs, and written domain alphanumeric grammar FSTs.
- the spoken utterance of “tell me about my flight to DAY” includes the alphanumeric sequence ‘DAY’ that indicates the Dayton International Airport.
- Alphanumeric sequences, as used herein, can include a combination of alphabet characters and/or numeric characters.
- alphanumeric sequences have an expected character length, an expected pattern of alphabet characters and/or numeric characters, and/or one or more additional expected characteristics.
- an airport code is an alphanumeric sequence with an expected length of three characters, and with an expected pattern of a first alphabet character, a second alphabet character, and a third alphabet character.
- a tracking number is an alphanumeric sequence with an expected length and/or an expected pattern of alphabet characters and/or numeric characters.
- Alphanumeric sequences can include tracking numbers, airport codes, radio station call numbers, personal identification numbers, times, zip codes, phone numbers, and/or additional or alternative sequences.
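The expected length/pattern constraints above can be sketched with simple validators. The patterns below are illustrative assumptions for a few sequence types, not the patent's actual grammars:

```python
import re

# Hypothetical validators for a few alphanumeric sequence types; each pattern
# encodes an expected length and an expected character pattern.
PATTERNS = {
    "airport_code": re.compile(r"[A-Z]{3}"),        # three alphabet characters
    "us_zip_code": re.compile(r"[0-9]{5}"),         # five numeric characters
    "radio_call_sign": re.compile(r"[KW][A-Z]{2,3}"),  # assumed US call-sign shape
}

def matches_expected_pattern(kind: str, candidate: str) -> bool:
    """Return True if candidate fits the expected length/pattern for kind."""
    return bool(PATTERNS[kind].fullmatch(candidate))
```

For example, `matches_expected_pattern("airport_code", "DAY")` holds, while a sequence mixing in a digit such as `"D4Y"` violates the expected all-alphabet pattern.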
- An automatic speech recognition (ASR) system can be used to generate the text representation of the spoken utterance.
- the ASR system can be trained using a set of training instances, where the set of training instances does not include the alphanumeric sequence. Additionally or alternatively, the ASR system can be trained using a set of training instances, where the alphanumeric sequence occurs a number of times below a threshold value in the set of training instances.
- contextual biasing FSTs in accordance with implementations described herein can increase the recognition of alphanumeric sequences not present in the training data and/or infrequently encountered in the training data used to train the ASR system, thus increasing the accuracy of the ASR system without necessitating additional training of the ASR system based on such alphanumeric sequences.
- the ASR system can process the spoken utterance using an ASR model portion of the ASR system to generate one or more candidate recognitions of the spoken utterance.
- a variety of ASR models can be used in generating candidate recognition(s) of the spoken utterance including a recurrent neural network (RNN) model, a recurrent neural network-transformer (RNN-T) model, a sequence to sequence model, a listen attend spell (LAS) model, and/or one or more additional models for use in generating candidate recognitions of spoken utterances.
- the ASR system can rescore the probabilities of the one or more candidate recognitions (e.g., increase and/or decrease probabilities of the candidate recognition(s)).
- the candidate recognitions and/or the rescored candidate recognitions can be used by the ASR system to generate the text representation of the spoken utterance.
- the ASR system can include a beam search portion used in rescoring the candidate recognitions.
- the ASR system can modify the beam search portion using contextual biasing FST(s).
- the beam search portion can iteratively interact with the ASR model in building candidate recognition(s) one unit at a time.
- the ASR system can include a RNN-T model as the ASR model.
- the ASR system can modify the candidate recognition(s) generated using the RNN-T model using the beam search portion, where the beam search portion is modified using the contextual FST(s) via shallow fusion.
- the contextual biasing FST can be selected using a FST engine based on contextual information corresponding to the spoken utterance. Put another way, the contextual information can be used, directly or indirectly, in selecting one or more contextual biasing FSTs that are mapped to the contextual information.
- the spoken utterance can be processed using a context engine to determine relevant contextual information. Additionally or alternatively, additional information can be processed using the context engine to determine the contextual information.
- Additional information can include, for example: one or more previous turns in a dialog between a user and a computing system that precede the spoken utterance; information about the user stored in a user profile; information about the computing system; information about one or more networked hardware devices (e.g., a networked smart thermostat, a networked smart oven, a networked camera, etc.); and/or additional information relating to the context of the utterance.
- For example, contextual information relating to thermostat temperature can be determined based on a spoken utterance of “turn the temperature up to 72 degrees”.
- the contextual information relating to thermostat temperature can be based on presence, in the spoken utterance, of certain term(s) that are mapped (individually or collectively) to thermostat temperature, such as “turn”, “temperature”, “up” and/or “degrees”.
- additional information related to a networked smart thermostat can be used in determining the contextual information.
- additional information can be utilized based on the networked smart thermostat being linked with a device via which the spoken utterance is received.
- the spoken utterance of “turn the temperature up to 72 degrees” and the additional information about the networked smart thermostat can be processed using the context engine to determine the thermostat temperature contextual information.
- This contextual information can be processed using the FST engine to select the temperature contextual biasing FST.
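The context-engine step described above can be sketched as a term-matching lookup over the utterance plus additional device information. The term lists, context labels, and device names below are illustrative assumptions:

```python
# Minimal sketch of a context engine: terms in the utterance (and optionally
# linked networked devices) map to a contextual-information label.
CONTEXT_TERMS = {
    "thermostat_temperature": {"turn", "temperature", "up", "down", "degrees"},
    "flight": {"flight", "airport", "gate"},
}

def determine_context(utterance, linked_devices=()):
    """Return the contextual-information labels suggested by the utterance."""
    tokens = set(utterance.lower().split())
    contexts = [label for label, terms in CONTEXT_TERMS.items() if tokens & terms]
    # Additional information: a linked smart thermostat also suggests
    # temperature context even when no matching term is spoken.
    if "smart_thermostat" in linked_devices and "thermostat_temperature" not in contexts:
        contexts.append("thermostat_temperature")
    return contexts
```

With these assumed term lists, “turn the temperature up to 72 degrees” yields the thermostat temperature context, and “tell me about my flight to DAY” yields the flight context.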
- the temperature contextual biasing FST can have an expected length of one or two digits, an expected structure of a first numeric character followed by a second numeric character, and an expected range of values (e.g., an expected range between 50 and 90).
- the FST engine can select multiple contextual biasing FSTs related to the contextual information. For instance, the FST engine can select a Fahrenheit temperature contextual biasing FST and a Celsius temperature contextual biasing FST.
- the Fahrenheit temperature contextual biasing FST can have an expected range of Fahrenheit temperatures while the Celsius contextual biasing FST can have a distinct expected range of Celsius temperatures.
- a contextual biasing FST can be generated based on an alphanumeric grammar FST.
- the alphanumeric grammar FST can accept valid alphanumeric sequences for the type of alphanumeric sequence.
- an airport code alphanumeric grammar for an airport code alphanumeric sequence can accept a variety of combinations of a three character sequence of a first alphabet character, a second alphabet character, and a third alphabet character.
- the airport code alphanumeric grammar may only accept the subset of possible three letter combinations which correspond to valid airport codes.
- the airport alphanumeric grammar can be constructed to only accept three character sequences of alphabet characters that correspond to an actual airport code instead of accepting all of the over 17,000 potential three character sequences of alphabet characters.
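A grammar restricted to actual airport codes can be sketched as membership in a code inventory rather than acceptance of all 26³ = 17,576 three-letter sequences. The code list below is a tiny illustrative subset:

```python
# Illustrative subset of valid airport codes; a real grammar would be built
# from the full inventory of roughly 17,000+ possibilities is avoided by
# accepting only codes that actually exist.
VALID_AIRPORT_CODES = {"DAY", "LAX", "JFK", "SFO", "ORD"}

def airport_grammar_accepts(sequence: str) -> bool:
    """Accept only well-formed sequences that are also actual airport codes."""
    return len(sequence) == 3 and sequence.isalpha() and sequence in VALID_AIRPORT_CODES
```

Here `"ZZZ"` is rejected even though it fits the three-alphabet-character pattern, because it is not in the inventory.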
- an unweighted wordpiece based acceptor grammar can be generated based on the alphanumeric grammar FST and a speller FST, where the speller FST can map wordpieces to their constituent graphemes.
- the unweighted wordpiece based acceptor grammar can accept valid wordpiece tokenizations of the alphanumeric grammar. For example, an unweighted wordpiece based acceptor grammar can accept “LAX”, “L AX”, or “LA X” as valid wordpiece tokenizations of the airport code “LAX” indicating the Los Angeles International Airport.
- a factored FST can be generated by applying a factor operation to the unweighted wordpiece based acceptor grammar, where all paths leading to any given state in the factored FST traverse the same number of arcs.
- the contextual biasing FST can be generated by applying constant weights per arc to the factored FST. Additionally or alternatively, the contextual biasing FST can be generated by adding, to the factored FST, failure transitions whose weights equal the cumulative weight accumulated up to each state.
- various implementations set forth techniques for processing audio data to generate text representations of alphanumeric sequences—and do so in a manner that enables speech recognition to be more efficient and/or accurate.
- Implementations disclosed herein increase the accuracy of text representations of alphanumeric sequences even when the alphanumeric sequence has a low prior probability (i.e., is infrequently encountered in the training data used to train the ASR system) or is out of vocabulary for the underlying ASR system.
- accurate speech recognition can be performed for a given alphanumeric sequence without prior utilization of resources in training the ASR system on multiple (or even any) training examples that include the given alphanumeric sequence.
- accurately recognizing captured audio data can conserve system resources (e.g., processor cycles, memory, battery power, and/or additional resources of a computing system) by eliminating the need for a user to repeat audio data incorrectly recognized by the system (or manually type a correction to an incorrect recognition). This shortens the duration of user input and the duration of a user-system dialog. Yet further, the user experience can be improved by eliminating the need for the user to repeat audio data incorrectly recognized by the system.
- FIG. 1 A illustrates an example of generating a text representation of captured audio data in accordance with various implementations disclosed herein.
- FIG. 1 B illustrates an example environment where various implementations disclosed herein can be implemented.
- FIG. 2 illustrates an example of an alphanumeric grammar FST in accordance with various implementations disclosed herein.
- FIG. 3 illustrates an example of an unweighted wordpiece based acceptor grammar in accordance with various implementations disclosed herein.
- FIG. 4 illustrates an example of a factored FST in accordance with various implementations disclosed herein.
- FIG. 5 illustrates an example of a contextual biasing FST in accordance with various implementations disclosed herein.
- FIG. 6 is a flowchart illustrating an example process in accordance with implementations disclosed herein.
- FIG. 7 illustrates another example environment in which implementations disclosed herein can be implemented.
- FIG. 8 illustrates another example environment in which various implementations disclosed herein can be implemented.
- FIG. 9 illustrates an example architecture of a computing device.
- Accurate recognition of alphanumeric sequences is an important task across a number of applications that use automatic speech recognition (ASR). Recognizing alphanumeric sequences such as addresses, times, dates, digit sequences, etc. can be a difficult task that is often needed in specific contexts, such as a dialog context. For example, a user might give a command to a voice-controlled digital assistant to set an alarm or calendar event, and in turn be asked to specify the alarm time or calendar event duration. In these kinds of contextual scenarios, speech recognition errors can significantly affect user experience.
- a location can be used as a contextual signal, as zip codes and phone numbers in different countries might have different formats.
- in a dialog turn, a user might be asked to enter a digit sequence of a specific length.
- An ASR system that fails to take advantage of this contextual knowledge can potentially provide poor recognition accuracy of numeric entities.
- Contextual information can be used to improve recognition accuracy of conventional ASR systems.
- Various contextual signals (e.g., location, dialog state, surface, etc.) can be used to improve recognition accuracy.
- On-the-fly rescoring and finite state transducer contextual models can be used in contextual ASR systems.
- Conventional ASR systems have addressed improving quality on numeric sequences using verbalizers and class based systems together with contextual information.
- contextual information can be utilized in end to end (E2E) ASR systems.
- For example, a general approach to contextual speech recognition in E2E ASR systems can utilize a shallow fusion language model.
- An additional and/or alternative technique can include adapting the inference process to take advantage of contextual signals by adjusting the output likelihoods at each step in the beam search.
- the proposed method can be evaluated on a Listen Attend Spell (LAS) E2E model showing its effectiveness at incorporating context into the prediction of an E2E system.
- the system can be evaluated using a recurrent neural network transducer (RNN-T).
- the output likelihoods are rescored by an external model before pruning.
- the external model can provide probabilities for wordpiece outputs as either positive or negative weights.
- the system increases probabilities of prefixes of words present in contextual models, and retracts the probabilities for failed matches. With this method, the probabilities of paths that could potentially lead to a word from a contextual model are increased during beam search. If a prefix stops matching, the score is retracted. This technique has proven promising in a wide variety of contextual speech applications, including communication and media tasks.
- a limitation in this system is that it can only build contextual models from a list of phrases.
- To represent a numeric class (e.g., a 10-digit numeric sequence), all 10-digit numbers would need to be enumerated. This is prohibitively computationally expensive (e.g., memory-wise and/or processor-wise) for classes with large numbers of elements (e.g., zip codes, phone numbers, etc.).
- E2E ASR systems do not utilize explicit verbalizers or lexicons and have only limited contextual ASR capabilities. As a result, they might not perform as well on certain classes of alphanumeric entities that are not well represented in E2E model training data and are needed only in certain contexts, especially long-tail numeric entities with complex verbalization rules.
- Various implementations are directed towards contextual speech recognition techniques for recognizing alphanumeric entities using shallow fusion between an E2E speech recognizer and written domain alphanumeric grammars represented as finite state transducers.
- An approach is provided for automatic conversion of grammars representing alphanumeric entities in the written domain to FSTs that can be used to increase the probability of hypotheses matching these grammars during beam search in E2E speech recognition systems.
- An E2E RNN-T based ASR system can be utilized in some implementations.
- Contextual ASR can be achieved by fusing a wordpiece based RNN-T with a contextual FST during beam search.
- Described herein is a technique for adding alphanumeric context to E2E ASR systems. This technique addresses building alphanumeric contextual FSTs from numeric grammars and integrating them into a contextual E2E ASR system.
- An alphanumeric grammar can be processed to generate a corresponding alphanumeric grammar based contextual FST.
- the resulting contextual FSTs can be stored in an alphanumeric contextual FST inventory.
- contextual information can be analyzed in order to fetch contextually relevant FSTs from this inventory.
- These FSTs provide weights which are interpolated with the RNN-T weights during beam search to increase the probability of hypotheses matching the given numeric grammar.
- Contextual information used in order to determine if an alphanumeric contextual FST should be fetched can be, for example, a particular dialog state. For example, if a user was prompted for an alarm time, the time numeric contextual FST can be fetched from the inventory. Similarly, contextual information could be the type of device the user is using. If the user is using a smart radio system, the radio station contextual FST will be fetched and used for contextual ASR.
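The inventory fetch described above can be sketched as a mapping from contextual signals to prebuilt biasing FSTs. The signal names and the stand-in string labels for the FST objects are illustrative assumptions:

```python
# Sketch of an FST engine's inventory lookup: contextual signals such as a
# dialog state or device type key into prebuilt contextual biasing FSTs.
# Plain strings stand in for the actual FST objects here.
FST_INVENTORY = {
    "prompted_for_alarm_time": ["time_fst"],
    "smart_radio_device": ["radio_station_fst"],
    "prompted_for_thermostat": ["fahrenheit_temperature_fst", "celsius_temperature_fst"],
}

def fetch_contextual_fsts(signals):
    """Return every biasing FST mapped to any active contextual signal."""
    fsts = []
    for signal in signals:
        fsts.extend(FST_INVENTORY.get(signal, []))
    return fsts
```

Note that a single signal can fetch multiple FSTs, mirroring the Fahrenheit/Celsius example earlier, and unknown signals simply fetch nothing.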
- To build a contextual FST for a particular type of alphanumeric entity, various implementations begin with a written domain alphanumeric grammar.
- the technique is illustrated with an example of a simple grammar that accepts the strings “200” or “202”.
- alphanumeric grammars could be complex, representing, for instance, a time, a percentage, a digit sequence, etc.
- the written domain alphanumeric grammar is denoted as G.
- the character _ is used to denote the start of a word.
- the alphanumeric grammar FST needs to be converted to wordpieces using a technique similar to a “speller” technique.
- the speller FST S transduces from wordpieces to the constituent graphemes of that wordpiece.
- an unweighted wordpiece based acceptor grammar U can be created using the FST operations in Equation 1.
- the unweighted wordpiece based acceptor grammar FST accepts all valid wordpiece level tokenizations of strings in the grammar G.
- the number “_200” can be tokenized into the wordpieces “_20 0”, “_2 0 0”, or “_2 00”.
- U = min(det(project(S ∘ G)))  (1)
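The effect of Equation 1 can be sketched without an FST library by directly enumerating every segmentation of a grammar string into wordpieces; U accepts exactly these segmentations. The tiny wordpiece vocabulary below is an illustrative assumption (“_” marks the start of a word):

```python
# Assumed toy wordpiece vocabulary covering the grammar strings "_200"/"_202".
VOCAB = {"_2", "_20", "_200", "0", "00", "2"}

def wordpiece_tokenizations(text):
    """Return all segmentations of text into vocabulary wordpieces."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in VOCAB:
            for rest in wordpiece_tokenizations(text[end:]):
                results.append([piece] + rest)
    return results
```

For “_200” this yields the tokenizations “_2 0 0”, “_2 00”, “_20 0”, and “_200”, matching the valid wordpiece-level tokenizations the acceptor grammar U would admit.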
- the contextual FST can have failure transitions to subtract penalties for partial matches. Therefore, all paths leading to the same state should have the same cumulative weight. Applying a constant weight per matching arc in the contextual FST can result in a problem of ambiguous failure weights.
- the FactorFST operation can be used with factor weight one to ensure that at any state, the cumulative path weight to that state is the same. Note that at any state, the cumulative number of arcs traversed to reach that state is now the same, independent of the path taken to reach that state.
- An alternative solution to the factor operation would be to create a tree FST. However, in some implementations, such a tree can grow exponentially large for simple grammars such as N-digit numbers.
- weights and failure transitions are applied to the factored FST resulting in the weighted contextual FST C.
- a constant weight per arc can be applied, however other strategies might apply a weight as a function of the number of arcs traversed to reach a given state. Both strategies result in a constant cumulative path weight at any given state, resulting in consistent weights for failure transitions. Note that for this FST, different tokenizations of the same numeric sequence get different weights. Additionally or alternatively, weights can be applied as a function of the grapheme depth (the cumulative number of graphemes required to reach a state in the FST), which would result in all paths that spell the same numeric sequence having equal weight, a potentially desirable property for some use cases.
- the weights in the contextual FST can be used to modify probabilities during beam search according to Equation 2.
- the log probability of a label y is denoted s(y).
- s(y) is modified by the factor λ log p_c(y), where λ is an interpolation weight, and log p_c(y) is the weight provided by the contextual FST, in the case where the label y matches an arc in C. If the label y does not match an arc in C, the log probability s(y) is unmodified.
- s(y) = log p(y) − λ log p_c(y)  (2)
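Equation 2 can be sketched as a small scoring function; the interpolation weight value and the way an arc match is signaled (passing the contextual weight, or None for no match) are stand-in assumptions:

```python
def biased_score(log_p_y, contextual_log_p=None, lam=2.0):
    """Equation 2 sketch: s(y) = log p(y) - lam * log p_c(y) when y matches
    an arc in the contextual FST C; otherwise s(y) is unmodified."""
    if contextual_log_p is None:   # label y matches no arc in C
        return log_p_y
    return log_p_y - lam * contextual_log_p
```

Because log p_c(y) is negative, subtracting λ·log p_c(y) raises the score of matching labels, boosting hypotheses that fit the grammar during beam search.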
- The process for building and using contextual FSTs can be summarized as follows:
  1. Build an acyclic character based FST G.
  2. Construct a “speller” FST S, mapping wordpieces to their constituent graphemes.
  3. Use S and G to compute an unweighted wordpiece based acceptor grammar U, as described in Equation 1.
  4. Apply the factor operation to produce a factored FST F, ensuring that any path leading to a given state traverses the same number of arcs.
  5. Apply constant weights per arc, and failure transitions equal to the cumulative weight accumulated up to each state, to obtain the contextual biasing FST C.
  6. Use the contextual FST C to modify the probabilities of matching labels during beam search, as described in Equation 2.
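The per-arc weighting and failure retraction of the final steps can be sketched in pure Python, with a wordpiece trie standing in for the factored FST. Real implementations use weighted FSTs (e.g., OpenFst); the bonus value is assumed, and restarting a match after a failure is omitted for brevity:

```python
PER_ARC_BONUS = 1.0  # assumed constant weight per matching arc

class ContextualBiaser:
    """Trie-based stand-in for the contextual biasing FST C: each matching
    wordpiece arc earns a bonus; a failed prefix retracts the cumulative bonus."""

    def __init__(self, tokenized_phrases):
        self.trie = {}
        for pieces in tokenized_phrases:        # e.g. [["_20", "0"], ["_2", "0", "2"]]
            node = self.trie
            for piece in pieces:
                node = node.setdefault(piece, {})
        self.state = self.trie
        self.accumulated = 0.0                  # bonus accumulated on the current path

    def step(self, label):
        """Return the score adjustment for emitting `label` during beam search."""
        if label in self.state:                 # matching arc: apply per-arc bonus
            self.state = self.state[label]
            self.accumulated += PER_ARC_BONUS
            return PER_ARC_BONUS
        if not self.state:                      # complete match: keep the bonus
            self.state, self.accumulated = self.trie, 0.0
            return 0.0
        retracted = -self.accumulated           # failure transition: retract bonus
        self.state, self.accumulated = self.trie, 0.0
        return retracted
```

A hypothesis that follows “_20 0” to completion keeps its accumulated boost, while one that matches “_20” and then diverges has the boost taken back, mirroring the failure-transition behavior described above.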
- FIG. 1 A illustrates an example of generating a text representation of audio data using contextual biasing finite state transducers in accordance with various implementations.
- audio data including one or more alphanumeric sequences 102 is processed using a context engine 104 and an ASR model 114 .
- Audio data 102 can be captured audio data such as spoken natural language captured by a microphone of a client device.
- Context engine 104 can process audio data 102 to generate contextual information 108 corresponding to the processed audio data.
- context engine 104 can determine contextual information relating to a flight based on the spoken utterance of “tell me about my flight to DAY”, where DAY is an airport code alphanumeric sequence indicating the Dayton International Airport.
- context engine 104 can process additional information 106 to generate contextual information 108 .
- Additional information 106 can include one or more previous turns in the user-computing system dialog preceding audio data 102 ; user information stored in a user profile; client device information; and/or additional information relating to the current dialog session, a previous dialog session, the device, the user, and/or other contextual information.
- additional information 106 can include a calendar entry indicating a flight to Dayton, Ohio.
- context engine 104 can process additional information 106 in addition to audio data 102 to generate contextual information 108 .
- context engine 104 can process audio data 102 of “tell me about my flight to DAY” and additional information 106 of a calendar entry indicating a flight to Dayton, Ohio to generate contextual information 108 .
- context engine 104 can process additional information 106 without processing audio data 102 to generate contextual information 108 .
- FST engine 110 can process contextual information 108 to determine one or more contextual FSTs 112 corresponding to the contextual information.
- Contextual FSTs 112 can be used to modify the probabilities of one or more candidate text recognitions 116 of the audio data 102 .
- FST engine 110 can determine an airport code contextual FST 112 based on contextual information 108 indicating a flight.
- audio data 102 can be processed using an ASR engine.
- audio data 102 can be processed using an ASR model 114 portion of the ASR engine to generate one or more candidate text recognitions 116 of the audio data.
- ASR model 114 can include a variety of machine learning models including a recurrent neural network (RNN) model, a recurrent neural network-transformer (RNN-T) model, a sequence to sequence model, a listen attend spell (LAS) model, and/or one or more additional models for use in generating candidate recognitions of spoken utterances.
- the ASR engine can include a beam search portion which can be utilized in selecting a text representation from the candidate recognitions 116 .
- contextual FST(s) can modify the probabilities of one or more candidate recognitions 116 by modifying the beam search portion of the ASR engine 118 .
- contextual FST(s) 112 can be used to modify the beam search portion of the ASR engine 118 via shallow fusion.
- the airport code contextual FST can be used to increase the probability of candidate text recognition(s) of “to DAY” and/or decrease the probability of candidate text recognition(s) of “today”.
- FIG. 1 B illustrates an example environment in which various implementations disclosed herein may be implemented.
- Example environment 150 includes a client device 152 .
- Client device 152 can include ASR engine 154 , context engine 104 , FST engine 110 , additional or alternative engine(s) (not depicted), ASR model 114 , and/or additional or alternative model(s) (not depicted). Additionally or alternatively, client device 152 may be associated with contextual FST(s) 112 .

- contextual FST(s) 112 can be stored locally on a client device and can be used to modify the probability of candidate recognitions of spoken utterances, such as a spoken utterance including an alphanumeric sequence, where the candidate recognitions are used in generating a text representation of the spoken utterance.
- client device 152 may include user interface input/output devices (not depicted), which may include, for example, a physical keyboard, a touch screen (e.g., implementing a virtual keyboard or other textual input mechanisms), a microphone, a camera, a display screen, and/or speaker(s).
- the user interface input/output devices may be incorporated with one or more client devices 152 of a user.
- a mobile phone of the user may include the user interface input/output devices; a standalone digital assistant hardware device may include the user interface input/output devices; a first computing device may include the user interface input device(s) and a separate computing device may include the user interface output device(s); etc.
- all or aspects of client device 152 may be implemented on a client device that also contains the user interface input/output devices.
- client device 152 may include an automated assistant (not depicted), and all or aspects of the automated assistant may be implemented on computing device(s) that are separate and remote from the client device that contains the user interface input/output devices (e.g., all or aspects may be implemented “in the cloud”).
- those aspects of the automated assistant may communicate with the computing device via one or more networks such as a local area network (LAN) and/or a wide area network (WAN) (e.g., the Internet).
- client device 152 can include one or more of: a desktop computing device, a laptop computing device, a standalone hardware device at least in part dedicated to an automated assistant, a tablet computing device, a mobile phone computing device, a computing device of a vehicle (e.g., an in-vehicle communications system, an in-vehicle entertainment system, an in-vehicle navigation system), or a wearable apparatus of the user that includes a computing device (e.g., a watch of the user having a computing device, glasses of the user having a computing device, a virtual or augmented reality computing device). Additional and/or alternative computing systems may be provided.
- Client device 152 may include one or more memories for storage of data and software applications, one or more processors for accessing data and executing applications, and other components that facilitate communication over a network.
- the operations performed by client device 152 may be distributed across multiple computing devices. For example, computer programs running on one or more computers in one or more locations can be coupled to each other through a network.
- contextual FST generation engine 156 can generate contextual FST(s) 112 .
- contextual FST generation engine 156 can determine a written domain alphanumeric grammar FST which can accept an alphanumeric sequence.
- alphanumeric grammar FST 200 of FIG. 2 can accept the alphanumeric sequences of “200” and “202”.
- alphanumeric grammar FST is an acyclic directed graph.
- contextual FST generation engine 156 can generate the alphanumeric grammar FST. Additionally or alternatively, contextual FST generation engine 156 can select an alphanumeric grammar FST generated by an additional engine.
- Contextual FST generation engine 156 can determine a speller FST which maps wordpieces to the constituent graphemes of that wordpiece. In some implementations, contextual FST generation engine 156 can generate the speller FST. Additionally or alternatively, contextual FST generation engine 156 can select a speller FST generated by an additional engine (not depicted). In some implementations, contextual FST generation engine 156 can generate an unweighted wordpiece based acceptor grammar FST using the speller FST and the alphanumeric grammar FST. The unweighted wordpiece acceptor grammar may accept valid wordpiece level tokenizations of strings of the alphanumeric grammar FST. For example, unweighted wordpiece based acceptor grammar 300 of FIG. 3 can accept the strings “2 00”, “2 0 0”, “2 0 2”, “20 2”, and “20 0”.
- contextual FST generation engine 156 can generate a factored FST by applying a factor operation to the unweighted wordpiece based acceptor grammar, where any path leading to a given state in the factored FST traverses the same number of arcs. For example, all paths leading to a given state in factored FST 400 of FIG. 4 have the same length. Furthermore, contextual FST generation engine 156 can generate contextual FST(s) 112 by applying a constant weight per arc to the factored FST. Contextual biasing FST 500 of FIG. 5 is an illustrative example of a contextual FST.
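The pipeline above can be imitated end to end in plain Python. The sketch below is a toy stand-in for the actual FST operations: a small assumed wordpiece inventory, a speller that strips the "_" marker to recover graphemes, enumeration of valid wordpiece tokenizations, and a constant weight per arc:

```python
# Toy stand-in for the contextual-FST generation pipeline: grammar strings
# -> valid wordpiece tokenizations (via a "speller" mapping wordpieces to
# graphemes) -> one constant weight per arc. The wordpiece inventory and
# the weight are illustrative assumptions.
WORDPIECES = ["_2", "_20", "0", "2", "00", "20"]
SPELLER = {wp: wp.lstrip("_") for wp in WORDPIECES}  # wordpiece -> graphemes

def tokenizations(graphemes, first=True):
    """Enumerate wordpiece sequences whose spellings concatenate to graphemes."""
    if not graphemes:
        return [[]]
    results = []
    for wp in WORDPIECES:
        if wp.startswith("_") != first:
            continue  # "_"-prefixed pieces may only start a sequence
        spelled = SPELLER[wp]
        if graphemes.startswith(spelled):
            for rest in tokenizations(graphemes[len(spelled):], first=False):
                results.append([wp] + rest)
    return results

def weighted_paths(grammar_strings, weight_per_arc=1.0):
    """Assign each accepted wordpiece path a cumulative per-arc weight."""
    return {
        tuple(toks): weight_per_arc * len(toks)
        for s in grammar_strings
        for toks in tokenizations(s)
    }
```

For the grammar strings "200" and "202" this yields the five wordpiece paths of FIG. 3, each weighted by its arc count.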
- audio data such as audio data captured using a microphone of client device 152 can be processed using context engine 104 to determine contextual information corresponding to the audio data.
- context engine 104 can determine contextual information based on additional information.
- additional information can include one or more previous turns, preceding the audio data, in a dialog between component(s) of the client device and the user; user information stored in a user profile; client device information; and/or additional information relating to the current dialog session, a previous dialog session, the device, the user, and/or other information.
- FST engine 110 can select one or more contextual FSTs 112 based on the contextual information. For example, FST engine 110 can select a temperature contextual FST based on contextual information relating to a thermostat temperature.
- ASR engine 154 can determine one or more candidate recognitions of the audio data using ASR model 114 .
- ASR model 114 can be a recurrent neural network-transformer (RNN-T) model, and ASR engine can determine a set of candidate recognitions of the audio data using the RNN-T model.
- ASR engine 154 can include a beam search portion to generate the text representation of the audio data based on the candidate recognitions.
- Contextual FST(s) 112 selected by FST engine 110 based on context information determined using context engine 104 can be used to modify the probabilities of candidate recognitions during the beam search.
- the contextual FST(s) 112 can modify the beam search portion via shallow fusion.
- a contextual FST can increase and/or decrease probabilities of candidate recognitions of an alphanumeric sequence indicated by the contextual FST(s).
- FIG. 2 illustrates an example of an alphanumeric grammar FST.
- Alphanumeric grammar FSTs can be directed acyclic graphs such as grapheme based acceptor grammars.
- alphanumeric grammar FST 200 accepts the alphanumeric sequence “200” or “202”.
- the character “_” denotes the start of an alphanumeric sequence.
- alphanumeric grammar FST 200 accepts “_200” and “_202”.
- alphanumeric grammar FST 200 begins with vertex 0 and ends with vertex 4 .
- Alphanumeric grammar FST 200 includes a variety of additional vertices between vertex 0 and vertex 4 including vertex 1 , vertex 2 , and vertex 3 .
- the edge between vertex 0 and vertex 1 represents “_” (i.e., a token indicating the beginning of an alphanumeric sequence).
- the edge between vertex 1 and vertex 2 represents “2”, the first character in the accepted sequence.
- the edge between vertex 2 and vertex 3 represents “0”, the second character in the accepted sequence.
- Two edges are between vertex 3 and vertex 4 .
- the first edge represents the character “2”, the third character in the accepted sequence.
- the second edge represents the character “0”, the third character in the additional sequence accepted by the alphanumeric grammar.
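A minimal way to model the acceptor described above is a transition table keyed by (state, symbol); the sketch below transcribes the vertices and edges of FIG. 2 directly:

```python
# Transition table for alphanumeric grammar FST 200: (state, symbol) -> state.
GRAMMAR_200 = {
    (0, "_"): 1,  # start-of-sequence marker
    (1, "2"): 2,  # first character
    (2, "0"): 3,  # second character
    (3, "2"): 4,  # third character (accepts "202")
    (3, "0"): 4,  # third character (accepts "200")
}
FINAL_STATE = 4

def accepts(symbols: str) -> bool:
    """Walk the FST one grapheme at a time; accept only at the final state."""
    state = 0
    for sym in symbols:
        nxt = GRAMMAR_200.get((state, sym))
        if nxt is None:
            return False
        state = nxt
    return state == FINAL_STATE
```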
- FIG. 3 illustrates an example unweighted wordpiece based acceptor grammar.
- unweighted wordpiece based acceptor grammar 300 can be generated by processing the alphanumeric grammar FST 200 of FIG. 2 using a speller FST.
- the speller FST can map wordpieces to their constituent graphemes.
- Unweighted wordpiece based acceptor grammar 300 begins with vertex 0 and ends with vertex 3 . A variety of additional vertices are between vertex 0 and vertex 3 including vertex 1 and vertex 2 .
- the edge between vertex 0 and vertex 1 represents “_2”; the edge between vertex 1 and vertex 3 represents “00”; the edge between vertex 0 and vertex 2 represents “_20”; the edge between vertex 1 and vertex 2 represents “0”; the first edge between vertex 2 and vertex 3 represents “2”; and the second edge between vertex 2 and vertex 3 represents “0”.
- unweighted wordpiece based acceptor grammar 300 can accept the strings “2 00”, “2 0 0”, “20 0”, “2 0 2”, and “20 2”.
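The wordpiece-level acceptor can be walked the same way, with a transition table transcribing the edges of FIG. 3 (a sketch, not the actual FST representation):

```python
# Transition table for unweighted wordpiece based acceptor grammar 300:
# (state, wordpiece) -> state.
WORDPIECE_300 = {
    (0, "_2"): 1,
    (0, "_20"): 2,
    (1, "00"): 3,
    (1, "0"): 2,
    (2, "2"): 3,
    (2, "0"): 3,
}
FINAL = 3

def accepts_wordpieces(pieces) -> bool:
    """Accept a wordpiece sequence only if it reaches the final vertex."""
    state = 0
    for wp in pieces:
        state = WORDPIECE_300.get((state, wp))
        if state is None:
            return False
    return state == FINAL
```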
- FIG. 4 illustrates an example factored FST.
- factored FST 400 is generated by applying a factor operation to unweighted wordpiece based acceptor grammar 300 .
- Factored FST 400 begins with vertex 0 and ends with either vertex 4 or vertex 5 .
- Factored FST 400 also includes vertex 1 , vertex 2 , and vertex 3 .
- the edge between vertex 0 and vertex 1 represents “_2”.
- the edge between vertex 1 and vertex 3 represents “0”.
- the first edge between vertex 3 and vertex 5 represents “2”.
- the second edge between vertex 3 and vertex 5 represents “0”.
- the edge between vertex 1 and vertex 4 represents “00”.
- the edge between vertex 0 and vertex 2 represents “_20”.
- the first edge between vertex 2 and vertex 4 represents “2”.
- the second edge between vertex 2 and vertex 4 represents “0”.
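The defining property of the factor operation, that every path reaching a given state traverses the same number of arcs, can be checked mechanically. The sketch below transcribes the edges of FIG. 4 and verifies that each state is reached only by paths of a single length:

```python
from collections import defaultdict

# Edges of factored FST 400 as (source, target, label).
EDGES_400 = [
    (0, 1, "_2"), (1, 3, "0"), (3, 5, "2"), (3, 5, "0"),
    (1, 4, "00"), (0, 2, "_20"), (2, 4, "2"), (2, 4, "0"),
]

def path_lengths(edges, start=0):
    """Map each state to the set of arc counts over all paths from start."""
    lengths = defaultdict(set)
    lengths[start].add(0)
    frontier = [(start, 0)]
    while frontier:  # the graph is acyclic, so this terminates
        state, depth = frontier.pop()
        for src, dst, _ in edges:
            if src == state:
                lengths[dst].add(depth + 1)
                frontier.append((dst, depth + 1))
    return lengths

# Every state is reached by paths of exactly one length.
uniform = all(len(s) == 1 for s in path_lengths(EDGES_400).values())
```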
- FIG. 5 illustrates an example of a contextual biasing FST.
- contextual biasing FST 500 has a constant weight of 1 per arc.
- the contextual biasing FST has additional weights per arc (e.g., a constant weight of 2 per arc, a constant weight of 10 per arc, and/or an additional constant weight per arc).
- Contextual biasing FST 500 can be generated by applying a constant weight per arc to factored FST 400 , along with failure transitions whose negative weights cancel the cumulative weight accumulated up to a given state.
- the weights in contextual FST 500 can be used to modify probabilities during a beam search portion of automatic speech recognition.
- Contextual biasing FST 500 begins at vertex 0 and ends at either vertex 4 or vertex 5 . Additional vertices in contextual biasing FST 500 include vertex 1 , vertex 2 , and vertex 3 .
- an edge between vertex 0 and vertex 1 represents “_2/1”, where _2 represents the starting character and the first digit of the accepted number, and 1 represents the weight.
- the edge between vertex 1 and vertex 3 represents “0/1”, where 0 is the second digit of the accepted number and 1 represents the weight.
- the first edge between vertex 3 and vertex 5 represents “2/1”, where 2 is the third digit of the accepted number and 1 represents the weight.
- the second edge between vertex 3 and vertex 5 represents “0/1”, where 0 is the third digit of the accepted number and 1 represents the weight.
- the edge between vertex 1 and vertex 4 represents “00/1”, where 00 are the second and third digits of the accepted number and 1 is the weight.
- the edge between vertex 0 and vertex 2 represents “_20/1”, where _ is the starting character, 20 is the first and second digits of the accepted number, and 1 is the weight.
- the first edge between vertex 2 and vertex 4 represents “2/1” where 2 is the third digit of the accepted number and 1 is the weight.
- the second edge between vertex 2 and vertex 4 represents “0/1” where 0 is the third digit of the accepted number and 1 is the weight.
- the edge between vertex 1 and vertex 0 represents “ ⁇ epsilon>/ ⁇ 1” indicating a failure transition of ⁇ 1.
- the edge between vertex 3 and vertex 0 represents “ ⁇ epsilon>/ ⁇ 2” indicating a failure transition of ⁇ 2.
- the edge between vertex 2 and vertex 0 represents “ ⁇ epsilon>/ ⁇ 1” indicating a failure transition of ⁇ 1.
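Scoring against such a biasing FST can be sketched as follows: each matched arc adds its weight, and a mismatch takes the failure transition back to the start state, cancelling the weight accumulated so far. The arc table transcribes FIG. 5; the traversal logic is an illustrative simplification:

```python
# Arcs of contextual biasing FST 500: (state, wordpiece) -> (next_state, weight).
ARCS_500 = {
    (0, "_2"): (1, 1.0), (0, "_20"): (2, 1.0),
    (1, "0"): (3, 1.0), (1, "00"): (4, 1.0),
    (3, "2"): (5, 1.0), (3, "0"): (5, 1.0),
    (2, "2"): (4, 1.0), (2, "0"): (4, 1.0),
}
# Failure transitions return to the start state while subtracting the weight
# accumulated so far, so abandoned partial matches confer no net bonus.
FAILURE = {1: -1.0, 2: -1.0, 3: -2.0}

def bias_score(pieces) -> float:
    """Total biasing weight a wordpiece sequence collects from FST 500."""
    state, total = 0, 0.0
    for wp in pieces:
        if (state, wp) not in ARCS_500 and state != 0:
            total += FAILURE.get(state, 0.0)  # take the failure transition
            state = 0
        if (state, wp) in ARCS_500:
            state, weight = ARCS_500[(state, wp)]
            total += weight
    return total
```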
- FIG. 6 is a flowchart illustrating a process 600 of generating a contextual biasing FST according to implementations disclosed herein.
- This system may include various components of various computer systems, such as one or more components of client device 152 of FIG. 1 B .
- While operations of process 600 are shown in a particular order, this is not meant to be limiting.
- One or more operations may be reordered, omitted, and/or added.
- the system selects an alphanumeric grammar FST.
- the alphanumeric grammar FST is an acyclic character based FST.
- alphanumeric grammar FST 200 of FIG. 2 accepts the sequences “200” and “202”.
- the system selects a speller FST, where the speller FST maps wordpieces to their constituent graphemes.
- the system generates an unweighted wordpiece based acceptor grammar FST based on the alphanumeric grammar FST and the speller FST.
- the system generates the unweighted wordpiece based acceptor grammar FST based on the alphanumeric grammar FST selected at block 602 and the speller FST selected at block 604 .
- unweighted wordpiece based acceptor grammar FST 300 of FIG. 3 can accept the strings “2 00”, “2 0 0”, “2 0 2”, “20 2”, and “20 0”.
- the system generates a factored FST by processing the unweighted wordpiece based acceptor grammar using a factor operation.
- any path leading to a given state in a factored FST traverses the same number of arcs.
- Factored FST 400 of FIG. 4 is an example of a factored FST.
- the system can then generate a contextual biasing FST by applying a constant weight per arc to the factored FST. Contextual biasing FST 500 of FIG. 5 is an example contextual biasing FST.
- FIG. 7 is a flowchart illustrating a process 700 of generating a text representation of a spoken utterance using a contextual biasing FST according to implementations disclosed herein.
- This system may include various components of various computer systems, such as one or more components of client device 152 of FIG. 1 B .
- While operations of process 700 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted, and/or added.
- the system receives audio data capturing a spoken utterance including an alphanumeric sequence.
- the audio data is captured using a microphone of a client device.
- the system can receive audio data capturing the spoken utterance of “tell me about my flight to DAY”, where DAY is an airport code alphanumeric sequence indicating the Dayton International Airport.
- the system can receive audio data capturing the spoken utterance of “when will my package with the tracking number of ABC-123 arrive?”.
- the system determines contextual information for the alphanumeric sequence.
- the system can determine the contextual information based on the spoken utterance. For example, the system can determine contextual information indicating a tracking number for the spoken utterance “when will my package with the tracking number of ABC-123 arrive”.
- the system can determine contextual information based on additional information including one or more previous turns in a dialog between a user and a computing system that precede the spoken utterance; information about the user stored in a user profile; information about the computing system; information about one or more networked hardware devices (e.g., a networked smart thermostat, a networked smart oven, a networked camera, etc.); and/or additional information relating to the context of the utterance.
- a dialog between a user and a computing system can include: user—“when will my package arrive”; computing system—“what is the tracking number of the package”; user—“it is ABC-123”.
- a user profile of the user can include a calendar entry relating to a flight from San Francisco International Airport to Los Angeles International Airport which can be used to determine contextual information relating to a flight.
- contextual information relating to a thermostat temperature can be determined based on a smart thermostat associated with the system.
- the system selects one or more contextual biasing FSTs corresponding to the contextual information. For example, the system can select an airport code contextual biasing FST based on contextual information indicating a flight. In some implementations, the system can select multiple contextual FSTs corresponding to the contextual information. For example, the system can select a Fahrenheit temperature contextual biasing FST and a Celsius temperature contextual biasing FST corresponding to the contextual information indicating smart thermostat temperature. In some implementations, the one or more contextual biasing FSTs can be generated in accordance with process 600 of FIG. 6 . In some implementations, the system can select the contextual biasing FST(s) from a predetermined set of contextual FSTs stored locally at a client device.
- the system can select the contextual biasing FST(s) from a predetermined set of contextual FSTs stored remotely from the client device on a server.
- the system can select the contextual biasing FST(s) that are generated on the fly by the system, such as a contextual biasing FST generated on the fly based on a client provided numeric grammar.
- the system generates a set of candidate recognitions of the spoken utterance including the alphanumeric sequence by processing the spoken utterance using an ASR model portion of an ASR engine.
- the system can generate a set of candidate recognitions of the spoken utterance by processing the audio data received at block 702 using a RNN-T model.
- the system generates a text representation of the spoken utterance based on the set of candidate recognitions and the one or more selected contextual FSTs.
- the system can generate the text representation of the spoken utterance based on the set of candidate recognitions generated at block 708 and the one or more contextual FSTs selected at block 706 .
- a beam search portion of the ASR system can be modified via shallow fusion using the contextual FSTs to change the probabilities of candidate recognitions generated using the ASR model at block 708 .
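One way to picture shallow fusion at this step is as a rescoring of the beam: each hypothesis's model log-probability is interpolated with a contextual FST score scaled by a fusion weight. The sketch below uses hypothetical scores and a hand-picked fusion weight, not values from the actual system:

```python
import math

# Shallow fusion: interpolate the ASR model's log-probability with the
# contextual FST score at each beam-search step. The fusion weight is a
# tuning knob, chosen here purely for illustration.
FUSION_WEIGHT = 0.5

def rescore_beam(hypotheses, fst_score, lam=FUSION_WEIGHT):
    """hypotheses: list of (token_sequence, log_prob) pairs, best first."""
    return sorted(
        hypotheses,
        key=lambda h: h[1] + lam * fst_score(h[0]),
        reverse=True,
    )

# Hypothetical beam after decoding "... flight to <?>".
beam = [
    (("flight", "today"), math.log(0.55)),
    (("flight", "to", "DAY"), math.log(0.45)),
]
# Toy contextual score: +2 for hypotheses containing the airport code.
airport_bias = lambda toks: 2.0 if "DAY" in toks else 0.0
best = rescore_beam(beam, airport_bias)[0][0]
```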
- FIG. 8 illustrates an example environment in which implementations disclosed herein can be implemented.
- FIG. 8 includes a client computing device 802 , which executes an instance of an automated assistant client 804 .
- One or more cloud-based automated assistant components 810 can be implemented on one or more computing systems (collectively referred to as a “cloud” computing system) that are communicatively coupled to client device 802 via one or more local and/or wide area networks (e.g., the Internet) indicated generally at 808 .
- An instance of an automated assistant client 804 by way of its interactions with one or more cloud-based automated assistant components 810 , may form what appears to be, from the user's perspective, a logical instance of an automated assistant 800 with which the user may engage in a human-to-computer dialog. It thus should be understood that in some implementations, a user that engages with an automated assistant client 804 executing on client device 802 may, in effect, engage with his or her own logical instance of an automated assistant 800 .
- An automated assistant, as used herein, “serving” a particular user will often refer to the combination of an automated assistant client 804 executing on a client device 802 operated by the user and one or more cloud-based automated assistant components 810 (which may be shared amongst multiple automated assistant clients of multiple client computing devices). It should also be understood that in some implementations, automated assistant 800 may respond to a request from any user regardless of whether the user is actually “served” by that particular instance of automated assistant 800 .
- the client computing device 802 may be, for example: a desktop computing device, a laptop computing device, a tablet computing device, a mobile smartphone computing device, a standalone interactive speaker, a smart appliance, and/or a wearable apparatus of the user that includes a computing device (e.g., a watch of the user having a computing device, glasses of the user having a computing device, a virtual or augmented reality computing device). Additional and/or alternative client computing devices may be provided. Additionally or alternatively, operations of client computing device 802 may be distributed between multiple computing devices. For example, one or more operations of client computing device 802 may be distributed between a mobile smartphone and a vehicle computing device. Furthermore, operations of client computing device 802 may be repeated between multiple computing devices (which in some cases may be communicatively coupled).
- a mobile smartphone as well as a vehicle interface device may each implement operations of automated assistant 800 , such as a mobile smartphone and a vehicle interface device both including an invocation engine (described below).
- the client computing device 802 may optionally operate one or more other applications in addition to automated assistant client 804 , such as a message exchange client (e.g., SMS, MMS, online chat), a browser, and so forth.
- one or more of the other applications can optionally interface (e.g., via an application programming interface) with the automated assistant client 804 , or include their own instance of an automated assistant application (that may also interface with the cloud-based automated assistant component(s) 810 ).
- Automated assistant 800 engages in human-to-computer dialog sessions with a user via user interface input and output devices of the client device (not pictured). To preserve user privacy and/or to conserve resources, in many situations a user must often explicitly invoke the automated assistant 800 before the automated assistant will fully process a spoken utterance.
- the explicit invocation of the automated assistant 800 can occur in response to certain user interface input received at the client device 802 .
- user interface inputs that can invoke the automated assistant 800 via the client device 802 can optionally include actuations of a hardware and/or virtual button of the client device 802 .
- the automated assistant client can include one or more local engines 806 , such as an invocation engine that is operable to detect the presence of one or more spoken invocation phrases.
- the invocation engine can invoke the automated assistant 800 in response to detection of one or more of the spoken invocation phrases.
- the invocation engine can invoke the automated assistant 800 in response to detecting a spoken invocation phrase such as “Hey Assistant”, “OK Assistant”, and/or “Assistant”.
- the invocation engine can continuously process (e.g., if not in an “inactive” mode) a stream of audio data frames that are based on output from one or more microphones of the client device 802 , to monitor for an occurrence of a spoken invocation phrase. While monitoring for the occurrence of the spoken invocation phrase, the invocation engine discards (e.g., after temporary storage in a buffer) any audio data frames that do not include the spoken invocation phrase.
- the invocation engine can invoke the automated assistant 800 .
- “invoking” the automated assistant 800 can include causing one or more previously inactive functions of the automated assistant 800 to be activated.
- invoking the automated assistant 800 can include causing one or more local engines 806 and/or cloud-based automated assistant components 810 to further process audio data frames based on which the invocation phrase was detected, and/or one or more following audio data frames (whereas prior to invoking no further processing of audio data frames was occurring).
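The buffering behavior described above can be sketched with a bounded deque: frames older than the buffer capacity are discarded automatically until a hotword detector fires, at which point the retained frames are handed downstream. The buffer size and the detector predicate are illustrative assumptions:

```python
from collections import deque

# Sketch of an invocation monitor. Only a short rolling buffer of recent
# audio frames is retained; older frames fall out of the deque until the
# (stand-in) hotword detector fires.
BUFFER_FRAMES = 50  # assumed buffer capacity

def monitor(frames, detect_hotword):
    buffer = deque(maxlen=BUFFER_FRAMES)  # old frames are evicted automatically
    for frame in frames:
        buffer.append(frame)
        if detect_hotword(buffer):
            return list(buffer)  # frames forwarded for full processing
    return None  # never invoked; nothing retained beyond the buffer
```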
- the one or more local engine(s) 806 of automated assistant 804 are optional, and can include, for example, the invocation engine described above, a local speech-to-text (“STT”) engine (that converts captured audio to text), a local text-to-speech (“TTS”) engine (that converts text to speech), a local natural language processor (that determines semantic meaning of audio and/or text converted from audio), and/or other local components.
- because the client device 802 is relatively constrained in terms of computing resources (e.g., processor cycles, memory, battery, etc.), the local engines 806 may have limited functionality relative to any counterparts that are included in cloud-based automated assistant components 810 .
- Automated assistant client 804 can additionally include a context engine and/or FST engine (not depicted).
- the context engine, such as context engine 104 of FIG. 1 A and FIG. 1 B , can be utilized by automated assistant client 804 in determining contextual information based on audio data and/or additional information related to the audio data.
- the FST engine such as FST engine 110 of FIG. 1 A and FIG. 1 B , can be utilized by automated assistant client 804 in selecting one or more contextual biasing FSTs.
- STT engine 814 can utilize the one or more selected contextual biasing FSTs in generating a text representation of the audio data, such as by modifying a beam search with the contextual biasing FST(s) via shallow fusion.
- Cloud-based automated assistant components 810 leverage the virtually limitless resources of the cloud to perform more robust and/or more accurate processing of audio data, and/or other user interface input, relative to any counterparts of the local engine(s) 806 .
- the client device 802 can provide audio data and/or other data to the cloud-based automated assistant components 810 in response to the invocation engine detecting a spoken invocation phrase, or detecting some other explicit invocation of the automated assistant 800 .
- the illustrated cloud-based automated assistant components 810 include a cloud-based TTS module 812 , a cloud-based STT module 814 , and a natural language processor 816 .
- one or more of the engines and/or modules of automated assistant 800 may be omitted, combined, and/or implemented in a component that is separate from automated assistant 800 .
- automated assistant 800 can include additional and/or alternative engines and/or modules.
- Cloud-based STT module 814 can convert audio data into text, which may then be provided to natural language processor 816 .
- the cloud-based STT module 814 can convert audio data into text based at least in part on indications of speaker labels and assignments that are provided by an assignment engine (not illustrated).
- Cloud-based TTS module 812 can convert textual data (e.g., natural language responses formulated by automated assistant 800 ) into computer-generated speech output.
- TTS module 812 may provide the computer-generated speech output to client device 802 to be output directly, e.g., using one or more speakers.
- textual data (e.g., natural language responses) generated by automated assistant 800 may be provided to one of the local engine(s) 806 , which may then convert the textual data into computer-generated speech that is output locally.
- Natural language processor 816 of automated assistant 800 processes free form natural language input and generates, based on the natural language input, annotated output for use by one or more other components of the automated assistant 800 .
- the natural language processor 816 can process natural language free-form input that is textual input that is a conversion, by STT module 814 , of audio data provided by a user via client device 802 .
- the generated annotated output may include one or more annotations of the natural language input and optionally one or more (e.g., all) of the terms of the natural language input.
- the natural language processor 816 is configured to identify and annotate various types of grammatical information in natural language input.
- the natural language processor 816 may include a part of speech tagger (not depicted) configured to annotate terms with their grammatical roles. Also, for example, in some implementations the natural language processor 816 may additionally and/or alternatively include a dependency parser (not depicted) configured to determine syntactic relationships between terms in natural language input.
- the natural language processor 816 may additionally and/or alternatively include an entity tagger (not depicted) configured to annotate entity references in one or more samples such as references to people (including, for instance, literary characters, celebrities, public figures, etc.), organizations, locations (real and imaginary), and so forth.
- entity tagger of the natural language processor 816 may annotate references to an entity at a high level of granularity (e.g., to enable identification of all references to an entity class such as people) and/or a lower level of granularity (e.g., to enable identification of all references to a particular entity such as a particular person).
- the entity tagger may rely on content of the natural language input to resolve a particular entity and/or may optionally communicate with a knowledge graph or other entity database to resolve a particular entity.
- the natural language processor 816 may additionally and/or alternatively include a coreference resolver (not depicted) configured to group, or “cluster,” references to the same entity based on one or more contextual cues.
- the coreference resolver may be utilized to resolve the term “there” to “Hypothetical Café” in the natural language input “I liked Hypothetical Café last time we ate there.”
- one or more components of the natural language processor 816 may rely on annotations from one or more other components of the natural language processor 816 .
- the named entity tagger may rely on annotations from the coreference resolver and/or dependency parser in annotating all mentions of a particular entity.
- the coreference resolver may rely on annotations from the dependency parser in clustering references to the same entity.
- one or more components of the natural language processor 816 may use related prior input and/or other related data outside of the particular natural language input to determine one or more annotations.
- FIG. 9 is a block diagram of an example computing device 910 that may optionally be utilized to perform one or more aspects of techniques described herein.
- one or more of a client computing device, and/or other component(s) may comprise one or more components of the example computing device 910 .
- Computing device 910 typically includes at least one processor 914 which communicates with a number of peripheral devices via bus subsystem 912 .
- peripheral devices may include a storage subsystem 924 , including, for example, a memory subsystem 925 and a file storage subsystem 926 , user interface output devices 920 , user interface input devices 922 , and a network interface subsystem 916 .
- the input and output devices allow user interaction with computing device 910 .
- Network interface subsystem 916 provides an interface to outside networks and is coupled to corresponding interface devices in other computing devices.
- User interface input devices 922 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices.
- use of the term “input device” is intended to include all possible types of devices and ways to input information into computing device 910 or onto a communication network.
- User interface output devices 920 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
- the display subsystem may include a cathode ray tube (“CRT”), a flat-panel device such as a liquid crystal display (“LCD”), a projection device, or some other mechanism for creating a visible image.
- the display subsystem may also provide non-visual display such as via audio output devices.
- use of the term "output device" is intended to include all possible types of devices and ways to output information from computing device 910 to the user or to another machine or computing device.
- Storage subsystem 924 stores programming and data constructs that provide the functionality of some or all of the modules described herein.
- the storage subsystem 924 may include the logic to perform selected aspects of one or more of the processes of FIG. 8 , as well as to implement various components depicted in FIG. 1 B .
- Memory 925 used in the storage subsystem 924 can include a number of memories including a main random access memory (“RAM”) 930 for storage of instructions and data during program execution and a read only memory (“ROM”) 932 in which fixed instructions are stored.
- a file storage subsystem 926 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
- the modules implementing the functionality of certain implementations may be stored by file storage subsystem 926 in the storage subsystem 924 , or in other machines accessible by the processor(s) 914 .
- Bus subsystem 912 provides a mechanism for letting the various components and subsystems of computing device 910 communicate with each other as intended. Although bus subsystem 912 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.
- Computing device 910 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 910 depicted in FIG. 9 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of computing device 910 are possible having more or fewer components than the computing device depicted in FIG. 9 .
- the systems described herein may collect personal information about users (or, as often referred to herein, "participants"), or may make use of personal information
- the users may be provided with an opportunity to control whether programs or features collect user information (e.g., information about a user's social network, social actions or activities, profession, a user's preferences, or a user's current geographic location), or to control whether and/or how to receive content from the content server that may be more relevant to the user.
- certain data may be treated in one or more ways before it is stored or used, so that personal identifiable information is removed.
- a user's identity may be treated so that no personal identifiable information can be determined for the user, or a user's geographic location may be generalized where geographic location information is obtained (such as to a city, ZIP code, or state level), so that a particular geographic location of a user cannot be determined.
- the user may have control over how information is collected about the user and/or used.
- a method implemented by one or more processors includes generating a text representation of audio data capturing a spoken utterance, including an alphanumeric sequence, using an automatic speech recognition (“ASR”) engine.
- generating the text representation of the audio data capturing the spoken utterance, including the alphanumeric sequence, using the ASR engine includes determining contextual information for the alphanumeric sequence.
- the method includes selecting, based on the contextual information, one or more contextual finite state transducers for the alphanumeric sequence.
- the method includes generating a set of candidate recognitions of the spoken utterance based on processing the audio data using an ASR model portion of the ASR engine.
- the method includes generating the text representation of the spoken utterance, wherein the text representation includes the alphanumeric sequence, and wherein generating the text representation is based on the generated set of candidate recognitions and the one or more selected contextual finite state transducers.
- the ASR model is a recurrent neural network transducer (RNN-T) model, and the ASR engine further comprises a beam search portion.
- generating the text representation of the spoken utterance based on the generated set of candidate recognitions and the one or more selected contextual finite state transducers includes modifying the beam search portion of the ASR engine using the one or more contextual finite state transducers.
- modifying the beam search portion of the ASR engine using the one or more contextual finite state transducers includes modifying, via shallow fusion, the beam search portion of the ASR engine using the one or more contextual finite state transducers.
- generating the text representation of the spoken utterance based on the generated set of candidate recognitions and the one or more selected contextual finite state transducers further includes determining a corresponding probability measure for each candidate recognition in the set of candidate recognitions. In some implementations, the method further includes modifying the corresponding probability measures using the beam search portion of the ASR engine modified using the one or more contextual finite state transducers. In some implementations, the method further includes selecting a candidate recognition, from the set of candidate recognitions, based on determining that the corresponding probability measure for the candidate recognition satisfies one or more conditions. In some implementations, the method further includes generating the text representation of the spoken utterance based on the selected candidate recognition.
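The candidate-selection step above can be sketched in code. This is a hypothetical, simplified stand-in, not the patent's actual implementation: candidate texts, the `bias_weight` value, and the cost convention are all illustrative, with the contextual FST reduced to a precomputed per-candidate cost.

```python
# Hypothetical sketch of selecting a candidate recognition after
# contextual biasing. Each candidate carries a base ASR log-probability
# and a cost from a contextual FST (negative cost = bias toward the
# candidate); the combined score follows a shallow-fusion-style
# interpolation.

def rescore_candidates(candidates, bias_weight=0.5):
    """candidates: list of (text, asr_log_prob, contextual_fst_cost)."""
    rescored = []
    for text, asr_lp, ctx_cost in candidates:
        # Subtract the weighted contextual cost, so grammar-matching
        # alphanumeric hypotheses (negative cost) are boosted.
        rescored.append((text, asr_lp - bias_weight * ctx_cost))
    # Select the candidate whose corresponding score satisfies the
    # condition of being maximal among the set.
    return max(rescored, key=lambda pair: pair[1])

best = rescore_candidates([
    ("AB12 CD", -2.0, -1.0),        # matches the alphanumeric grammar
    ("A B one two CD", -1.8, 4.0),  # off-grammar, penalized
])
```

Here the grammar-matching alphanumeric hypothesis wins even though its base ASR score is lower, which is the intended effect of the biasing.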
- the audio data is captured via one or more microphones of a client device.
- determining the contextual information for the alphanumeric sequence is based on the audio data capturing the spoken utterance, and determining the contextual information for the alphanumeric sequence based on the audio data includes generating the contextual information for the alphanumeric sequence based on one or more recognized terms of the audio data, the one or more recognized terms being in addition to the alphanumeric sequence.
- determining the contextual information for the alphanumeric sequence is based on a rendered system prompt that immediately preceded the spoken utterance, and determining the contextual information based on the rendered system prompt includes determining the contextual information based on at least one predicted response to the rendered system prompt.
- the method further includes generating at least a given contextual finite state transducer, of the one or more contextual finite state transducers.
- generating the given contextual finite state transducer includes selecting an alphanumeric grammar finite state transducer corresponding to an alphanumeric sequence.
- the method further includes selecting a speller finite state transducer which maps wordpieces to constituent graphemes.
- the method further includes generating an unweighted wordpiece based acceptor grammar based on the alphanumeric grammar finite state transducer and the speller finite state transducer.
- generating the given contextual finite state transducer further includes generating a factored finite state transducer based on processing the unweighted wordpiece based acceptor grammar using a factor operation. In some implementations, the method further includes generating the given contextual finite state transducer based on applying a constant weight to each arc in the unweighted wordpiece based acceptor grammar.
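The speller step of the construction above can be illustrated with a minimal sketch. A real system would express this as a finite state transducer in an FST library (e.g., OpenFst) and compose it with the grammar; the dict-based stand-in below, including the `"_"` word-boundary marker and the sample vocabulary, is an assumption for illustration only.

```python
# Hypothetical stand-in for the speller step: the speller FST maps each
# wordpiece in the recognizer's vocabulary to its constituent graphemes,
# letting a grapheme-level alphanumeric grammar be matched against
# wordpiece-level hypotheses.

def build_speller(wordpieces):
    # Strip a leading word-boundary marker (shown here as "_") and
    # expand the remainder into its individual graphemes.
    return {wp: tuple(wp.lstrip("_")) for wp in wordpieces}

speller = build_speller(["_AB", "12", "_C", "D"])
# e.g., the wordpiece "_AB" spells out to the graphemes ("A", "B").
```

The point of the mapping is that an alphanumeric grammar defined over single letters and digits can still constrain hypotheses whose units are multi-character wordpieces.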
- the alphanumeric sequence includes at least one number and includes at least one letter.
- the ASR model portion of the ASR engine is an end-to-end speech recognition model.
- the ASR engine is trained using a set of training instances, and wherein the alphanumeric sequence is not in the set of training instances.
- the ASR engine is trained using a set of training instances, and wherein the alphanumeric sequence occurs a number of times, in the set of training instances, that is below a threshold value.
- a method implemented by one or more processors includes generating a contextual finite state transducer for use in modifying one or more probabilities of candidate recognitions of an alphanumeric sequence of a spoken utterance during automatic speech recognition.
- generating the contextual finite state transducer includes selecting an alphanumeric grammar finite state transducer corresponding to the alphanumeric sequence.
- the method includes selecting a speller finite state transducer which maps wordpieces to constituent graphemes.
- the method includes generating an unweighted wordpiece based acceptor grammar based on the alphanumeric grammar finite state transducer and the speller finite state transducer.
- the method includes generating a factored finite state transducer based on processing the unweighted wordpiece based acceptor grammar using a factor operation. In some implementations, the method includes generating the contextual finite state transducer based on applying a constant weight to each arc in the unweighted wordpiece based acceptor grammar.
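The constant-weight step at the end of this construction can be sketched as follows. Arcs are modeled as plain tuples and the weight value is an arbitrary illustrative choice; an FST library would carry weights on its arc objects instead.

```python
# Hypothetical sketch of the final weighting step: after the factor
# operation, the same constant weight is attached to every arc of the
# unweighted wordpiece-based acceptor grammar, so each matched step
# contributes an equal bias during decoding.

def apply_constant_weight(arcs, weight=-0.8):
    # arcs: iterable of (src_state, input_label, dst_state) tuples from
    # the unweighted acceptor; returns arcs annotated with the weight.
    return [(src, label, dst, weight) for src, label, dst in arcs]

weighted = apply_constant_weight([(0, "A", 1), (1, "B", 2), (2, "1", 3)])
```

Using one constant weight per arc means the cumulative bias a hypothesis receives grows with how much of the grammar it matches.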
- some implementations include one or more processors (e.g., central processing unit(s) (CPU(s)), graphics processing unit(s) (GPU(s)), and/or tensor processing unit(s) (TPU(s))) of one or more computing devices, where the one or more processors are operable to execute instructions stored in associated memory, and where the instructions are configured to cause performance of any of the methods described herein.
- Some implementations also include one or more transitory or non-transitory computer readable storage media storing computer instructions executable by one or more processors to perform any of the methods described herein.
Abstract
Description
U = min(det(project(S ∘ G))) (1)
s(y) = log p(y|x) − α log p_C(y) (2)
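Equation (2) can be read as code. The sketch below is illustrative only: the function and parameter names are assumptions, and the sign convention follows the equation as written here, with p_C supplying the contextual term that is scaled by the biasing weight α.

```python
import math

# Hypothetical reading of Eq. (2): the beam-search score s(y) combines
# the base ASR log-probability log p(y|x) with a contextual term
# alpha * log p_C(y) supplied by the biasing model.

def biased_score(p_y_given_x, p_c_y, alpha):
    # s(y) = log p(y|x) - alpha * log p_C(y)
    return math.log(p_y_given_x) - alpha * math.log(p_c_y)
```

For example, with p(y|x) = 0.5, p_C(y) = 0.25 and α = 0.5, the two terms cancel and s(y) = 0.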
Claims (19)
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2020/014141 WO2021145893A1 (en) | 2020-01-17 | 2020-01-17 | Alphanumeric sequence biasing for automatic speech recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
US20220013126A1 US20220013126A1 (en) | 2022-01-13 |
US11942091B2 true US11942091B2 (en) | 2024-03-26 |
Family
ID=69724068
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/251,465 Active 2041-08-26 US11942091B2 (en) | 2020-01-17 | 2020-01-17 | Alphanumeric sequence biasing for automatic speech recognition using a grammar and a speller finite state transducer |
Country Status (6)
Country | Link |
---|---|
US (1) | US11942091B2 (en) |
EP (1) | EP4073789B1 (en) |
JP (1) | JP7400112B2 (en) |
KR (1) | KR20220128397A (en) |
CN (1) | CN114981885A (en) |
WO (1) | WO2021145893A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11455467B2 (en) * | 2020-01-30 | 2022-09-27 | Tencent America LLC | Relation extraction using full dependency forests |
US11893983B2 (en) * | 2021-06-23 | 2024-02-06 | International Business Machines Corporation | Adding words to a prefix tree for improving speech recognition |
US11880645B2 (en) | 2022-06-15 | 2024-01-23 | T-Mobile Usa, Inc. | Generating encoded text based on spoken utterances using machine learning systems and methods |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000267691A (en) | 1999-03-19 | 2000-09-29 | Meidensha Corp | Recognition dictionary selecting method in voice recognition system |
JP2010066493A (en) | 2008-09-10 | 2010-03-25 | Denso Corp | Code recognition apparatus and route retrieval device |
US8972243B1 (en) | 2012-11-20 | 2015-03-03 | Amazon Technologies, Inc. | Parse information encoding in a finite state transducer |
US20150194149A1 (en) | 2014-01-08 | 2015-07-09 | Genesys Telecommunications Laboratories, Inc. | Generalized phrases in automatic speech recognition systems |
JP2017219769A (en) | 2016-06-09 | 2017-12-14 | 国立研究開発法人情報通信研究機構 | Voice recognition device and computer program |
US9886946B2 (en) | 2015-03-30 | 2018-02-06 | Google Llc | Language model biasing modulation |
US9971765B2 (en) | 2014-05-13 | 2018-05-15 | Nuance Communications, Inc. | Revising language model scores based on semantic class hypotheses |
US10176802B1 (en) | 2016-03-21 | 2019-01-08 | Amazon Technologies, Inc. | Lattice encoding using recurrent neural networks |
JP2019020597A (en) | 2017-07-18 | 2019-02-07 | 日本放送協会 | End-to-end japanese voice recognition model learning device and program |
US20200349923A1 (en) * | 2019-05-03 | 2020-11-05 | Google Llc | Phoneme-based contextualization for cross-lingual speech recognition in end-to-end models |
US20200357388A1 (en) * | 2019-05-10 | 2020-11-12 | Google Llc | Using Context Information With End-to-End Models for Speech Recognition |
US20210035566A1 (en) * | 2019-08-02 | 2021-02-04 | International Business Machines Corporation | Domain specific correction of output from automatic speech recognition |
US11232799B1 (en) * | 2018-10-31 | 2022-01-25 | Amazon Technologies, Inc. | Speech recognition routing in a provider network |
-
2020
- 2020-01-17 KR KR1020227027865A patent/KR20220128397A/en unknown
- 2020-01-17 JP JP2022543558A patent/JP7400112B2/en active Active
- 2020-01-17 WO PCT/US2020/014141 patent/WO2021145893A1/en unknown
- 2020-01-17 US US17/251,465 patent/US11942091B2/en active Active
- 2020-01-17 CN CN202080093228.3A patent/CN114981885A/en active Pending
- 2020-01-17 EP EP20707891.6A patent/EP4073789B1/en active Active
Non-Patent Citations (5)
Title |
---|
European Patent Office; Intention to Grant issued in Application No. 20707891.6; 55 pages; dated May 19, 2023. |
European Patent Office; International Search Report and Written Opinion of PCT Ser. No. PCT/US2020/014141; 11 pages; dated Oct. 15, 2020. |
Intellectual Property India, First Examination Report issued in Application 202227038620; 7 pages; dated Oct. 20, 2022. |
Serrino, J. et al., "Contextual Recovery of Out-of-Lattice Named Entities in Automatic Speech Recognition;" Interspeech 2019; 5 pages; Sep. 15, 2019. |
Williams, I. et al., "Contextual Speech Recognition in End-to-End Neural Network Systems Using Beam Search;" Proceedings of Interspeech 2018; 5 pages; Sep. 2, 2018. |
Also Published As
Publication number | Publication date |
---|---|
US20220013126A1 (en) | 2022-01-13 |
JP7400112B2 (en) | 2023-12-18 |
EP4073789B1 (en) | 2023-11-08 |
KR20220128397A (en) | 2022-09-20 |
EP4073789A1 (en) | 2022-10-19 |
JP2023511091A (en) | 2023-03-16 |
CN114981885A (en) | 2022-08-30 |
WO2021145893A1 (en) | 2021-07-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11797772B2 (en) | Word lattice augmentation for automatic speech recognition | |
US11393476B2 (en) | Automatically determining language for speech recognition of spoken utterance received via an automated assistant interface | |
US11817080B2 (en) | Using corrections, of predicted textual segments of spoken utterances, for training of on-device speech recognition model | |
US10984784B2 (en) | Facilitating end-to-end communications with automated assistants in multiple languages | |
JP7104247B2 (en) | On-device speech synthesis of text segments for training on-device speech recognition models | |
US11942091B2 (en) | Alphanumeric sequence biasing for automatic speech recognition using a grammar and a speller finite state transducer | |
US20220415305A1 (en) | Speech generation using crosslingual phoneme mapping | |
US20220284049A1 (en) | Natural language understanding clarifications | |
US20230419964A1 (en) | Resolving unique personal identifiers during corresponding conversations between a voice bot and a human | |
US20230252995A1 (en) | Altering a candidate text representation, of spoken input, based on further spoken input | |
US20240112673A1 (en) | Identifying and correcting automatic speech recognition (asr) misrecognitions in a decentralized manner |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
AS | Assignment |
Owner name: GOOGLE LLC, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HAYNOR, BENJAMIN;ALEKSIC, PETAR;REEL/FRAME:054665/0134 Effective date: 20200116 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: AWAITING TC RESP., ISSUE FEE NOT PAID |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |